VRIJE UNIVERSITEIT BRUSSEL
Faculteit Geneeskunde en Farmacie
Laboratorium voor Farmaceutische en Biomedische Analyse

NEW TRENDS IN MULTIVARIATE ANALYSIS AND CALIBRATION

Frédéric ESTIENNE

Thesis presented to fulfil the requirements for the degree of Doctor in Pharmaceutical Sciences
Academic year: 2002/2003
Promotor: Prof. Dr. D.L. MASSART

ACKNOWLEDGMENTS

First of all, I would like to thank Professor Massart for allowing me to spend these almost four years in his team. The knowledge I acquired, the experience I gained, and most probably the reputation of this training gave a new and by far better start to my professional life. For the rest, the list of people I have to thank would be too long to be printed here, not to mention that I might accidentally omit someone. So I will play it safe and simply thank everyone I enjoyed studying, working, having fun and gossiping (etc.) with during all these years. Thank you all!

TABLE OF CONTENTS

ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
INTRODUCTION
I. MULTIVARIATE ANALYSIS AND CALIBRATION
   "Chemometrics and modelling"
II. COMPARISON OF MULTIVARIATE CALIBRATION METHODS
   "A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part II: Predictive Ability under Extrapolation Conditions"
   "A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part III: Robustness Against Instrumental Perturbation Conditions"
   "The Development of Calibration Models for Spectroscopic Data using Multiple Linear Regression"
III. NEW TYPES OF DATA: NATURE OF THE DATA SET
   "Multivariate calibration with Raman spectroscopic data: a case study"
   "Inverse Multivariate Calibration Applied to Eluxyl Raman Data"
IV. NEW TYPES OF DATA: STRUCTURE AND SIZE
   "Multivariate calibration with Raman data using fast PCR and PLS methods"
   "Multi-Way Modelling of High-Dimensionality Electro-Encephalographic Data"
   "Robust Version of the Tucker3 Model"
CONCLUSION
PUBLICATION LIST

LIST OF ABBREVIATIONS

ADPF  Adaptive-degree polynomial filter
AES  Atomic emission spectroscopy
ALS  Alternating least squares
ANOVA  Analysis of variance
ASTM  American Society for Testing and Materials
CANDECOMP  Canonical decomposition
CCD  Charge-coupled device
CV  Cross-validation
DTR  De-trending
EEG  Electro-encephalogram
FFT  Fast Fourier transform
FT  Fourier transform
GA  Genetic algorithm
GC  Gas chromatography
ICP  Inductively coupled plasma
IR  Infrared
k-NN  k-nearest neighbours
LMS  Least median of squares
LOO  Leave-one-out
LV  Latent variable
LWR  Locally weighted regression
MCD  Minimum covariance determinant
MD  Mahalanobis distance
MLR  Multiple linear regression
MSC  Multiple scatter correction
MSEP  Mean squared error of prediction
MVE  Minimum volume ellipsoid
MVT  Multivariate trimming
NIPALS  Nonlinear iterative partial least squares
NIR  Near-infrared
NL-PCR  Non-linear principal component regression
NN  Neural networks
NPLS  N-way partial least squares
OLS  Ordinary least squares
PARAFAC  Parallel factor analysis
PC  Principal component
PCA  Principal component analysis
PCC  Partial correlation coefficient
PCR  Principal component regression
PCRS  Principal component regression with selection of PCs
PLS  Partial least squares
PP  Projection pursuit
PRESS  Prediction error sum of squares
QSAR  Quantitative structure-activity relationship
RBF  Radial basis function
RCE  Relevant components extraction
RMSECV  Root mean squared error of cross-validation
RMSEP  Root mean squared error of prediction
RVE  Relevant variable extraction
SNV  Standard normal variate
SPC  Statistical process control
SVD  Singular value decomposition
TLS  Total least squares
UVE  Uninformative variables elimination

NEW TRENDS IN MULTIVARIATE ANALYSIS AND CALIBRATION

INTRODUCTION

Many definitions have been given for chemometrics. One of the most frequently quoted [1] states the following: chemometrics is a chemical discipline that uses mathematics, statistics and formal logic (a) to design or select optimal experimental procedures; (b) to provide the maximum relevant chemical information by analysing chemical data; and (c) to obtain knowledge about chemical systems. This thesis focuses specifically on points (b) and (c) of this definition, and a particular emphasis is placed on multivariate methods and how they are used to model data. It should be noted that, while modelling is probably the most important area of chemometrics, there are many other applications, such as method validation, optimisation, statistical process control, signal processing, etc.

Modelling methods can be divided into two groups, even if these two groups often overlap widely. In multivariate data analysis, models are used directly for data interpretation. In multivariate calibration, models relate the data to a given property in order to predict this property. Modelling methods in general are introduced in Chapter 1. The most common multivariate data analysis and calibration methods are presented, as well as some more advanced ones, in particular methods applying to data with complex structure.
A particularity of chemometrics is that many methods used in the field were developed in other areas of science before being imported into chemistry. This is for instance the case for Partial Least Squares, which was initially developed to build econometric models. Chemometrics also covers a very wide domain of application, and specialists in each field develop or modify the methods best suited to their particular applications. As a result, many methods are often available for a given problem. The first step of the chemometrical methodology is therefore to select the most appropriate method to use. The importance of this step is illustrated in Chapter 2. Multivariate calibration methods are compared on data with different structures. This comparison is performed in situations that are challenging for the methods (data extrapolation, instrumental perturbation). A detailed description of the steps necessary to develop a multivariate calibration model is also provided, using Multiple Linear Regression as a reference method.

Multivariate calibration and Near Infrared (NIR) spectroscopy have a parallel history. NIR could only be routinely implemented through the use of sophisticated chemometrical tools and the advent of modern computing. Chemometrical methods were then widely promoted by the remarkable achievements of multivariate calibration applied to NIR data. For many years, multivariate calibration and NIR spectroscopy were therefore almost synonymous for the non-specialist. In the last few years, chemometrical methods have proved very efficient on other types of analytical data, sometimes even for analytical methods that were not considered to require sophisticated data treatment. It is shown in Chapter 3 how Raman spectroscopic data can benefit from chemometrics in general and multivariate calibration in particular, allowing the use of Raman in a growing number of industrial applications. This chapter also illustrates the importance of method selection in chemometrics, and shows that the choice of the most appropriate method can depend on many factors, for instance the quality of the data set.

In recent years, the data treated by chemometricians have tended to become more and more complex. This complexity can be understood in terms of the volume of data, or in terms of data structure. The increasing size of chemometrical data sets has several causes. For instance, combinatorial chemistry and high-throughput screening are designed to generate large volumes of data. Collections of samples recorded over time also tend to get larger and larger. The improvement of analytical instruments leads to better spectral resolutions and therefore larger data sets (sometimes several tens of thousands of items). This last point is illustrated in Chapter 4. It is shown how calibration methods specifically designed to be fast can considerably reduce the computation time required for calibration and for the prediction of new samples. The complexity of a data set can also be understood in terms of data structure. Methods developed in the area of psychometrics, which make it possible to treat data that are not only multivariate but also multi-modal, were recently introduced into the chemometrical field. Chapter 4 shows how this kind of method can be used to extract information from a very complex data set with up to six modes.
This chapter gives another illustration of the fact that chemometrical methods can be applied to new types of data, even outside the strict domain of chemistry, since the multi-modal methods are applied to pharmaceutical electro-encephalographic data. Another example is given showing how these methods can be adapted in order to be made more robust towards difficult data sets.

REFERENCES

[1] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. de Jong, P.J. Lewi and J. Smeyers-Verbeke, Handbook of Chemometrics, Elsevier, Amsterdam, 1997.

CHAPTER I
MULTIVARIATE ANALYSIS AND CALIBRATION

Adapted from:
CHEMOMETRICS AND MODELLING
Computational Chemistry Column, Chimia, 55, 70-80 (2001).
F. Estienne, Y. Vander Heyden and D.L. Massart
Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium. E-mail: fabi@fabi.vub.ac.be

1. Introduction

There are two types of modelling. Modelling can in the first place be applied to extract useful information from a large volume of data, or to achieve a better understanding of complex phenomena. This kind of modelling is sometimes done through the use of simple visual representations. Depending on the type of data studied and the field of application, modelling is then referred to as exploratory multivariate analysis or data mining. Modelling can in the second place be applied when two or more characteristics of the same objects are measured or calculated and then related to each other. It is for instance possible to relate the concentration of a chemical compound to an instrumental signal, the chemical structure of a drug to its activity, or instrumental responses to sensory characteristics. In these situations, the purpose of modelling usually is, after a calibration process, to make predictions (e.g. predict the concentration of a certain analyte in a sample from a measured signal), but it can sometimes simply be to verify the nature of the relationship. The two types of modelling strongly overlap. The methods introduced in this chapter will therefore not be presented as being exploration or calibration oriented, but rather will be introduced in order of increasing complexity of the type of data or modelling problem they are applied to.

2. Univariate regression

2.1. Classical univariate least squares: straight line models

Before introducing some of the more sophisticated methods, we should look briefly at the classical univariate least squares methodology (often called ordinary least squares, OLS), which is what analytical chemists generally use to construct a (linear) calibration line. In most analytical techniques the concentration of a sample cannot be measured directly but is derived from a measured signal that is directly related to the concentration. Suppose the vector x represents the concentrations of the samples and y the corresponding measured instrumental signal. To be able to define a model y = f(x), a relationship between x and y has to exist. The simplest and most convenient situation is when the relation is linear, which leads to a model of the type:

y = b0 + b1 x   (1)

which is the equation of a straight line. The coefficients b0 and b1 represent the intercept and the slope of the line.
Relationships between y and x that follow a curved line can for instance be represented by a regression model of the type:

y = b0 + b1 x + b11 x²   (2)

Least squares regression analysis is a methodology that makes it possible to estimate the coefficients of a given model. For calibration purposes one usually focuses on straight-line models, which we will also do in the rest of this section. Conventionally the x-values represent the so-called controlled or independent variable, i.e. the variable that is considered not to have a measurement error (or a negligible one), which is the concentration in our case. The y-values represent the dependent variable, i.e. the measured response, which is considered to have a measurement error. The least squares approach yields b0 and b1 values such that the model fits the measured points (xi, yi) best.

Fig. 1. Straight line fitting through a series of measured points.

The true relationship between x and y is considered to be y = β0 + β1 x, while the relationship between each xi and its measured yi can be represented as yi = b0 + b1 xi + ei. The signal yi is composed of a component predicted by the model, b0 + b1 xi, and a random component, ei, the residual (Fig. 1). The least squares regression finds the estimates b0 and b1 for β0 and β1 by calculating the values b0 and b1 for which Σei² = Σ(yi – b0 – b1 xi)², the sum of the squared residuals, is minimal. This explains the name "least squares". Standard books about regression, including least squares approaches, are [1,2]. Analytical chemists can find information in [3,4].
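To make this concrete, a minimal numerical sketch of the straight-line least squares fit is given below. This example is added for illustration only and is not part of the original text; the data and variable names are arbitrary.

```python
import numpy as np

# Hypothetical calibration data: concentrations (x) and measured signals (y).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.05, 1.02, 2.10, 2.95, 4.08, 5.01])

# Closed-form OLS estimates minimising the sum of squared residuals
# sum((y - b0 - b1*x)**2): b1 = cov(x, y) / var(x), b0 = mean(y) - b1*mean(x).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, sum of squared residuals = {np.sum(residuals**2):.4f}")
```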
2.2. Some variants of the univariate least squares straight line model

A fundamental assumption of OLS is that there are only errors in the direction of y. In some instances, two measured quantities are related to each other and the assumption then does not hold, because there are also measurement errors in x. This is for instance the case when two analytical methods are compared to each other. Often one of these methods is a reference method and the other a new method, which is faster or cheaper, and one wants to demonstrate that the results of both methods are sufficiently similar. A certain number of samples are analysed with both methods and a straight line model relating both series of measurements is obtained. If β0 as estimated from b0 is not more different from 0 than an a priori accepted bias, and β1 as estimated by b1 is not more different from 1 than a given amount, then one can accept that for practical purposes y = x. In its simplest statistical expression, this means that it is tested whether β0 = 0 and β1 = 1, or, to put it another way, whether b0 is statistically different from 0 and/or b1 is statistically different from 1. If this is the case, then it is concluded that the two methods do not yield the same result, but that there is a constant (intercept) or proportional (slope) systematic error or bias. This means that one should calculate b0 and b1, and at first sight this could be done by OLS. However, both regression variables (not only yi but now also xi) are subject to error, as already mentioned. This violates one of the key assumptions of the OLS calculations. It has been shown [4-7] that the computation of b0 and b1 according to the OLS method leads to wrong estimates of β0 and β1. Significant errors in the least squares estimate of b1 can be expected if the ratio between the measurement error on the x-values and the range of the x-values is large. In that case OLS should not be used. To obtain correct values for b0 and b1 the sum of squares must now be minimised in the direction shown in figure 2. Such methods are sometimes called errors-in-variables models or orthogonal least squares. Detailed studies of the application of models of these types can be found in [8,9].

Fig. 2. The errors-in-variables model.

Another possibility is to apply inverse regression. The term inverse is used in opposition to the usual calibration procedure. Calibration consists of measuring samples with a known characteristic and deriving a calibration line (or more generally a model). A measurement is then carried out for an unknown sample and its concentration is derived from the measurement result and the calibration line. In view of the assumptions of OLS, the measurement is the y-value and the concentration the x-value, i.e.

measurement = f(concentration)   (3)

This relationship can be inverted to become

concentration = f(measurement)   (4)

OLS is then applied in the usual way, meaning that the sum of the squared residuals is minimised in the direction of y, which is now the concentration. This may appear strange since, when the calibration line is computed, there are no errors in the concentrations. However, if it is taken into account that there will be an error in the predicted concentration of the unknown sample, then minimising in this way means that one minimises the prediction errors, which is what is important to the analytical chemist. It has indeed been shown that better results are obtained in this way [10-12]. The analytical chemist should therefore really apply eq. (4) instead of the usual eq. (3). In most cases the difference in prediction quality between both approaches is very small in practice, so that there is generally no harm in applying eq. (3). We will see however that, when multivariate calibration is applied, inverse regression is the rule. It should be noted that, when the aim is not to predict y-values but to obtain the best possible estimates of β0 and β1, inverse regression performs worse than the usual procedure.

Fig. 3. The leverage effect.

2.3. Robust regression

One of the most frequently occurring difficulties for an experimentalist is the presence of outliers. The outliers may be due to experimental error or to the fact that the proposed model does not represent the data well enough. For example, if the postulated model is a straight line, and measurements are made in a concentration range where this is no longer true, the measurements obtained in that region will be model outliers. In figure 3 it is clear that the last point is not representative of the straight line fitted by the rest of the data. The outlier attracts the regression line computed by OLS. It is said to exert leverage on the regression line. One might think that outliers can be discovered by examining the residuals towards the line. As can be observed, this is not necessarily true: the outlier's residual is not much larger than that of some other data points. To avoid the leverage effect, the outlier(s) should be eliminated. One way to achieve this is to use more efficient outlier diagnostics than simply looking at residuals. Cook's squared distance or the Mahalanobis distance can for instance be used. A still more elegant way is to apply so-called robust regression methods.
The easiest to explain is the single median method [13]. The slope between each pair of points is computed. For instance, the slope between points 1 and 2 is 1.10, between points 1 and 3 it is 1.00, and between points 5 and 6 it is 6.20. The complete list is 1.10, 1.00, 1.03, 0.95, 2.00, 0.90, 1.00, 0.90, 2.23, 1.10, 0.90, 2.67, 0.70, 3.45, 6.20. These are now ranked and the median slope (here the 8th value, 1.03) is chosen. All pairs of points of which the outlier is one point have high slope values and end up at the end of the ranking, so that they do not influence the chosen median slope: even if the outlier were still more distant, the selected median would still be the same. A similar procedure for the intercept, which we will not explain in detail, leads to the straight line equation y = 0.00 + 1.03 x, which is close to the line obtained with OLS after eliminating the outlier. The single median method is not the best robust regression method. Better results are obtained with the least median of squares (LMS) method [14], iteratively re-weighted regression [15] or bi-weight regression [16]. Comparing the results of calibration lines obtained with OLS and with a robust method is one way of finding outliers towards a regression model [17].
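The single median idea can be sketched in a few lines of code. This is an added illustration, not taken from the thesis: the intercept is computed here as the median of y - b1·x, one common variant of the intercept procedure that the text leaves unspecified, and the data are arbitrary.

```python
import numpy as np
from itertools import combinations

def single_median_line(x, y):
    """Robust straight-line fit: median of all pairwise slopes,
    intercept taken as the median of y - slope * x."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    b1 = np.median(slopes)
    b0 = np.median(y - b1 * x)
    return b0, b1

# Six points, the last one being an outlier that would attract an OLS line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.1, 3.1, 4.0, 5.2, 11.4])
print(single_median_line(x, y))
```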
3. Multiple Linear Regression

3.1. Multivariate (multiple) regression

Multivariate regression, also often called multiple regression or, in the linear case, multiple linear regression (MLR), is used to obtain values for the b-coefficients in an equation of the type:

y = b0 + b1 x1 + b2 x2 + … + bm xm   (5)

where x1, x2, …, xm are different variables. In analytical spectroscopic applications, these variables could be the absorbances obtained at different wavelengths, y being a concentration or another characteristic of the samples to be predicted; in QSAR (the study of quantitative structure-activity relationships) they could be variables such as the hydrophobicity (log P) or the Hammett electronic parameter σ, with y being some measure of biological activity. In experimental design, equations of the type

y = b0 + b1 x1 + b2 x2 + b12 x1 x2 + b11 x1² + b22 x2²   (6)

are used to describe a response y as a function of the experimental variables x1 and x2. Both equations (5) and (6) are called linear, which may surprise the non-initiated, since the shape of the relationship between y and (x1, x2) is certainly not linear. The term linear should be understood as linear in the regression parameters. An equation such as y = b0 + log(x – b1) is non-linear [2]. It can be observed from the applications cited above that multiple regression models occur quite often. We will first consider the classical solution to estimate the coefficients. Later we will describe some more sophisticated methodologies introduced by chemometricians, such as those based on latent vectors. As in the univariate case, the b-values are estimates of the true β-parameters and the estimation is done by minimising a sum of squares. It can be shown that

b = (XᵀX)⁻¹ Xᵀ y   (7)

where b is the vector containing the b-values from eq. (5), X is an n x m matrix containing the x-values for n samples (or objects, as they are often called) and m variables, and y is the vector containing the measurements for the n samples. A difficulty is that the inversion of the XᵀX matrix leads to unstable results when the x-variables are highly correlated. There are two ways to avoid this problem. One is to select variables (variable selection or feature selection) such that the correlation is reduced; the other is to combine the variables in such a way that the resulting summarising variables are not correlated (feature reduction). Both feature selection and feature reduction lead to a smaller number of variables than the initial number, which by itself has important advantages.

3.2. Wide data matrices

Chemists often produce wide data matrices, characterised by a relatively small number of objects (a few tens to a few hundreds) and a very large number of variables (many hundreds, at least). For instance, analytical chemists now often apply very fast spectroscopic methods, such as near infrared spectroscopy (NIR). Because of the rapid character of the analysis, there is no time for dissolving the sample or separating certain constituents. The chemist tries to extract the required information from the spectrum as such, and to do so he has to relate a y-value, such as the octane number of gasoline samples or the protein content of wheat samples, to the absorbance at 500 to, in some cases, 10 000 wavelengths. These e.g. 1000 variables for 100 objects constitute the X matrix. Such matrices contain many more columns than rows and are therefore often called wide. Feature selection/reduction then takes on a completely different complexity compared to the situations described in the preceding sections. It should be remarked that the variables in such matrices are often highly correlated. This can for instance be expected for two neighbouring wavelengths in a spectrum. In the following sections, we will explain which methods chemometricians use to model very large, wide and highly correlated data matrices.

3.3. Feature selection methods

3.3.1. Stepwise selection

The classical approach, which is found in many statistical packages, is the so-called stepwise regression, a feature selection method. The so-called forward selection procedure consists of first selecting the variable that is best correlated with y. Suppose this is found to be xi. The model at this stage is restricted to y = f(xi). Then, one tests all other variables by adding them to the model, which then becomes a model in two variables, y = f(xi, xj). The variable xj which is retained together with xi is the one which, when added to the model, leads to the largest improvement compared to the original model y = f(xi). It is then tested whether the observed improvement is significant. If not, the procedure stops and the model is restricted to y = f(xi). If the improvement is significant, xj is incorporated definitively in the model. It is then investigated which variable should be added as the third one and whether this yields a significant improvement. The procedure is repeated until finally no further improvement is obtained. The procedure is based on analysis of variance, and several variants, such as backward elimination (starting with all variables and successively eliminating the least important ones) or a combination of forward and backward methods, have been proposed. It should be noted that the criteria applied in the analysis of variance are such that the selected variables are less correlated.
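A simplified sketch of such a forward selection is shown below. It is an added illustration, not the ANOVA-based procedure referred to above: the significance test is replaced by an arbitrary relative-improvement threshold, and the data are synthetic.

```python
import numpy as np

def rss(X_sel, y):
    """Residual sum of squares of an MLR model (with intercept) on the selected columns."""
    A = np.column_stack([np.ones(len(y)), X_sel])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ b
    return float(r @ r)

def forward_selection(X, y, min_rel_improvement=0.05):
    selected, remaining = [], list(range(X.shape[1]))
    current = float(((y - y.mean()) ** 2).sum())   # RSS of the empty (mean-only) model
    while remaining:
        trials = {j: rss(X[:, selected + [j]], y) for j in remaining}
        best = min(trials, key=trials.get)
        # Keep the candidate only if the improvement is large enough
        # (a stand-in for the significance test described in the text).
        if (current - trials[best]) / current < min_rel_improvement:
            break
        selected.append(best)
        remaining.remove(best)
        current = trials[best]
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 12))
y = 3.0 * X[:, 2] + 1.0 * X[:, 7] + 0.1 * rng.normal(size=50)
print(forward_selection(X, y))   # expected to pick variables 2 and 7
```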
In certain contexts, such as experimental design or QSAR, the reason for applying feature selection is not only to avoid the numerical difficulties described above, but also to explain relationships. The variables that are included in the regression equation have a chemical and physical meaning, and when a certain variable is retained it is considered that the variable influences the y-value, e.g. the biological activity, which then leads to proposals for causal relationships. Correct feature selection then becomes very important in those situations to avoid drawing wrong conclusions. One of the problems is that the procedures involve regressing many variables on y, and chance correlation may then occur [18]. There are other difficulties, for instance the choice of the experimental conditions, the samples or the objects. These should cover the experimental domain as well as possible and, where possible, follow an experimental design. This is demonstrated, for instance, in [19]. Outliers can also cause problems. Detection of multivariate outliers is not straightforward. As for univariate regression, robust regression is possible [14,20]. An interesting example in which multivariate robust regression is applied concerns an experimental design [21] carried out to optimise the yield of an organic synthesis.

3.3.2. Genetic algorithms for feature selection

Genetic algorithms are general optimisation tools aiming at selecting the fittest solution to a problem. Suppose that, to keep it simple, 9 variables are measured. Possible solutions are represented in figure 4. Selected variables are indicated by a 1, non-selected variables by a 0. Such solutions are sometimes, in analogy with genetics, called chromosomes in the jargon of the specialists. By random selection a set of such solutions is obtained (in real applications often several hundreds). For each solution an MLR model is built using an equation such as (5), and the sum of squares of the residuals of the objects towards that model is determined. In the jargon of the field, one says that the fitness of each solution is determined: the smaller the sum of squares, the better the model describes the data and the fitter the corresponding solutions are.

Fig. 4. A set of solutions for feature selection from nine variables for MLR.

Then follows what is described as the selection of the fittest (leading to names such as genetic algorithms or evolutionary computation). For instance, out of the, say, 100 original solutions, the 50 fittest are retained. They are called the parent generation. From these a child generation is obtained by reproduction and mutation. Reproduction is explained in figure 5. Two randomly chosen parent solutions produce two child solutions by cross-over. The cross-over point is also chosen randomly. The first part of solution 1 and the second part of solution 2 together yield child solution 1'. Solution 2' results from the first part of solution 2 and the second part of solution 1.

Fig. 5. Genetic algorithms: the reproduction step.

The child solutions are added to the selected parent solutions to form a new generation. This is repeated for many generations and the best solution from the final generation is retained. Each generation is additionally submitted to mutation steps. Here and there, randomly chosen bits of the solution string are changed (0 to 1 or 1 to 0). This is applied in figure 6.

Fig. 6. Genetic algorithms: the mutation step.

The need for the mutation step can be understood from figure 5. Suppose that the best solution is close to one of the child solutions in that figure, but should not include variable 9. However, because the value for variable 9 is 1 in both parents, it is also unavoidably 1 in the children. Mutation can change this and move the solutions in a better direction. Genetic algorithms were first proposed by Holland [22]. They were introduced in chemometrics by Lucasius et al. [23] and Leardi et al. [24]. They have been applied for instance in QSAR and molecular modelling [25], conformational analysis [26], and multivariate calibration for the determination of certain characteristics of polymers [27] or octane numbers [28]. Reviews about applications in chemistry can be found in [29,30]. There are several competing algorithms, such as simulated annealing [31] or the immune algorithm [32].
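The following sketch illustrates the whole selection cycle on synthetic data. It is an added illustration under simplifying assumptions, not the exact algorithm discussed above: binary chromosomes encode the selected variables, fitness is the residual sum of squares of an MLR fit, and new generations are produced by single-point cross-over and bit-flip mutation. All names and settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(chrom, X, y):
    """Smaller residual sum of squares of an MLR model on the selected variables = fitter."""
    if chrom.sum() == 0:
        return np.inf
    Xs = np.column_stack([np.ones(len(y)), X[:, chrom.astype(bool)]])
    b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ b
    return float(resid @ resid)

def ga_select(X, y, n_pop=100, n_keep=50, n_gen=30, p_mut=0.01):
    n_var = X.shape[1]
    pop = rng.integers(0, 2, size=(n_pop, n_var))            # random initial chromosomes
    for _ in range(n_gen):
        scores = np.array([fitness(c, X, y) for c in pop])
        parents = pop[np.argsort(scores)[:n_keep]]            # selection of the fittest
        children = []
        while len(children) < n_pop - n_keep:
            p1, p2 = parents[rng.integers(n_keep, size=2)]
            cut = rng.integers(1, n_var)                       # single-point cross-over
            children.append(np.concatenate([p1[:cut], p2[cut:]]))
        pop = np.vstack([parents, children])
        mutate = rng.random(pop.shape) < p_mut                 # bit-flip mutation
        pop = np.where(mutate, 1 - pop, pop)
    scores = np.array([fitness(c, X, y) for c in pop])
    return pop[np.argmin(scores)]

# Toy example: y depends only on variables 0 and 3 of nine candidate variables.
X = rng.normal(size=(40, 9))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=40)
print(ga_select(X, y))
```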
4. Feature reduction: Latent Variables

The alternative to feature selection is to combine the variables into what we earlier called summarising variables. Chemometricians call these latent variables, and the obtaining of such variables is called feature reduction. It should be understood that in this case no variables are discarded.

4.1. Principal Component Analysis

The type of latent variable most commonly used is the principal component (PC). To explain the principle of PCs, we will first consider the simplest possible situation. Two variables (x1 and x2) were measured for a certain number of objects and the number of variables should be reduced to one. In principal component analysis (PCA) this is achieved by defining a new axis or variable on which the objects are projected. The projections are called the scores, s1, along principal component 1, PC1 (Fig. 7).

Fig. 7. Feature reduction of two variables, x1 and x2, by a principal component.

The projections along PC1 preserve the information present in the x1-x2 plot, namely that there are two groups of data. By definition, PC1 is drawn in the direction of the largest variation through the data. A second PC, PC2, can also be obtained. By definition it is orthogonal to the first one (Fig. 8a). The scores along PC1 and along PC2 can be plotted against each other, yielding what is called a score plot (Fig. 8b).

Fig. 8. a) Second PC and b) score plot of the data in Fig. 7.

The reader observes that PCA decorrelates: while the data points in the x1-x2 plot are correlated, they are no longer so in the s1-s2 plot. This also means that there was correlated and therefore redundant information present in x1 and x2. PCA picks up all the important information in PC1 and the rest, along PC2, is noise and can be eliminated. By keeping only PC1, feature reduction is applied: the number of variables, originally two, has been reduced to one. This is achieved by computing the score along PC1 as:

s = w1 x1 + w2 x2   (8)

In other words, the score is a weighted sum of the original variables. The weights are known as loadings, and plots of the loadings are called loading plots. This can now be generalised to m dimensions. In the m-dimensional space, PC1 is obtained as the axis of largest variation in the data; PC2 is orthogonal to PC1 and is drawn in the direction of largest remaining variation around PC1. It therefore contains less variation (and information) than PC1. PC3 is orthogonal to the plane of PC1 and PC2. It is drawn in the direction of largest variation around that plane, but contains less variation than PC2.
In the same way PC4 is orthogonal to the hyperplane PC1, PC2, PC3 and contains still less variation, etc. For a matrix with dimensions n x m, N = min(n, m) PCs can be extracted. However, since each of them contains less and less information, at a certain point they contain only noise and the process can be stopped before reaching N. If only d << N PCs are retained, then feature reduction is achieved. A very important application of principal components is to visually display the information present in the data set, and most multivariate data applications therefore start with score and/or loading plots. The score plots give information about the objects and the loading plots about the variables. Both can be combined into a biplot, which is all the more effective after certain types of data transformation, e.g. spectral mapping [33].

In figure 9, a score plot is shown for an investigation into the Maillard reaction, a reaction between sugars and amino acids [34]. The samples consist of reaction mixtures of different combinations of sugars and amino acids. The variables are the areas under the peaks of the reaction mixtures. The reactions are very complex: 159 different peaks were observed. Each of the samples is therefore characterised by its value for 159 variables. The PC1-PC2 score plot of figure 9 can be seen as a projection of the samples from the 159-dimensional space to the two-dimensional space that best preserves the variance in the data. In the score plot, different symbols are given to the samples according to the sugar that was present, and it is observed for instance that samples with rhamnose occupy a specific location in the score plot. This is only possible if they also occupy a different place in the original 159-dimensional space, i.e. their GC chromatogram is different. By studying different parts of the data and by including the information from the loading plots, it is then possible to understand the effect of the starting materials on the obtained reaction mixture.

Fig. 9. PCA score plot of samples from the Maillard reaction. The samples with rhamnose are shown with a separate symbol.

Principal components have been used in many different fields of application. Whenever a table of samples x variables is obtained and some correlation between the variables is expected, a principal components approach is useful. Let us consider an environmental example [35]. In figure 10 the score plot is shown. The data consist of air samples taken at different times at the same sampling location. For each of the samples a capillary GC chromatogram was obtained. The different symbols given to the samples indicate the different wind directions prevailing at the time of sampling. Clearly the wind direction has an effect on the sample compositions. To understand this better, figure 11 gives a plot of the loadings of a few of the variables involved. It is observed that the loadings on PC1 are all positive and not very different. Referring to eq. (8), and remembering that the loadings are the weights (the w-values), this means that the score on PC1 is simply a weighted sum of the variables and therefore a global indicator of pollution. The samples with the highest score on PC1 are those with the highest degree of pollution. Along PC2 some variables have positive loadings and others negative loadings. Those of the aliphatic variables are positive and those of the aromatic variables are negative.
It follows that samples with positive scores on PC2 contain relatively more of the aliphatic compounds, and samples with negative scores relatively more of the aromatic ones.

Fig. 10. PCA score plot of air samples.

Fig. 11. PCA loading plot of a few variables measured on the air samples.

Combining PC1 and PC2, one can then conclude that samples with symbol x have an aliphatic character and that the total content increases with higher values on PC1. The same reasoning holds for the samples with symbol •: they have an aromatic character. In fact, one could define new aliphaticity and aromaticity factors as in figure 12. This can be done in a more formal way using what is called factor analysis.

Fig. 12. New fundamental factors discovered on a score plot.

4.2. Other latent variables

There are other types of latent variables. In projection pursuit [34,36] a latent variable is chosen such that, instead of the largest variation in the data set, it describes the largest inhomogeneity. In that way clusters or outliers can be observed more easily. Figure 13 shows the result of applying this to the Maillard data of figure 9, and it appears that the cluster of rhamnose samples can now be observed more clearly.

Fig. 13. Projection pursuit plot of samples from the Maillard reaction. The samples with rhamnose are shown with a separate symbol.

If the y-values are not characteristics observed for a set of samples, but the class membership of the samples (e.g. samples 1-10 belong to class A, samples 11-25 to class B), then a latent variable can be defined that describes the largest discrimination between the classes. Such latent variables are called canonical variates or sometimes linear discriminant functions, and are the basis for supervised pattern recognition methods such as linear discriminant analysis. In the partial least squares (PLS) section, still another type of latent factor will be introduced.

4.3. N-way methods

Some data have a more complex structure than the classical 2-way matrix or table. Typical examples are for instance met in environmental chemistry [37]. A set of n variables can be measured at m different locations at p different times. This leads to a 3-way data set with dimensions n x m x p. The three ways (or modes) are the variable mode, the location mode and the time mode. This can of course be generalised to a higher number of modes, but for the sake of simplicity we will restrict the figures and formulas here to 3-way data. The classical approach to study such data is to perform what is called unfolding. Unfolding consists in rearranging a 3-way matrix into a 2-way matrix. The 3-way array can be considered as several 2-way tables (slices of the original matrix), and these tables can be put next to each other, leading to a new 2-way array (Fig. 14). This rearranged matrix can be treated with PCA. Considering the example of figure 14, the scores will carry information about the locations, and the loadings mixed information about the two other modes.

Fig. 14. Unfolding of a 3-way matrix, performed preserving the 'Location' dimension.

Unfolding can be performed in different directions so that each of the three modes is successively preserved in the unfolded matrix. In this way, three different PCA models can be built, the scores of each of these models giving information about one of the modes. This approach is called the Tucker1 model. It is the first of a series of Tucker models [38]. The most important of these is the Tucker3 model.
Tucker3 is a true n-way method as it takes into account the multi-way structure of the data. It consists in building, through an iterative process, a score matrix for each of the modes, and a core matrix defining the interactions between the modes. As in PCA, the components in each mode are constrained to be orthogonal. The number of components can be different in each mode. A graphical representation of the Tucker3 model for 3-way data is given in figure 15. It appears as a sum, weighted by the core matrix G, of outer products between the factors stored as columns in the A, B and C score matrices.

Fig. 15. Graphical representation of the Tucker3 model. n, m and p are the dimensions of the original matrix X. w1, w2 and w3 are the numbers of components extracted on modes 1, 2 and 3 respectively, corresponding to the numbers of columns of the loading matrices A, B and C respectively.

Another common n-way model is the Parafac-Candecomp model, which was proposed simultaneously by Harshman [39] and by Carroll and Chang [40]. Information about n-way methods (and software) can be found in refs. [41-43]. Applications in process control [44,45], environmental chemistry [37,46], food chemistry [47], curve resolution [48] and several other fields have been published.

5. Calibration on latent variables

5.1. Principal component regression (PCR)

Until now we have applied latent variables only for display purposes. Principal components can however also be used as the basis of a regression method. This is applied among others when the x-values constitute a wide X matrix, for example in NIR calibration (see earlier). Instead of the original x-values one uses the reduced ones, the scores. Suppose m variables (e.g. 1000) were measured for n samples (e.g. 100). As explained earlier, this requires either feature selection or feature reduction. The latter can be achieved by replacing the m x-values by the scores on the k significant PCs (e.g. 5). The X matrix then no longer consists of 100 x 1000 absorbance values but of 100 x 5 scores, since each of the 100 samples is now characterised by 5 scores instead of 1000 variables. The regression model is:

y = a1 s1 + a2 s2 + … + a5 s5   (9)

Since:

s = w1 x1 + w2 x2 + … + w1000 x1000   (10)

eq. (9) becomes:

y = b1 x1 + b2 x2 + … + b1000 x1000   (11)

By using the principal components as intermediates it is therefore possible to solve the wide X matrix regression problem. It should also be noted that the principal components are by definition not correlated, so that the correlation problem mentioned earlier is also solved.
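The two-step nature of PCR (scores first, regression second) can be sketched as follows. This is an added illustration with arbitrary dimensions, using an SVD to obtain the loadings; it is not code from the original chapter.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal component regression: regress y on the scores of the first k PCs of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean                                   # mean-centre the predictors
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                                      # loadings of the first k PCs
    T = Xc @ P                                        # scores (the regressors of eq. 9)
    a = np.linalg.lstsq(T, y - y_mean, rcond=None)[0]
    b = P @ a                                         # back-transformed coefficients (eq. 11)
    return b, y_mean - x_mean @ b

def pcr_predict(X, b, b0):
    return X @ b + b0

# Toy wide data set: 30 samples, 200 correlated "wavelengths".
rng = np.random.default_rng(1)
conc = rng.uniform(0, 1, 30)
spectra = np.outer(conc, rng.normal(size=200)) + 0.01 * rng.normal(size=(30, 200))
b, b0 = pcr_fit(spectra, conc, k=3)
print(pcr_predict(spectra[:3], b, b0), conc[:3])
```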
5.2. Partial least squares (PLS)

The aim of partial least squares is the same as that of PCR, namely to model a set of y-values with the data contained in an (often) wide matrix of correlated variables. However, the approach is different. In PCR, one works in two steps: in the first the scores are obtained and only the X matrix is involved; in the second the y-values are related to the scores. In PLS this is done in only one step. The latent variables are obtained, not with the variation in X as criterion as is the case for principal components, but such that the new latent variable shows maximal covariance between X and y. This means that the latent variable is built directly as a function of the relationship between y and X. In principle one therefore expects that PLS would perform better than PCR, but in practice they often perform equally well. A tutorial can be found in [49]. Several algorithms are available. A very effective one, requiring the least computer time according to our experience, is SIMPLS [50].

5.3. Applications of PCR and PLS

PCR and PLS have been applied in many different fields. The following references constitute a somewhat haphazard selection from a very large literature. There are many analytical applications in the pharmaceutical industry [51], the petroleum industry [52], food science [53] and environmental chemistry [54]. The methods are used with near or mid infrared [55], chromatographic [56], Raman [57], UV [58] and potentiometric [59] data. A good overview of applications in QSAR is found in [60].

5.4. PLS2 and other methods describing the relationship between two tables

Instead of relating one y-value to many x-values, it is possible to model a set of y-values with a set of x-values. This means that one relates two matrices Y and X, or in other words two tables. For instance, one could measure for a certain set of samples a number of sensory characteristics on the one hand and obtain analytical measurements on the other. This would yield two tables as depicted in figure 16. One could then wonder whether it is possible to predict the sensory characteristics from the (easier to measure) chemical measurements, or at least to understand which (combinations of) analytical measurements are related to which sensory characteristics. At the same time one wants to obtain information about the structure of each of the two tables (e.g. which analytical variables give similar information). PLS2 can be used for this purpose. Other methods that can be applied are for instance canonical correlation and reduced rank regression. An example relating 20 measurements of mechanical strength of meat patties to the sensory evaluation of textural attributes can be found in [61], and a comparison of methods in [62].

Fig. 16. Relating two 2-way tables.

5.5. Generalisation

It is also possible to relate multi-way models to a vector of y-values or to 2-way tables. In the same way as with 2-way data, the latent variables obtained in the multi-way models are then used to build the regression models [63]. The multi-way analogue of PCR would consist in modelling the original data with Tucker3 or Parafac, and then regressing the dependent y-variable on the obtained scores. A more sophisticated N-way version of PLS (N-PLS) was also developed [64]. The principle of N-PLS is to fit a model similar to Parafac, but aiming at maximising the covariance between the dependent and independent variables instead of fitting the model in a least squares sense. The usefulness of such approaches will be apparent from figure 17. In process analysis, one is concerned with the quality of finished batches, and this can be described by a number of quality parameters. At the same time, for each batch a number of variables can be measured on the process as a function of time [65]. This yields a two-way table on the one hand and a three-way one on the other. Relating these tables allows the quality of a batch to be predicted from the measurements made during the process.

Fig. 17. Relating a two-way and a three-way table.

6. Conclusion

The most common chemometrical modelling methods were introduced in this chapter, together with some more advanced ones, in particular methods applying to data with complex structure. These concepts will be developed in the further chapters.

REFERENCES

[1] N.R. Draper and H. Smith, Applied Regression Analysis, Wiley, New York, 1981.
[2] J. Mandel, The Statistical Analysis of Experimental Data, Wiley & Sons, New York, 1964; Dover reprint, 1984.
[3] D.L. MacTaggart and S.O. Farwell, J. Assoc. Off. Anal. Chem., 75, 594, 1992.
[4] J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood, Chichester, 3rd ed., 1993.
[5] W.E. Deming, Statistical Adjustment of Data, Wiley, New York, 1943.
[6] P.T. Boggs, C.H. Spiegelman, J.R. Donaldson and R.B. Schnabel, J. Econometrics, 38, 169, 1988.
[7] P.J. Cornbleet and N. Gochman, Clin. Chem., 25, 432, 1979.
[8] C. Hartmann, J. Smeyers-Verbeke and D.L. Massart, Analusis, 21, 125, 1993.
[9] J. Riu and F.X. Rius, J. Chemometr., 9, 343, 1995.
[10] R.G. Krutchkoff, Technometrics, 9, 425, 1967.
[11] V. Centner, D.L. Massart and S. de Jong, Fresenius J. Anal. Chem., 361, 2, 1998.
[12] B. Grientschnig, Fresenius J. Anal. Chem., 367, 497, 2000.
[13] H. Theil, Nederlandse Akademie van Wetenschappen Proc., Ser. A, 53, 386, 1950.
[14] P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection, Wiley, New York, 1987.
[15] G.R. Phillips and E.R. Eyring, Anal. Chem., 55, 1134, 1983.
[16] F. Mosteller and J.W. Tukey, Data Analysis and Regression, Addison-Wesley, Reading, 1977.
[17] P. Van Keerberghen, J. Smeyers-Verbeke, R. Leardi, C.L. Karr and D.L. Massart, Chemom. Intell. Lab. Syst., 28, 73, 1995.
[18] J.G. Topliss and R.J. Costello, J. Med. Chem., 15, 1066, 1972.
[19] M. Sergent, D. Mathieu, R. Phan-Tan-Luu and G. Drava, Chemom. Intell. Lab. Syst., 27, 153, 1995.
[20] A.C. Atkinson, J. Am. Stat. Assoc., 89, 1329, 1994.
[21] S. Morgenthaler and M.M. Schumacher, Chemom. Intell. Lab. Syst., 47, 127, 1999.
[22] J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975; revised reprint, MIT Press, Cambridge, 1992.
[23] C.B. Lucasius, M.L.M. Beckers and G. Kateman, Anal. Chim. Acta, 286, 135, 1994.
[24] R. Leardi, R. Boggia and M. Terrile, J. Chemom., 6, 267, 1992.
[25] J. Devillers (ed.), Genetic Algorithms in Molecular Modeling, Academic Press, London, 1996.
[26] M.L.M. Beckers, E.P.P.A. Derks, W.J. Melssen and L.M.C. Buydens, Comput. Chem., 20, 449, 1996.
[27] D. Jouan-Rimbaud, D.L. Massart, R. Leardi and O.E. de Noord, Anal. Chem., 67, 4295, 1995.
[28] R. Meusinger and R. Moros, Chemom. Intell. Lab. Syst., 46, 67, 1999.
[29] P. Willet, Trends Biochem., 13, 516, 1995.
[30] D.H. Hibbert, Chemom. Intell. Lab. Syst., 19, 277, 1993.
[31] J.H. Kalivas, J. Chemom., 5, 37, 1991.
[32] X.G. Shao, Z.H. Chen and X.Q. Lin, Fresenius J. Anal. Chem., 366, 10, 2000.
[33] P.J. Lewi, Arzneim. Forschung, 26, 1295, 1976.
[34] Q. Guo, W. Wu, F. Questier, D.L. Massart, C. Boucon and S. de Jong, Anal. Chem., 72, 2846.
[35] J. Smeyers-Verbeke, J.C. Den Hartog, W.H. Dekker, D. Coomans, L. Buydens and D.L. Massart, Atmos. Environ., 18, 2471, 1984.
[36] J.H. Friedman, J. Am. Stat. Assoc., 82, 249, 1987.
[37] P. Barbieri, C.A. Andersson, D.L. Massart, S. Predonzani, G. Adami and G.E. Reisenhofer, Anal. Chim. Acta, 398, 227, 1999.
[38] L.R. Tucker, Psychometrika, 31, 279, 1966.
[39] R. Harshman, UCLA Working Papers in Phonetics, 16, 1, 1970.
[40] J.D. Carroll and J. Chang, Psychometrika, 45, 283, 1970.
[41] C.A. Andersson and R. Bro, Chemom. Intell. Lab. Syst., 52, 1, 2000.
[42] M. Kroonenberg, Three-mode Principal Component Analysis: Theory and Applications, DSWO Press, Leiden, 1983; reprint 1989.
[43] R. Henrion, Chemom. Intell. Lab. Syst., 25, 1, 1994.
[44] P. Nomikos and J.F. MacGregor, AIChE Journal, 40, 1361, 1994.
[45] D.J. Louwerse and A.K. Smilde, Chem. Eng. Sci., 55, 1225, 2000.
[46] R. Henrion, Chemom. Intell. Lab. Syst., 16, 87, 1992.
[47] R. Bro, Chemom. Intell. Lab. Syst., 46, 133, 1998.
[48] A. de Juan, S.C. Rutan, R. Tauler and D.L. Massart, Chemom. Intell. Lab. Syst., 40, 19, 1998.
[49] P. Geladi and B.R. Kowalski, Anal. Chim. Acta, 185, 1, 1986.
[50] S. de Jong, Chemom. Intell. Lab. Syst., 18, 251, 1993.
[51] K.D. Zissis, R.G. Brereton, S. Dunkerley and R.E.A. Escott, Anal. Chim. Acta, 384, 71, 1999.
[52] C.J. de Bakker and P.M. Fredericks, Applied Spectroscopy, 49, 1766, 1995.
[53] S. Vaira, V.E. Mantovani, J.C. Robles, J.C. Sanchis and H.C. Goicoechea, Anal. Letters, 32, 3131, 1999.
[54] V. Simeonov, S. Tsakovski and D.L. Massart, Toxicological & Environmental Chemistry, 72, 81, 1999.
[55] J.B. Cooper, K.L. Wise, W.T. Welch, M.B. Summer, B.K. Wilt and R.R. Bledsoe, Applied Spectroscopy, 51, 1613, 1997.
[56] M.P. Montana, N.B. Pappano, N.B. Debattista, J. Raba and J.M. Luco, Chromatographia, 51, 727, 2000.
[57] O. Svensson, M. Josefson and F.W. Langkilde, Chemom. Intell. Lab. Syst., 49, 49, 2000.
[58] F. Vogt, M. Tacke, M. Jakusch and B. Mizaikoff, Anal. Chim. Acta, 422, 187, 2000.
[59] M. Baret, D.L. Massart, P. Fabry, C. Menardo and F. Conesa, Talanta, 50, 541, 1999.
[60] S. Wold, in: H. van de Waterbeemd (ed.), Chemometric Methods in Molecular Design, VCH, Weinheim, 1995.
[61] S. Beilken, L.M. Eadie, I. Griffiths, P.N. Jones and P.V. Harris, J. Food Sci., 56, 1465, 1991.
[62] B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. de Jong, P.J. Lewi and J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part B, Chapter 35, Elsevier, Amsterdam, 1998.
[63] R. Bro and H. Heimdal, Chemom. Intell. Lab. Syst., 34, 85, 1996.
[64] R. Bro, J. Chemom., 10, 47, 1996.
[65] C. Duchesne and J.F. MacGregor, Chemom. Intell. Lab. Syst., 51, 125, 2000.

CHAPTER II
COMPARISON OF MULTIVARIATE CALIBRATION METHODS

This chapter focuses specifically on multivariate calibration. As stated in the introduction of this thesis, a particularity of chemometrics is that many methods are often available for a given problem. This chapter therefore includes comparative studies and proposed methodologies aimed at helping select the most appropriate multivariate calibration method.

In the first two papers in this chapter, "A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part II: Predictive Ability under Extrapolation Conditions" and "A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part III: Robustness Against Instrumental Perturbation Conditions", methods are compared in challenging situations where the prediction of new samples requires mild extrapolation (Part II), or where the new data are affected by instrumental perturbation (Part III). This work follows a first comparative study (Part I) in which the various methods were compared on industrial data sets in situations where the previously mentioned difficulties did not occur [1]. The conclusions drawn in this first paper are presented in this chapter.
A third paper, published on the Internet, "The Development of Calibration Models for Spectroscopic Data using Multiple Linear Regression", proposes a complete methodology for the development of multivariate calibration models, from data acquisition to the prediction of new samples. This methodology is developed here in the case of Multiple Linear Regression. However, most of the scheme is easily transposable to most calibration methods, taking into account their particularities as developed in the first two publications of this chapter. Some specific aspects of Multiple Linear Regression are developed in detail, in particular the challenging problem of avoiding random correlation during variable selection. This paper is adapted from a publication devoted to Principal Component Regression, to which the author contributed by performing some of the calculations and by participating in the writing of the manuscript.

This chapter gives an overview of the methods used for multivariate calibration and of the way these methods should be used on data classically treated by chemometricians. In this sense, it can be considered as a state of the art of multivariate calibration.

REFERENCES

[1] V. Centner, G. Verdú-Andrés, B. Walczak, D. Jouan-Rimbaud, F. Despagne, L. Pasti, R. Poppi, D.L. Massart and O.E. de Noord, Appl. Spectrosc. 54 (4) (2000) 608-623.

A COMPARISON OF MULTIVARIATE CALIBRATION TECHNIQUES APPLIED TO EXPERIMENTAL NIR DATA SETS. PART II: PREDICTIVE ABILITY UNDER EXTRAPOLATION CONDITIONS.

Chemometrics and Intelligent Laboratory Systems, 58 (2) (2001) 195-211.

F. Estienne, L. Pasti, V. Centner, B. Walczak+, F. Despagne, D. Jouan-Rimbaud, O.E. de Noord¹, D.L. Massart*
ChemoAC, Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium. E-mail: fabi@fabi.vub.ac.be
+ On leave from: Silesian University, Katowice, Poland
¹ Shell International Chemicals B.V., Shell Research and Technology Centre Amsterdam, P.O. Box 38000, 1030 BN Amsterdam, The Netherlands
* Corresponding author

ABSTRACT

The present study compares the performance of different multivariate calibration techniques when new samples to be predicted can fall outside the calibration domain. The results of the calibration methods are investigated for extrapolation of different types and at various levels. The calibration methods are applied to five near-IR data sets including difficulties often met in practical cases (non-linearity, non-homogeneity and the presence of irrelevant variables in the set of predictors). The comparison leads to general recommendations about which method to use when samples requiring extrapolation can be expected in a calibration application.

KEYWORDS: Multivariate calibration, method comparison, extrapolation, non-linearity, clustering.

1 - Introduction

Calibration techniques make it possible to relate instrumental responses consisting of a set of predictors X (i.e. the NIR spectra) to a chemical or physical property of interest y (the response factor). The choice of the most appropriate calibration method is crucial in order to obtain calibration models with good performance in the prediction of the property y of new samples. When performing calibration, two situations can occur. The first case is met when it is possible to artificially produce the samples to be analysed.
Statistical designs such as factorial or mixture designs can then be used to generate the calibration set [1-2]. The second situation arises when it is not possible to synthesise the calibration samples, for instance for natural products (e.g. petroleum, wheat) or complex mixtures generated from industrial plants (e.g. gasoline, polymers). This second situation was considered in the present work. In this case, the selection of calibration samples is performed over a population of available samples. It is difficult to foresee the full extent of the sources of variation to be encountered in new samples on which a prediction will be carried out. Therefore, some samples may fall outside the calibration space, leading to a certain degree of extrapolation in the prediction of these new samples. Although it is often stated that extrapolation is not allowed, in many practical situations the time delay caused by new laboratory analyses and model updating is not acceptable. The aim of the present work is to evaluate, in a general case, the performance of calibration methods when such mild extrapolation occurs. To investigate the effect of the extrapolation on the performance of the different calibration models, two types of extrapolation were considered:
• X-space extrapolation: objects of the test subset are situated outside the space spanned by the objects of the calibration set, but may have y-values within the calibration range.
• y-value extrapolation: objects in the test subset have a higher or a lower y value than the objects in the calibration set.
The methods to be compared were selected on the basis of the results obtained in the first part of this study [3]. In this first part, the comparison of the calibration methods in terms of predictive ability was performed under conditions excluding extrapolation. Only the methods that yielded good results in this first stage of the comparison have been used in this part. The data sets are the same as those investigated in the first part of the study, except for one that was added because of its interesting structure (clustered and non-linear). The data sets include difficulties often met in practice, namely data clustering, non-linearity, and the presence of irrelevant variables in the set of predictors. In this study, objects of the test subsets were selected so that their prediction requires extrapolation. The performance of the calibration methods was evaluated on the basis of the predictive ability of the models.

2 - Theory

2.1 – Calibration techniques

In the following, a short description of the applied calibration methods is given, essentially to explain the notation used. More details about the reported methods can be found in Ref. [3] and in the references mentioned for each method.

2.1.1 - Full spectrum latent variables methods

Principal Component Regression (PCR)

The original data matrix X(n,m) is converted by a linear transformation into a set of orthogonal latent variables, denoted T(n,a) and called Principal Components (PCs); n is the number of objects and a is the model complexity. The PCR model relates the response factor y to the scores T:

$y = \sum_{i=1}^{a} b_i T_i + e$     (1)

where $b_i$ is the ith coefficient, $T_i$ the ith score vector, and e is the error vector. To estimate the model complexity, Leave-One-Out (LOO) Cross Validation (CV) was applied.
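To make Eq. (1) concrete, the following minimal sketch (illustrative code, not the implementation used in the study) computes the scores T from the singular value decomposition of the column-centred data matrix and regresses y on the first a score vectors; the toy data at the end are purely hypothetical.

```python
import numpy as np

def pcr_fit_predict(X_cal, y_cal, X_new, a):
    """Principal Component Regression, Eq. (1): y = sum_i b_i * T_i + e."""
    # Column-centre the calibration spectra and the response
    x_mean = X_cal.mean(axis=0)
    y_mean = y_cal.mean()
    Xc = X_cal - x_mean
    # Loadings from the SVD of the centred data matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:a].T                      # loadings of the first a PCs
    T = Xc @ P                        # scores T(n, a)
    # Least-squares regression of y on the scores gives the coefficients b_i
    b, *_ = np.linalg.lstsq(T, y_cal - y_mean, rcond=None)
    # Project new spectra onto the same PCs and predict
    T_new = (X_new - x_mean) @ P
    return T_new @ b + y_mean

# Hypothetical toy example: 20 "spectra" of 50 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
y = X[:, :3].sum(axis=1) + 0.05 * rng.normal(size=20)
print(pcr_fit_predict(X, y, X[:2], a=3))
```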
The number of PCs leading to the minimum Root Mean Square Error of Cross Validation (RMSECV) was chosen as the optimal model complexity in a first approximation. This value was validated by means of the randomisation test [4] to reduce the risk of overfitting. Variants of PCR were also used. Principal Component Regression with variable selection (PCRS) is a PCR in which the PCs are selected according to their correlation with y. Non-linear Principal Component Regression (NL-PCR) [5] consists in applying the PCR model to the matrix obtained as the union of the original variables (X) and their squared values (X²).

Partial Least Squares Regression (PLS)

In PLS, the model can be described as:

u = f(t) + d     (2)

where f is a linear function, d is the vector of residuals and u and t are linear combinations of y and X respectively. The coefficients of the linear transformation f can be obtained iteratively by maximising the square of the product (u't) [6]. Spline Partial Least Squares (Spline-PLS) was also applied [7]. In the Spline-PLS version of PLS, the principles of the method are the same but the relationship denoted by f is a spline function (i.e. a piecewise polynomial function) instead of a linear relationship [6]. The model complexity was optimised by means of the LOO-CV procedure followed by the randomisation test.

2.1.2 - Variable selection/elimination methods

The variable selection methods used in this study are Stepwise selection (Step) and the Genetic Algorithm (GA), applied in both the original and the Fourier domain (GA-FT). Multiple Linear Regression (MLR) is applied to the selected variables. The variable elimination methods are essentially based on the Uninformative Variable Elimination (UVE) algorithm (UVE-PLS) and the Relevant Component Extraction (RCE) PLS method.

Multiple Linear Regression (MLR)

The MLR model is given by:

y = bP + e     (3)

where b is the vector of the regression coefficients, P is the matrix of the selected variables in the original or in the transformed domain, and e is the error vector. The randomisation test was applied in Stepwise selection to optimise the model.

Genetic Algorithm [8-9]

The first parameter to choose in GA is the maximum number of variables to be entered in the model. The algorithm starts by randomly building a subset of solutions having a number of variables smaller than or equal to the given maximum. The possible solutions are selected depending on the fitness of the obtained model, evaluated on the basis of LOO-RMSECV. The input parameters of the hybrid GA [10] applied were the following:
• Number of chromosomes in the population: 20
• Probability of cross-over: 50%
• Probability of mutation: 1%
• Stopping criterion: 200 evaluations
• Frequency of the background backward selection: 2 per cycle of evaluations
• Response to be maximised: (RMSEP)⁻¹
A threshold value equal to the PLS RMSECV increased by 10% was introduced for the RMSEP, which means that only solutions with an RMSEP lower than this value were considered as acceptable solutions. The maximum number of variables allowed in the strings was set equal to the complexity of the optimal PLS model increased by two. The same parameters were used in the original and in the Power Spectra (PS) domain to find the optimal solution. In the original domain, all the variables were entered in the selection procedure for initial random selection.
In the PS domain, only the first 50 coefficients were selected as input to the GA [11].

Uninformative Variable Elimination-PLS [12]

In PLS methods the calibration model is described by Eq. (2). In the linear case the relationship between the X scores (i.e. t) and the y scores (i.e. u) can be described by:

u = bt + d     (5)

where b is the vector of coefficients. UVE-PLS aims at improving the predictive ability of the final model by removing from the X matrix the information not related to y. The criterion used to identify the uninformative variables is the stability of the PLS regression coefficient b. The input parameters used were:
• cut-off level: 99%
• number of random variables: 200
• scaling constant: 10⁻¹⁰

Relevant Component Extraction [13]

RCE is a modification of the UVE algorithm that operates in the wavelet domain. The spectra are decomposed to the last decomposition level using the Discrete Wavelet Transform, with the optimal filter selected from the Daubechies family. An algorithm is applied to separate the coefficients related to the signal from those related to the noise. The PLS model is built using only the selected wavelet coefficients.

2.1.3 - Local methods

The methods described are Locally Weighted Regression-PLS (LWR-PLS) and Radial Basis Function-PLS (RBF-PLS).

Locally Weighted Regression-PLS [14]

For each new object, a PLS model is built by considering only the objects of the calibration set that are similar to the selected one. The similarity is measured on the basis of the Euclidean distance calculated in the original measurement space [15]. The contribution of each similar object to the model is weighted using its distance from the selected object. The optimisation of the model complexity and of the number of similar objects is performed by means of LOO-CV.

Radial Basis Function-PLS [16]

RBF-PLS is a global method, which means that one model is valid for all the objects of the data set. The local property comes from the transformation of the original X matrix. In fact, PLS is applied to the y response factor and the A activation matrix. The activation matrix represents a non-linear distance matrix of X. The non-linearity is due to the exponential relationship (i.e. a Gaussian function) between the elements of the activation matrix and the Euclidean distance between pairs of points. The parameters to be optimised are the width of the Gaussian function and the complexity of the PLS model. The optimisation procedure requires the calibration set to be split into a training and a monitoring set (see data splitting section).

2.1.4 - Neural Network (NN)

The X data matrix is compressed by means of a PC transformation. The most relevant PCs, selected on the basis of explained variance, are used as input to the NN. The number of hidden layers was set to 1. The transfer function used in the hidden layer was non-linear (i.e. hyperbolic tangent). Both linear and non-linear transfer functions were used in the output layer. The weights were optimised by means of the Levenberg-Marquardt algorithm [17]. A method based on the contribution of each node was applied to find the best number of nodes to be used in the input and hidden layers [18]. The optimisation procedure of the NN also requires the calibration set to be split into a training and a monitoring set (see data splitting section).
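Before moving on, the local modelling approach of Section 2.1.3 can be illustrated with the sketch below (hypothetical code, not the authors' implementation): a separate PLS model is built for each new object from its nearest calibration neighbours in the Euclidean sense, using scikit-learn's PLSRegression. For simplicity the neighbours receive uniform weights, whereas the study weights them by distance, and the number of neighbours and the model complexity, optimised by LOO-CV in the study, are fixed here.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def lwr_pls_predict(X_cal, y_cal, X_new, n_neighbours=20, n_lv=3):
    """Locally weighted regression with PLS: one local model per new object.
    Simplification: neighbours receive uniform weights."""
    y_pred = np.empty(len(X_new))
    for i, x in enumerate(X_new):
        # Euclidean distances in the original measurement space
        d = np.linalg.norm(X_cal - x, axis=1)
        idx = np.argsort(d)[:n_neighbours]        # closest calibration objects
        local_pls = PLSRegression(n_components=n_lv)
        local_pls.fit(X_cal[idx], y_cal[idx])
        y_pred[i] = np.ravel(local_pls.predict(x.reshape(1, -1)))[0]
    return y_pred

# Hypothetical usage with random data standing in for NIR spectra
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 100))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
print(lwr_pls_predict(X, y, X[:3]))
```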
2.2 – Prediction performance within domain

The first part of this study [3], in which prediction was performed within the calibration domain, led to several general conclusions. It was shown that Stepwise MLR can lead to very good results with very simple models for linear cases, sometimes outperforming the full spectrum methods. PCR, when performed with variable selection, always gave results comparable to PLS, sometimes with slightly higher complexities. In case of non-linearity, non-linear modifications of PCR and PLS were always outperformed by Neural Networks or LWR. This last method appeared as a generally good performer, as its results were always at least as good as those of PCR/PLS. Another approach found to be interesting was UVE. This method made it possible to improve the prediction precision and could be used as a diagnostic tool to see to what extent the variables included in X were useful for the prediction.

2.3 – Calibration and prediction sets

As previously mentioned, the aim of the present work is to evaluate the performance of calibration methods under mild extrapolation conditions, i.e. in the presence of extreme samples in the prediction subset. The data sets were therefore split into two subsets: the calibration set, for the modelling part (including optimisation), and the prediction (or test) set, for the evaluation of the predictive ability of the model.

2.3.1 - Data Splitting

The calibration set should contain an appropriate number of objects in order to describe accurately the relation between X and y. Two thirds of the total number of objects were included in the calibration set and the remaining third was selected to constitute the prediction set. For each data set a certain number of different prediction sets were considered (i.e. 3 to 4 in X-space extrapolation and 2 in y-space extrapolation). The predictive ability of the calibration model was computed for each prediction set and for the combined prediction set.

X-space extrapolation

For homogeneous data sets the whole data set was considered, whereas for clustered data the extreme samples were selected from each cluster of the data. The inhomogeneous data sets were therefore divided into clusters on the basis of a PC score plot. Starting from the obtained clusters, various algorithms can be applied to select the extreme samples, and the distribution of the selected samples will depend on the characteristics of the splitting algorithm used. The prediction subset samples had to be selected so that they contain some extreme samples and span the range of variation. The Kennard and Stone algorithm [19] was used for this purpose on the PCA scores. This algorithm is appropriate as it starts by selecting extreme objects. Four different prediction subsets were built for all the data sets, except in one case where this number was reduced to three because of a lower total number of objects. The number of prediction samples selected from each cluster was chosen to be proportional to the ratio between the number of objects the cluster contains and the total number of objects present in the data set. The Kennard and Stone algorithm was applied in the Euclidean distance space, starting from the object furthest from the mean value. After a first prediction subset was created, the corresponding objects were removed from the data set, and the selection procedure was iterated on the remaining samples to obtain the second prediction subset, etc.
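The Kennard and Stone selection used here can be sketched as follows (an illustrative implementation under the assumptions of the text, not the exact code used in the study): the first object selected is the one furthest from the mean of the PCA scores, and each subsequent object is the one whose smallest Euclidean distance to the objects already selected is largest.

```python
import numpy as np

def kennard_stone(scores, n_select):
    """Kennard-Stone selection on PCA scores (Euclidean distances)."""
    remaining = list(range(len(scores)))
    # Start from the object furthest from the mean value
    first = int(np.argmax(np.linalg.norm(scores - scores.mean(axis=0), axis=1)))
    selected = [first]
    remaining.remove(first)
    while len(selected) < n_select:
        # Distance of every remaining object to its closest selected object
        d = np.min(
            np.linalg.norm(
                scores[remaining][:, None, :] - scores[selected][None, :, :], axis=2
            ),
            axis=1,
        )
        nxt = remaining[int(np.argmax(d))]   # maximise the minimal distance
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Hypothetical example: pick 5 extreme objects out of 30 in a 2-D score space
rng = np.random.default_rng(2)
T = rng.normal(size=(30, 2))
print(kennard_stone(T, 5))
```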
As a consequence of the applied splitting procedure, the degree of extrapolation decreases as the number of test subsets increases. This procedure was applied to each cluster of the data set, and the corresponding prediction subsets were merged to yield the global prediction subsets for the whole data set.

y-value extrapolation

In this case the data sets were not divided into clusters, and 2 test subsets were selected for each data set. The objects were sorted in ascending order of y value. The first 1/6 of the total number of objects, with the lowest y values, constituted test subset 1, and the last 1/6 of the total number of objects, with the largest y values, constituted test subset 2. The remaining 2/3 of the objects were kept in the calibration set. The test subset obtained as the union of the two test subsets was also used to verify the performance of the models.

2.3.2 - Optimisation of the calibration model

Two different approaches were applied to optimise the parameters of the model, namely cross-validation and prediction testing. The latter was used to optimise the NN topology and the width of the Gaussian function in RBF-PLS. It consists in dividing the calibration set into training and monitoring sets. When applying NN or RBF methods, several models are built with different parameter values. The optimal model parameters are considered to be those that lead to the best predictive ability when the models are applied to the monitoring set. The splitting of the calibration set into training and monitoring sets was achieved by applying the Duplex algorithm [20]. For all the other methods, internal validation (namely LOO-CV) was used to optimise the model. The squared prediction residual value for object i is given by:

$e_i^2 = (\hat{y}_i - y_i)^2$     (12)

The procedure is repeated for each object of the calibration set, and the prediction error sum of squares (PRESS) can then be calculated as:

$\mathrm{PRESS} = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} e_i^2$     (13)

The Root Mean Square Error of Cross Validation (RMSECV) is defined as the square root of the mean value of PRESS:

$\mathrm{RMSECV} = \sqrt{\dfrac{\mathrm{PRESS}}{n}} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$     (14)

The RMSECV values obtained for different values of the model parameters, for instance the number of components in a PLS model, are compared in a statistical way, by means of the randomisation test, with the RMSECV of the model showing the lowest value. A model with a higher RMSECV but a lower complexity can be retained if its RMSECV is not significantly different from the lowest one.

2.3.3 - Predictive ability of the model

The predictive ability of the optimal model is calculated as a Root Mean Square Error of Prediction (RMSEP) on the test subset:

$\mathrm{RMSEP} = \sqrt{\dfrac{\sum_{i=1}^{n_t} (\hat{y}_i - y_i)^2}{n_t}}$     (15)

where $n_t$ is the number of samples in the test subset. The randomisation test [4] was used to test the prediction results obtained by the same method at various complexities for significant differences. The aim was then to optimise the complexity of the model. Once the models had been optimised, the randomisation test could have been used to test the results obtained with different methods for significant differences. This would have made it possible to determine whether a method performs significantly better than another. Another interesting approach based on two-way analysis of variance, called CVANOVA [21], could also have been used for this purpose.
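As a brief aside, equations (12)–(15) translate directly into code. The sketch below uses hypothetical helper functions and assumes a generic scikit-learn-style estimator (here PLSRegression) to compute PRESS and the RMSECV by leave-one-out cross-validation, and the RMSEP on an external test subset.

```python
import numpy as np
from sklearn.base import clone
from sklearn.cross_decomposition import PLSRegression

def rmsecv_loo(model, X, y):
    """Leave-one-out cross-validation: PRESS (Eq. 13) and RMSECV (Eq. 14)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        m = clone(model).fit(X[mask], y[mask])
        y_hat = float(np.ravel(m.predict(X[i:i + 1]))[0])
        press += (y_hat - y[i]) ** 2          # squared residual e_i^2 (Eq. 12)
    return np.sqrt(press / n)

def rmsep(model, X_cal, y_cal, X_test, y_test):
    """Root mean square error of prediction on the test subset (Eq. 15)."""
    m = clone(model).fit(X_cal, y_cal)
    y_hat = np.ravel(m.predict(X_test))
    return np.sqrt(np.mean((y_hat - y_test) ** 2))

# Hypothetical usage with a PLS model and random stand-in data
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 80))
y = X[:, 0] - 0.5 * X[:, 1] + 0.05 * rng.normal(size=40)
model = PLSRegression(n_components=2)
print(rmsecv_loo(model, X[:30], y[:30]), rmsep(model, X[:30], y[:30], X[30:], y[30:]))
```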
However, statistical significance testing is needed only to compare relatively similar results. It was known from the previous comparative study performed on the same data [3] that very important differences could be expected from one method to another. Moreover, when the differences in prediction results between two methods are so small that significance testing is needed to come to a conclusion, in practice other criteria come into play for the selection of the best method to be used. For instance, the simplest or most easily interpretable method will then usually be preferred. Small differences between prediction results obtained with different methods were therefore not investigated for significance. 3 - Experimental Five data sets were studied. Except for WHEAT, the data sets were provided by industry. In the following and in Table 1, a brief description of the data is given. Table 1. Description of the five experimental data sets. data set linearity/non-linearity clustering WHEAT linear minor (2 clusters on PC3) POLYOL linear strong (2 clusters on PC1) POLYMER strongly non-linear strong (4 clusters on PC1) GASOLINE slightly non-linear strong (3 clusters on PC2) DIESEL OIL strongly non-linear Inhomogeneous data 50 Chapter 2 – Comparison of Multivariate Calibration Methods 3.1 – WHEAT data The data set was proposed by Kalivas [22] as a reference calibration data set. It contains 100 NIR spectra of wheat samples measured in diffuse reflectance between 1100 and 2500 nm, sampled each 2 nm. The amount of protein and the moisture content are the measured response factors, but only the latter was considered in the present study because of the poor precision of the reference method in the protein determination. The data were pre-treated by offset correction in order to remove the parallel shifts between spectra. One outlying object [3] was removed. The PC1/PC3 plot of the remaining 99 samples is plotted in figure 1. Two clusters can clearly be seen on the third PC. The clusters differ from each other on the y values, as one of them contains all the samples with a low y value and the other those with a high y value. Fig. 1. Wheat data set : PC1 - PC3 score plot. The numbers 1 to 4 refer to the prediction set for X space extrapolation to which the objects belong. In the X extrapolation study, 4 prediction subsets, each of them co ntaining 10 samples (see also Fig. 1), and a calibration set of 59 objects, were obtained. When necessary the latter was divided in a monitoring and a training set of 19 and 40 samples respectively. In y extrapolation, two test subsets of 20 elements were considered. 51 New Trends in Multivariate Analysis and Calibration 3.2 – POLYOL data The data set consists of NIR spectra of polyether polyols, recorded from 1100 to 2158 nm with 2 nm sampling step. The measurements were recorded by means of a NIRSystems Inc., Silver Spring, MD. The response factor is the hydroxyl number of the polyols. The baseline shift was removed by offset correction, and the first and last 15 wavelengths were not considered. Three objects were identified as outliers in a previous study [23] and eliminated resulting in a data set with 84 samples. At least two clusters were identified in the data set on the first PC (Fig. 2) and this was taken into account in defining 4 prediction subsets of 8 samples each in the X extrapolation study, and 2 sets of 16 objects in the y-space extrapolation. The other 52 objects constituted the calibration set. 
When required the calibration set was split into a training set of 35 samples and a monitoring set of 17 samples. Fig. 2. Polyol data set: PC1 - PC2 score plot. The numbers 1 to 4 refer to the prediction set for X space extrapolation to which the objects belong. 3.3 – POLYMER data The data set was obtained by recording the NIR spectra of a polymer in the range from 1100 to 2498 nm at regular intervals of 2 nm. The response factor is the mineral compound content, and it is known from a previous study that the data set is non- linear. It has also been shown that a non-constant baseline 52 Chapter 2 – Comparison of Multivariate Calibration Methods shift is present [13]. Applying the Standard Normal Variate transformation (SNV) solved this problem [24]. The presence of 4 clusters in this data set can be observed on the first PC (Fig. 3). Fig. 3. Polymer data set: PC1 - PC2 score plot. The numbers 1 to 3 refer to the prediction set for X space extrapolation to which the objects belong. The initial set of 54 samples was divided into 3 prediction subsets of 6 samples for the X extrapolation study and in 2 prediction test subsets of 9 objects for the y-space extrapolation study. The calibration set was made of 36 samples. For external model validation methods, the calibration set was split in training set (24 samples) and monitoring set (12 samples). 3.4 – GASOLINE data The data set was obtained by recording the NIR spectra of gasoline compounds in the range 800-1080 nm (step 0.5), the aim is to model the octane number (y values). A preliminary analysis of the data indicated the presence of baseline shift and drift. Using the first derivative of the spectra [3] reduced the effects of those instrumental components. It was also shown that the data contains three clusters related to y and visible on the PC1–PC2 plot (Fig. 4), and that there is a slight non- linearity in the relationship between X and y. 53 New Trends in Multivariate Analysis and Calibration Fig. 4. Gasoline data set: PC1 - PC2 score plot. The numbers 1 to 4 refer to the test set for X space extrapolation to which the objects belong. Four subsets of 11 samples or 2 subsets of 22 samples were chosen to test the methods, and the remaining 88 samples (out of 132) were used as calibration set. When necessary the calibration set was divided in a training set of 62 objects and a monitoring set of 26 objects. 3.5 – DIESEL OIL data The data set consists of NIR spectra of different diesel oils obtained in the range from 833 to 2501 nm (4150 data points). The y value to predict was the viscosity of the oil. The recorded NIR range was reduced to 1587-2096 nm by removing the second and third overtones from the spectra, resulting in spectra of 795 points. The baseline component of the spectra was then removed by subtracting a linear background contribution defined using the first and the last points of the considered range. The spectra of 108 samples were recorded. Two of them were duplicate samples, the responses of these objects were therefore averaged. Two objects affected by the presence in the sample of heavier petroleum constituents, and therefore identified as outliers, were removed from the data set. The data set was in this way reduced to 104 spectra. A preliminary analysis of the data showed a strongly non- linear relationship. Moreover, zones of unequal dens ity are present in the data set, as shown in figure 5. 54 Chapter 2 – Comparison of Multivariate Calibration Methods Fig. 5. Diesel oil data set: PC1 - PC2 score plot. 
The numbers 1 to 4 refer to the test set for X space extrapolation to which the objects belong. Four prediction subsets of 9 objects or 2 subsets of 18 objects were obtained to quantify the predictive ability of the models in the different extrapolation approaches. The calibration set containing 68 samples was when necessary split in a training set of 48 objects and a monitoring set of 20 objects. 4 – Results and discussion 4.1 – WHEAT data It was shown in a previous study [3] that for this data set, the relationship between X and y is linear, and that most of the X variables are informative in building the calibration models. The prediction subsets used in the X extrapolation study are reported in figure 1. In most of the considered methods the RMSEP obtained for the prediction subsets is statistically equal to the RMSECV. This is true also for prediction subset 1, which contains the samples furthest from the cluster centroids. It seems that samples with extreme X values are not extreme for the models. Because of the independence of the RMSEP from the X values, it seems that the most important contribution to the RMSEP is related to the imprecisio n of the y values. While comparing the performance of the calibration models in X extrapolation (Table 2), we can see that most of the tested calibration methods give similar results in 55 New Trends in Multivariate Analysis and Calibration terms of RMSEP. One expects the linear methods to yield the best results on this data set, and this is indeed the case, especially for MLR. Table 2. Wheat data set, X-space extrapolation, RMSEP values. Method test 1 test 2 test 3 test 4 test 1+2+3+4 Complexity CV PLS 0.231 0.227 0.272 0.214 0.237 3 factors 0.228 PCR 0.246 0.252 0.249 0.218 0.241 3 components 0.241 PCRS 0.246 0.252 0.249 0.218 0.241 Selected PCs : 1-3 0.241 Step MLR 0.230 0.177 0.319 0.244 0.248 Selected Variables : 428 603 0.210 GA 0.256 0.284 0.253 0.222 0.254 Selected Variables : 424 435 488 0.195 FT GA 0.264 0.220 0.390 0.280 0.295 Selected FT coeff. : 3 5 7 11 17 22 0.256 UVE PCR 0.240 0.239 0.251 0.208 0.235 3 components 0.233 UVE PLS 0.229 0.223 0.277 0.213 0.237 3 factors 0.225 RCE PLS 0.256 0.289 0.567 0.368 0.389 3 factors, 74 wavelet coef. 0.268 NL PCR 0.296 0.266 0.342 0.278 0.297 4 components 0.268 spline PLS 1.152 0.776 1.016 0.769 0.943 3 factors, 1st degree, 1 knot 0.387 LWR 0.231 0.227 0.272 0.214 0.237 3 factors, using all objects 0.228 RBF PLS 0.208 0.194 0.331 0.354 0.281 4 factors, Gauss. funct. width : 0.01 0.271 NN 0.819 0.273 0.332 0.222 0.503 0.276 0.263 0.223 0.525 0.250 Selected PCs : 1-3, 2 hidden nodes Selected PCs : 1-3, 1 hidden node 0.167 0.187 The reasons for the better performance of MLR methods within the calibration domain are given in Ref [4]. The moisture content determination is actually close to a univariate calibration problem, treating it in a multivariate way has a bad influence on the quality of the prediction. The percentage of variables considered as relevant in the UVE models (i.e PCR and PLS) is larger than 70%. This explains why comparable results are obtained for methods based on variable elimination (UVE) and the equivalent 56 Chapter 2 – Comparison of Multivariate Calibration Methods full spectrum methods. LWR-PLS and PLS lead to equivalent results. All the calibration samples were used to construct the model for LWR, in this case, the model becomes global and equivalent to a PLS model. This confirms the linearity of the data. 
Any non-linearity would have implied the use of a smaller number of samples to build the local linear function approximations. The non- linear methods did not improve the prediction of new samples compared to the linear ones (MLR, PCR, PLS), and the non-linear extension of the latent variables methods, especially Spline-PLS, gave the worst results. The results of the NN model yielding the smallest RMSECV values (i.e. two hidden nodes), and the results of the optimised model (i.e. one hidden node) are reported. It can be seen that only the optimised model gives good results in prediction. Generally, flexible methods such as NN and Spline-PLS can yield large errors in extrapolation because they tend to overfit the calibration data. All the features of the calibration set are then taken into account, so that the differences with the extrapolation test set are enhanced. After optimisation of the NN model, the RMSECV obtained using the topology with the smallest number of hidden nodes was less than 10% larger than the RMSECV obtained with the more complex topology. The simple topology was therefore used. More reliable results are obtained by using this procedure. The results obtained for the y extreme objects are similar to those reported above (Table 3), most of the methods yield comparable RMSEP. The worst performance can be observed in the case of Spline-PLS, and the best with the MLR variable selection methods, especially stepwise. 57 New Trends in Multivariate Analysis and Calibration Table 3. Wheat data set, y-space extrapolation, RMSEP values. Method test 1 test 2 test 1+2 Complexity CV PLS 0.262 0.541 0.425 3 factors 0.148 PCR 0.264 0.553 0.434 3 components 0.149 PCRS 0.264 0.553 0.434 Selected PCs : 1-3 0.149 Step MLR 0.240 0.408 0.334 Selected Variables : 444 532 0.148 GA 0.266 0.494 0.397 Selected Variables : 46 155 302 445 525 0.148 FT GA 0.281 0.538 0.429 Selected FT coeff. : 2 6 10 17 25 0.151 UVE PCR 0.265 0.554 0.434 3 components 0.148 UVE PLS 0.264 0.548 0.431 3 factors 0.118 RCE PLS 0.277 0.535 0.426 4 factors, 89 wavelet coef. 0.149 NL PCR 0.278 0.552 0.437 5 components 0.163 spline PLS 0.619 0.602 0.610 3 factors, 1st degree, 1 knot 0.270 LWR 0.270 0.541 0.427 3 factors, using all objects 0.148 RBF PLS 0.313 0.549 0.447 6 factors, Gauss. funct. width : 0.11 0.184 NN 0.276 0.564 0.444 Selected PCs : 1-3, 1 hidden node 0.117 4.2 – POLYOL data When examining this data set within the calibration domain [3], a strong clustering tendency and a linear relation between X and y were observed. The y values are not responsible for the clustering. The predictive ability of the models investigated within the domain was shown to be similar. This is no longer the case when X-space extrapolation is considered. In extrapolation, the test subset samples are selected on the edges of the clusters (Fig. 2). Methods based on MLR with variable selection now yield the worst RMSEP results, although they yield the lowest cross validation error (Table 4). 58 Chapter 2 – Comparison of Multivariate Calibration Methods Table 4. Polyol data set, X-space extrapolation, RMSEP values. 
Method test 1 test 2 test 3 test 4 test 1+2+3+4 Complexity CV PLS 4.789 5.503 5.247 3.686 4.856 6 factors 1.294 PCR 4.916 4.888 5.034 3.103 4.556 6 components 1.818 PCRS 4.293 3.735 4.214 2.554 3.764 Selected PCs : 1-3 6 1.537 Step MLR 8.512 8.478 6.297 5.753 7.368 GA 7.257 6.721 6.390 3.790 6.186 FT GA 6.223 6.363 5.568 3.688 5.564 UVE PCR 5.993 6.523 6.048 4.281 5.774 6 components 1.354 UVE PLS 5.265 5.641 4.721 3.835 4.913 6 factors 1.156 RCE PLS 6.064 6.481 6.475 4.749 5.984 5 factors, 121 wavelet coef. 1.347 NL PCR 6.031 6.443 6.394 4.071 5.817 8 components 1.868 spline PLS 7.219 8.380 8.830 6.525 7.793 6 factors, 1st degree, 1 knot 2.260 LWR 4.781 5.113 6.336 4.896 5.318 6 factors, using 22 objects 1.234 RBF PLS 6.675 6.577 6.294 4.469 6.070 NN 10.092 8.529 7.111 4.843 7.884 Selected Variables : 450 356 146 293 31 380 Selected Variables : 156 190 417 461 495 Selected FT coeff. : 2 3 4 6 9 13 18 22 25 7 factors, Gauss. funct. width : 0.05 Selected PCs : 1-3 5 6 9, 3 hidden nodes 1.049 0.950 1.318 1.187 1.097 The latter is consistent with the very good prediction performance of MLR within the experimental domain observed in [3]. The best results are obtained by applying global methods. In particular PCRS seems to perform well. It is more parsimonious than PCR and PLS. A slightly lower prediction error is obtained with the variable reduction methods (UVE-PLS and PCR) than with the full variables ones (PLS and PCR) within the calibration domain. Opposite results are obtained for the predictive ability of the extrapolated samples. LWR does not lead to improvement in prediction compared to PLS. In LWR the number of calibration samples used to build the local model is approximately equal to the number 59 New Trends in Multivariate Analysis and Calibration of samples in each of the two main clusters. The Euclidian distance used to select the nearest neighbours is mainly related to the information present in the first PC that takes into account the clustering. The Euclidian distance is less related to higher order PCs that are more related to y. Therefore, little or no improvement in y prediction is obtained by splitting the data set in clusters. As expected for these linear data the non-linear methods do not improve the predictive ability, and SplinePLS and NN show very poor prediction of the data outside the calibration domain. In analysing the y extreme samples, one can see (Table 5) that most of the methods show the same performance as discussed for X-space extrapolation. Table 5. Polyol data set, y-space extrapolation, RMSEP values. Method test 1 test 2 test 1+2 Complexity PLS 3.318 5.843 4.751 6 factors 1.336 PCR 5.008 5.336 5.174 6 components 1.726 PCRS 4.921 4.197 4.573 Selected PCs : 1 2 5 6 1.447 Step MLR 3.759 10.73 8.039 GA 2.440 7.680 5.698 FT GA 3.659 5.871 4.891 UVE PCR 3.324 6.857 5.388 6 components 1.368 UVE PLS 4.578 5.694 5.166 5 factors 1.716 RCE PLS 2.344 7.988 5.887 7 factors, 74 wavelet coef. 0.921 NL PCR 3.300 5.863 4.757 10 components 1.391 spline PLS 7.530 15.932 12.460 5 factors, 1st degree, 1 knot 3.847 LWR 3.318 5.843 4.751 6 factors, using 26 objects 1.336 RBF PLS 4.054 5.765 4.983 1.721 NN 3.260 6.715 5.278 7 factors, Gauss. funct. width : 0.09 Selected PCs : 1-6 9 10, 3 hidden nodes Selected Variables : 450 356 146 293 31 380 Selected Variables : 100 165 200 332 422 436 Selected FT coeff. 
: 3 4 13 15 18 19 23 60 CV 1.225 0.851 0.927 0.868 Chapter 2 – Comparison of Multivariate Calibration Methods 4.3 – POLYMER data In all the considered extrapolated spaces, methods based on MLR with variable selection, especially stepwise MLR, yield the worst performances both within the calibration domain (RMSECV) and in extrapolation conditions (i.e. RMSEP). The RMSEP values reported in Table 6 show that most of the non-linear and local methods logically outperform the linear ones for this non- linear data set. Table 6. POLYMER data set, X-space extrapolation, RMSEP values. Method test 1 test 2 test 3 test 1+2+3 Complexity CV PLS 0.079 0.087 0.047 0.073 6 factors 0.044 PCR 0.093 0.086 0.063 0.081 9 components 0.059 PCRS 0.081 0.085 0.043 0.072 Selected PCs : 1-5 7 8 0.043 Step MLR 0.112 0.112 0.068 0.100 Selected Variables : 458 38 64 0.062 GA 0.058 0.078 0.040 0.061 FT GA 0.110 0.086 0.046 0.085 UVE PCR 0.080 0.084 0.042 0.071 8 components 0.045 UVE PLS 0.083 0.092 0.051 0.077 5 factors 0.041 RCE PLS 0.093 0.085 0.043 0.077 8 factors, 128 wavelet coef. 0.051 NL PCR 0.079 0.081 0.0488 0.071 7 components 0.040 spline PLS 0.076 0.082 0.035 0.068 4 factors, 1st degree, 2 knots 0.036 LWR 0.044 0.012 0.016 0.028 1 factor, using 5 objects 0.013 RBF PLS 0.093 0.069 0.029 0.069 8 factors, Gauss. funct. width : 0.19 0.014 NN 0.051 0.019 0.016 0.033 Selected PC : 1-3, 3 hidden nodes 0.017 61 Selected Variables : 133 239 412 515 671 Selected FT coeff. : 15 23 25 26 30 0.031 0.039 New Trends in Multivariate Analysis and Calibration The difference in performance is larger for NN and LWR than for non- linear modifications of PLS and PCR. In X-space extrapolation, Spline-PLS gives slightly better result than PLS and NL-PCR fits the test subsets better than PCR. However, the use of NL-PCR does not lead to a better predictive ability compared to PCRS. In the previous study, in which only test subsets within the calibration domain were considered, the largest differences were found between the local non-linear methods and all the others. The good performance of the LWR method in extrapolation is due to its local properties. The variable reduction methods (UVE-PLS, UVE-PCR) do not yield better results, and in some cases as for RCE-PLS the results are worse. Quite similar results are also obtained in y extrapolation conditions (Tables 7). Table 7. POLYMER data set, y-space extrapolation, RMSEP values. Method test 1 test 2 test 1+2 Complexity CV PLS 0.062 0.078 0.070 5 factors 0.050 PCR 0.069 0.072 0.070 7 components 0.048 PCRS 0.069 0.072 0.070 Selected PCs : 1-7 0.048 Step MLR 0.131 0.067 0.104 Selected Variables : 458 487 0.082 GA 0.144 0.084 0.118 FT GA 0.096 0.088 0.092 UVE PCR 0.066 0.081 0.074 8 components 0.048 UVE PLS 0.053 0.080 0.068 5 factors 0.053 RCE PLS 0.107 0.093 0.100 5 factors, 126 wavelet coef. 0.0514 NL PCR 0.054 0.073 0.064 7 components 0.047 spline PLS 0.033 0.068 0.053 2 factors, 1st degree, 1 knot 0.032 Selected Variables : 125 176 225 289 469 511 669 Selected FT coeff. : 3 8 9 14 17 22 24 26 31 62 0.042 0.050 Chapter 2 – Comparison of Multivariate Calibration Methods 4.4 – GASOLINE data The response factor is the octane number, which is generally determined with poor precision with the reference method. It should be remembered that the RMSEP’s are also influenced by the precision of the reference method. Therefore, it can be difficult to see differences in the performance of the multivariate calibration methods. 
A previous study [3] indicated that the data set is slightly non linear and clustered. One can see in Table 8 that the results in extrapolation of all the methods are very similar. Table 8. GASOLINE data set, X-space extrapolation, RMSEP values. test 1 test 2 test 3 test 4 test 1+2+3+4 PLS 0.291 0.248 0.196 0.177 0.233 9 factors 0.179 PCR 0.337 0.299 0.186 0.160 0.257 14 components 0.183 PCRS 0.291 0.182 0.158 0.162 0.206 Selected PCs : 1-7 10-14 0.178 Step MLR 0.315 0.256 0.210 0.198 0.249 GA 0.254 0.142 0.216 0.178 0.202 FT GA 0.309 0.173 0.165 0.169 0.217 UVE PCR 0.315 0.115 0.170 0.156 0.203 15 components 0.158 UVE PLS 0.308 0.137 0.187 0.163 0.209 9 factors 0.161 RCE PLS 0.262 0.182 0.162 0.163 0.197 9 factors, 51 wavelet coef. 0.162 NL PCR 0.279 0.175 0.171 0.157 0.201 15 components 0.172 spline PLS 0.466 0.209 0.194 0.255 0.301 9 factors, 1st degree, 1 knot 0.185 LWR 0.291 0.278 0.196 0.177 0.241 9 factors, using all objects 0.179 RBF PLS 0.240 0.113 0.154 0.155 0.172 NN 0.239 0.243 0.222 0.186 0.224 Method 63 Complexity Selected Variables : 309 456 550 120 226 358 Selected Variables : 141 266 372 428 485 495 517 535 Selected FT coeff. : 3 5 6 8 10 12 15 22 26 35 20 factors, Gauss. funct. width : 3.2 Selected PCs : 1-3 6-9 12, 6 hidden nodes CV 0.175 0.135 0.171 0.154 0.197 New Trends in Multivariate Analysis and Calibration As was described in Ref. [3], the variable reduction methods improve the prediction results within the calibration domain. However, the RMSEP values show that theses methods do not improve the results in the extrapolated domain. Slightly better results are obtained using RBF -PLS, and the worst prediction is achieved by Spline-PLS. The methods yield similar results also when y-extreme samples are considered. The most remarkable difference is found for the NN results, which are the worst for both of the test subsets (Table 9). Table 9. GASOLINE data set, y-space extrapolation, RMSEP values. Method test 1 test 2 test 1+2 Complexity CV PLS 0.244 0.346 0.299 9 factors 0.184 PCR 0.240 0.374 0.314 14 components 0.178 PCRS 0.256 0.406 0.339 Selected PCs : 1-3 5-8 11 13 14 0.176 Step MLR 0.293 0.234 0.265 GA 0.222 0.367 0.303 FT GA 0.236 0.433 0.349 UVE PCR 0.240 0.364 0.308 13 components 0.183 UVE PLS 0.226 0.292 0.261 9 factors 0.164 RCE PLS 0.252 0.484 0.386 7 factors, 39 wavelet coef. 0.166 NL PCR 0.240 0.374 0.314 13 components 0.178 spline PLS 0.286 0.625 0.486 9 factors, 1st degree, 1 knot 0.187 LWR 0.244 0.346 0.299 9 factors, using all objects 0.182 RBF PLS 0.204 0.299 0.256 NN 0.606 0.979 0.814 Selected Variables : 292 371 239 307 554 378 94 354 Selected Variables : 16 139 266 280 429 454 475 515 Selected FT coeff. : 1 2 6 10 17 25 23 factors, Gauss. funct. width : 2.7 Selected PCs : 1-7 9 10 12 13, 6 hidden nodes 64 0.186 0.171 0.181 0.170 0.086 Chapter 2 – Comparison of Multivariate Calibration Methods 4.5 – DIESEL OIL data This is another example of a clustered and non- linear calibration problem. In such a situation the nonlinear methods should show the best predictive ability. When the RMSECV values are compared (Table 10), which means when the prediction within the domain is investigated, the linear methods, such as MLR with variable selection, yield a better fit than local or non- linear methods. Moreover, there is then no difference between PLS, PCR and their non- linear modifications. Table 10. DIESEL OIL data set, X-space extrapolation, RMSEP values. 
Method test 1 test 2 test 3 test 4 test 1+2+3+4 Complexity CV PLS 1.596 0.476 0.364 0.419 0.878 7 factors 0.351 PCR 1.634 0.658 0.437 0.419 0.931 9 components 0.310 PCRS 1.910 1.083 0.873 0.623 1.223 Selected PCs : 1-3 5 9 0.287 Step MLR 1.517 0.704 0.408 0.894 GA 1.780 0.544 0.569 0.643 1.025 FT GA 1.598 0.721 0.530 0.554 0.957 Selected Variables : 186 270 433 588 Selected Variables : 342 490 651 706 Selected FT coeff. : 2 5 16 17 32 34 37 UVE PCR 1.675 1.66 6 0.872 0.67 9 0.609 0.51 4 0.583 0.44 7 1.034 0.96 2 5 components 9 components 0.432 0.297 UVE PLS 1.713 0.647 0.345 0.377 0.951 7 factors 0.320 RCE PLS 1.633 0.624 0.541 0.498 0.948 6 factors, 52 wavelet coef. 0.306 NL PCR 1.325 0.597 0.626 0.569 0.841 1.354 5 components 9 components 0.534 0 .356 spline PLS 1.384 0.425 0.274 0.344 0.757 7 factors, 1st degree, 1 knot 0.312 LWR 0.633 0.516 0.465 0.481 0.528 3 factors, using 36 objects 0.350 RBF PLS 2.398 0.443 1.277 1.132 1.488 NN 1.239 0.412 0.557 0.518 0.756 0.480 65 16 factors, Gauss. funct. width : 0.16 Selected PCs : 1-8 11, 4 hidden nodes 0.280 0.239 0.317 0.288 0.140 New Trends in Multivariate Analysis and Calibration Table 11. DIESEL OIL data set, y-space extrapolation, RMSEP values. Method test 1 test 2 test 1+2 Complexity PLS 0.329 1.456 1.056 6 factors 0.179 PCR 0.603 1.319 1.026 9 components 0.160 PCRS 0.558 1.463 1.106 Selected PCs : 1-4 6-9 0.169 Step MLR 1.031 1.070 1.051 GA 0.873 1.090 0.987 FT GA 0.895 1.302 1.117 UVE PCR 0.903 1.033 0.970 12 components 0.118 UVE PLS 0.313 1.464 1.059 6 factors 0.163 RCE PLS 1.031 1.189 1.113 7 factors, 28 wavelet coef. 0.116 NL PCR 1.167 1.074 1.121 12 components 0.124 spline PLS 0.296 1.898 1.358 6 factors, 1st degree, 1 knot 0.156 LWR 0.755 1.392 1.120 6 factors,using 19 objects 0.094 RBF PLS 0.863 0.755 0.811 NN 0.451 1.720 1.257 Selected Variables : 470 305 638 205 674 716 516 Selected Variables : 137 217 241 245 246 278 300 413 484 571 Selected FT coeff. : 6 9 12 14 18 23 31 38 23 factors, Gauss. funct. width : 0.33 Selected PCs : 1-4 6-10 12, 4 hidden nodes CV 0.109 0.099 0.127 0.117 0.041 RMSEP values obtained for the X-extrapolation test subsets confirm that non- linear methods outperform the linear ones in this case. It can also be seen that local methods perform well. In fact, LWR outperforms all the other methods. It is the only method still able to correctly predict test set 1. For the other test sets, Spline-PLS does remarkably well. In general the error in prediction of non- linear methods is lower than for linear ones. For instance Spline-PLS and NL-PCR are slightly more efficient than the ir linear counterparts. NN is also suitable for modelling the data. Results concerning the y extrapolation are reported in Table 11. It can be seen that the predictive ability is almost the same for all the considered methods. The reason for this lies in the fact that the calibration 66 Chapter 2 – Comparison of Multivariate Calibration Methods set shows a linear behaviour after removal of the extreme y values. For this reason, the non- linear models cannot be trained in an appropriate way, and do not benefit from their non- linear properties. 5 - Conclusion It should first be noted that the conclusions are different when one investigates prediction in the calibration domain or outside this domain. For instance, MLR is excellent for linear data within the domain but in case of extrapolation, it seems to be less stable as the performance does not always seem to depend on the degree of extrapolation. 
Therefore, one should preferably be sure whether new samples to predict would lie within the calibration domain or not. If not, it seems that one should first try to decide whether the calibration problem is linear or not. In case of linear relationship between the X variables and response values y, linear models should outperform the non- linear ones in prediction of new samples when there is extrapolation in the Xspace. MLR always yields the best results on a linear case inside the calibration domain. However, it is less stable, and therefore performs less well than PCR and PLS in all types of extrapolation. The results obtained using PLS are comparable with the results of PCR, especially if selection of PCs is performed (PCRS). For non- linear calibration problems, the non- linear and local calibration methods yielded the best results. The improvement in prediction is smaller for non- linear modifications of PLS and PCR than for NN, RBF-PLS and LWR. The latter methods are more flexible and can well describe non-polynomial relationships. In particular when data are also clustered, local methods (LWR) outperform all the other methods. Most of the studied calibration methods yield similar results when slightly non-linear data are considered. Among all the studied methods PLS, PCR and LWR should be recommended because of their robustness in this context, by which we mean that the performance is maintained quite constant with the increase of extrapolation level. Investigating the behaviour of the methods in case of instrumental changes and perturbations will be the next step to have a more global knowledge about the comparative robustness of calibration methods. 67 New Trends in Multivariate Analysis and Calibration ACKNOWLEDGMENTS We thank the Fonds voor Wetenschappelijk Onderzoek (FWO), the DWTC, and the Standards Measurement and Testing program of the EU (SMT Programme contract SMT4 -CT95-2031) for financial assistance. R EFERENCES 1) F. Cahn, S. Compton, Appl. Spectrosc., 42 (1988) 865-884. 2) L. Zhang, Y. Liang, J. Jiang, R Yu, K. Fang, Anal. Chim. Acta, 370 (1998) 65-77H. 3) V.Centner, J. Verdu-Andreas, B. Walczak, D. Jouan-Rimbaud, F. Despagne, L. Pasti., R. Poppi, D.L. Massart and O. E. de Noord., Appl. Spectrosc. 54 (2000) 608-623. 4) H. van der Voet, Chemom. Intell. Lab. Syst., 25 (1994) 313-323. 5) J. Verdu-Andres, D. L. Massart, C. Menardo, C. Sterna, Anal. Chim. Acta, 389 (1999) 115-130. 6) I. E. Frank, J. H. Friedman, Technom., 35 (1993) 109-148. 7) Wold, S., Chemom. Intel. lab. syst., 14 (1992) 71-84. 8) C. B. Lucasius, G. Kateman, Chemom. Intel. lab. syst., 25 (1994) 99-145. 9) R. Leardi, J. Chemom., 8 (1994) 65-79. 10) D. Jouan-Rimbaud, R. Leardi, D. L. Massart, O. E. de Noord, Anal. Chem., 67 (1995) 42954301. 11) L. Pasti, D. Jouan-Rimbaud, D. L. Massart, O. E. de Noord, Anal. Chim. Acta, 364 (1998) 253263. 12) V. Centner, D. L. Massart, O. E. de Noord, S. de Jong, B. M. Vandeginste, C. Sterna, Anal. Chem., 68 (1996) 3851-3858. 13) D. Jouan-Rimbaud, R. Poppi, D. L. Massart, O. E. de Noord, Anal. Chem., 69 (1997) 43174323. 14) T. Næs, T. Isaksson, B. R. Kowalski, Anal. Chem., 66 (1994) 249-260. 15) V. Centner, D.L. Massart. Anal. Chem., 70 (1998) 4206-4211. 16) B. Walczak, D. L. Massart, Anal. Chim. Acta, 331 (1996) 177-185. 17) R. Fletcher, Practical Methods of optimization, Wiley, N.Y., 1987. 18) F. Despagne, D.L. Massart, Chemom. Intel. lab. syst., 40 (1998) 145-163. 68 Chapter 2 – Comparison of Multivariate Calibration Methods 19) R.W. Kennard and L.A. 
Stone, Technometrics, 11 (1969) 137-148. 20) R.D. Snee, Technometrics 19 (1977) 415-428. 21) U.G. Indahl, T. Næs, J. Chemometrics, 12 (1998) 261-278. 22) J.H. Kalivas, Chemom. Intel. lab. syst., 37 (1997) 255-259. 23) V. Centner, D. L. Massart, O. E. de Noord, Anal. Chim. Acta, 330 (1996) 1-17. 24) R.J. Barnes, M.S. Dhanoa and S.J. Lister, Appl. Spectrosc., 43 (1989) 772-777. 69 New Trends in Multivariate Analysis and Calibration A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part III : Robustness Against Instrumental Perturbation Conditions. Submitted for publication. F. Estienne , F. Despagne, B. Walczak +, O. E. de Noord1 ,D.L. Massart* ChemoAC, Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium. E-mail: fabi@fabi.vub.ac.be + on leave from : Silesian University Katowice Poland 1 Shell International Chemicals B. V., Shell Research and Technology Centre Amsterdam P. O. Box 38000 1030 BN Amsterdam The Netherlands ABSTRACT This work is part of a more general research aiming at comparing the performance of multivariate calibration methods. In the first and second parts of the study, the performances of multivariate calibration methods were compared in situation of interpolation and extrapolation respectively. This third part of the study deals with robustness of calibration methods in the case whe re spectra corresponding to new samples of which the y value has to be predicted can be affected by instrumental perturbations not accounted for in the calibration set. This type of perturbations can happen due to instrument ageing, replacement of one or several parts of the spectrometer (e.g. the detector), use of a new instrument, or modifications in the measurement conditions, like the displacement of the instrument to a different location. Even though no general rules could be drawn, the variety of data sets and calibration methods used enabled to establish some guidelines for multivariate calibration in this unfavourable case when instrumental perturbation arises. * Corresponding author K EYWORDS : Multivariate calibration, method comparison, instrumental change, extrapolation, nonlinearity, clustering. 70 Chapter 2 – Comparison of Multivariate Calibration Methods 1 – Introduction This study is part of a more general research aiming at comparing the performance of multivariate calibration methods. These methods enable to relate instrumental responses consisting of a set of predictors X to a chemical or physical property of interest y (the response factor). The choice of the most appropriate method is a crucial step in order to obtain a good prediction of the property y of new samples. Methods were compared using sets of industrial Near-Infrared (NIR) data, chosen such that they include difficulties often met in practice, namely data clustering, non- linearity, and presence of irrelevant variables in the set of predictors. The comparative study was performed in three separate steps : • In the first part of the study [1], the performances of multivariate calibration methods were compared in the ideal situation where test samples are within the calibration domain (interpolation). • In the second part of the study [2], the performance of multivariate calibration methods were compared in a situation which can sometimes not be avoided in practice : the case where some test samples fall outside the calibration domain (extrapolation). Extrapolation occurring in the X-space and in the Y-space was considered. 
• This third part of the study deals with the case where spectra corresponding to new samples of which the y value has to be predicted can be affected by instrumental perturbations not accounted for in the calibration set. The robustness of a calibration model is challenged in this situation in which exactly superimposing replicate spectra of a stable standard is impossible. The instrumental perturbations can be due to instrument ageing, replacement of one or several parts of the spectrometer (e.g. the detector), use of a new instrument, or modifications in the measurement conditions, like the displacement of the instrument to a different location. In all cases a degradation of the prediction results must be expected. This third part of the method comparison study aims at evaluating the robustness of the different calibration methods in the presence of such perturbations. 71 New Trends in Multivariate Analysis and Calibration 2 - Experimental 2.1 - Multivariate calibration methods tested Only the methods that performed best in the first and second part of the comparative study [1,2] were retained for this part. The calibration methods used in each part of the comparative study are summarised in Table 1. Table 1. Methods used in the different parts of the comparative study. Part 3 is the current study. Method PCR PCR-sel TLS-PCR TLS-PCR-sel PLS-cv PLS-rand PLS-pert Brown MLR-step GA FT-GA UVE-PCR UVE-PCR-sel UVE-PLS RCE-PLS NL-PCR NL-PCR-sel NL-UVE-PCR NL-UVE-PCR-sel poly-PCR SPL-PLS kNN LWR RBF-PLS FT-NN PC-NN OBS-NN PART 1 X X X X X X X X X X X X X X X X X X X X X X X X 72 PART 2 X X X PART 3 X X X X X X X X X X X X X X X X X X X X X Chapter 2 – Comparison of Multivariate Calibration Methods 2.1.1 - Principal component regression (PCR) In classical PCR (sometimes referred to as top-down PCR) [3], the number A of Principal Components (PC) is optimised by Leave One Out (LOO) Cross Validation (CV). PCs from PC1 to PCA are retained in order of the variance in the original data matrix X they explain. A limitation of this approach is that, in some cases, information related to the property to be predicted y is found in high-order PCs, which account for only a small amount of spectral variance. An alternative version called PCR with best subset selection (PCR-sel) was therefore used. In this method, PCs are selected according to their correlation with the target property y [1]. Model complexity was estimated by LOOCV followed by a randomisation test [4]. This test allows to determine whether models with lower complexity have significantly worse predictive ability and should therefore not be used. 2.1.2 - Partial least squares and its variants Contrarily to PCR, the Latent Variables (LV) in PLS [5,6] are calculated to maximise covariance between X and y. Latent variable selection as performed in PCR is therefore not necessary. The model complexity A in PLS can be determined in several ways. The most classical way is to perform LOOCV and retain the complexity associated with the minimum LOOCV error (PLS-cv). However this approach is rat her conservative since the removal of one sample at a time corresponds to a small statistical perturbation of the calibration set. The complexity of the model chosen is often too high. Use of a randomisation test often allows to reduce the complexity of the selected models (PLS-rand), but in some cases it carries a risk of underfitting, i.e. too few LVs can be retained [7]. 
This is why an alternative validation method for selecting optimal model complexity based on the simulation of instrumental perturbatio ns on a subset of calibration sample spectra (PLS-pert) [7] was developed. This method aims at determining the number of LVs beyond which models are unnecessarily sensitive to instrumental perturbations affecting the spectra. 2.1.3 - Methods based on variable selection/elimination In stepwise Multiple Linear Regression (MLR-step), original variables are selected iteratively according to their correlation with the target property y [8]. For a selected variable xi, a regression 73 New Trends in Multivariate Analysis and Calibration coefficient bi is determined and tested for significance using a t- test at a critical level α ( α = 5% was used in this study). If the coefficient is found to be significant, the variable is retained and another variable xj is selected according to its partial correlation with the residuals obtained from the model built with xi. This procedure is called forward selection. The significance of the two regression coefficients bi and bj associated with the two retained variables is then again tested, and the nonsignificant terms are elim inated from the equation (backward elimination). Forward selection and backward elimination are alternatively repeated until no significant improvement of the model fit can be achieved by including more variables and all regression terms already selected are significant. In order to reduce the risk of overfitting due to retaining too many variables, a procedure based on LOOCV followed by a randomisation test is applied to test different sets of variables for significant differences in prediction. Genetic algorithms (GA) are probabilistic optimisation tools inspired by the “survival of the fittest” principle of Darwinian theory of natural evolution and the mechanisms of natural genetics [9]. They can be used in calibration to select a small subset of original variables to model y using MLR [10,11]. Instead of performing the selection on the set of numerous correlated original variables, one can apply GA to transformed variables such as power spectrum coefficients obtained by Fourier transform (FTGA) [12]. In this case the variable selection is carried out in the frequency domain, from the first fifty power spectrum coefficients only. 2.1.4 - Methods based on uninformative variable elimination The idea behind the uninformative variable elimination PLS (UVE-PLS) method is to reduce significantly the number of original variable before calculating LVs in the final PLS model [13]. This is done by removing original variables that are considered unimportant. One first generates vectors of random variables that are attached to each spectrum in the data set. Then a PLS model is built on the set of artificially augmented spectra, and all variables with regression coefficients not significantly more reliable than the regression coefficients of the dummy variables are truncated. (The reliability of a coefficient is calculated as the ratio of its magnitude to its standard deviation estimated by leave-oneout jackknifing). After reduction of the number of original variables, a new PLS model is built. Model complexities for variable elimination and final modelling are determined by LOOCV. 
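A simplified sketch of this elimination scheme is given below (illustrative code only; the cut-off rule, the scaling of the dummy variables and the parameter values are assumptions in the spirit of the description above, not the exact procedure of Ref. [13]). Random dummy variables are appended to the spectra, the PLS regression coefficients are jackknifed by leave-one-out, and only original variables whose reliability exceeds a high quantile of the dummy-variable reliabilities are retained.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def uve_pls_select(X, y, n_lv=3, n_random=200, cutoff_quantile=0.99, seed=0):
    """Simplified UVE-PLS variable elimination.
    Reliability of a coefficient = |mean(b)| / std(b), estimated by
    leave-one-out jackknifing of the PLS regression coefficients."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Append small random 'dummy' variables to each spectrum
    X_aug = np.hstack([X, 1e-10 * rng.normal(size=(n, n_random))])
    coefs = np.empty((n, X_aug.shape[1]))
    for i in range(n):                      # leave-one-out jackknife of b
        mask = np.arange(n) != i
        pls = PLSRegression(n_components=n_lv).fit(X_aug[mask], y[mask])
        coefs[i] = np.ravel(pls.coef_)
    reliability = np.abs(coefs.mean(axis=0)) / coefs.std(axis=0)
    # Cut-off derived from the reliabilities of the dummy variables
    threshold = np.quantile(reliability[m:], cutoff_quantile)
    return np.where(reliability[:m] > threshold)[0]

# Hypothetical usage: keep only informative wavelengths before the final PLS model
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 50))
y = 2 * X[:, 5] - X[:, 17] + 0.05 * rng.normal(size=30)
print(uve_pls_select(X, y))
```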
2.1.5 - Methods based on local modelling

In locally weighted regression (LWR), a dedicated local model is developed for each new prediction sample [14]. This can be advantageous for data sets that exhibit some clustering, or some non-linearity that can be approximated by local linear fits. For each point to be predicted, a local PLS model is built using the closest calibration points (in terms of Euclidean distance in the X space). In this study, the points were given uniform weights in the local model [15].

The radial basis function PLS method (RBF-PLS) bears some similarities to LWR [16]. The PLS algorithm is applied to the M and y matrices instead of the X and y matrices. M (n × n) is called the activation matrix (with n the number of samples). Its elements are Gaussian functions placed at the positions of the calibration objects. A form of local modelling is thus performed, as in LWR. The PLS algorithm relates the non-linearly transformed distance measures in M to the target property in y. The width of the Gaussian functions and the number of LVs are optimised by prediction testing using a training and a monitoring set, similarly to Neural Networks.

2.1.6 - Methods using Neural Networks (NN)

Back-propagation NNs using PCs as inputs were used in this study (PC-NN). A method based on the contribution of each node was applied to find the best number of nodes in the input and hidden layers [17]. NN models using Fourier transform power spectrum coefficients (FT-NN) were also used. Optimisation of the set of input coefficients was performed on the first 20 coefficients by trial-and-error (the variance propagation approach for sensitivity estimation cannot be applied in this case since the Fourier coefficients are not orthogonal). All NN models had one hidden layer and were trained with the Levenberg-Marquardt algorithm [18]. Hyperbolic tangent and/or linear functions were used in the nodes of the hidden and output layers.
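The local-modelling idea of section 2.1.5 can be sketched as follows. scikit-learn's PLSRegression again stands in for the PLS implementation used in the study, the data are synthetic, and the neighbourhood size and Gaussian width are arbitrary: in the study they were optimised with a monitoring set, which is not shown here.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def lwr_predict(Xcal, ycal, xnew, k=15, n_components=3):
    """Local PLS prediction with uniform weights for the k nearest neighbours."""
    d = np.linalg.norm(Xcal - xnew, axis=1)          # Euclidean distances in the X space
    idx = np.argsort(d)[:k]
    local = PLSRegression(n_components=n_components).fit(Xcal[idx], ycal[idx])
    return float(local.predict(xnew[None, :]).ravel()[0])

def rbf_activation(Xcal, Xnew, width):
    """Gaussian activation matrix: one column per calibration object."""
    d2 = ((Xnew[:, None, :] - Xcal[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

# RBF-PLS: PLS is applied to (M, y) instead of (X, y).
rng = np.random.default_rng(2)
Xcal = rng.normal(size=(50, 30))
ycal = np.sin(Xcal[:, 0]) + 0.05 * rng.normal(size=50)
Xtest = rng.normal(size=(5, 30))
M_cal = rbf_activation(Xcal, Xcal, width=3.0)
rbf_pls = PLSRegression(n_components=4).fit(M_cal, ycal)
y_rbf = rbf_pls.predict(rbf_activation(Xcal, Xtest, width=3.0)).ravel()
y_lwr = [lwr_predict(Xcal, ycal, x) for x in Xtest]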
2.2 - Data sets used

All data sets were described in detail in the first two parts of this comparative study [1,2]. The main characteristics of the data sets used in the comparative study are summarised in Table 2.

Table 2. Main characteristics of the experimental NIR data sets.
Data set      Linearity/non-linearity   Clustering
WHEAT         Linear                    Strong (PC3)
POLYOL        Linear                    Strong (PC1)
GASOLINE 1    Slightly non-linear       Strong (PC2)
POLYMER       Strongly non-linear       Strong (PC1)
GAS OIL 1     Non-linear                Minor (PC1-PC2)

2.2.1 - WHEAT data

This data set was submitted to the Chemometrics and Intelligent Laboratory Systems database of proposed standard reference data sets by Kalivas [19]. It consists of NIR spectra of wheat samples with specified moisture content. Samples were measured in diffuse reflectance from 1100 to 2500 nm (2 nm step) on a Bran & Luebbe instrument. Offset correction was performed on the spectra to eliminate a baseline shift. After offset correction, a PCA revealed a separation into two clusters on PC3. This separation can be linked to the clustering present in the y values. An isolated sample on this PC was detected as an outlier and removed from the data.

2.2.2 - POLYOL data

This data set consists of NIR spectra used for the determination of the hydroxyl number in polyether polyols. Spectra were recorded on a NIR Systems 6250 instrument from 1100 to 2158 nm (2 nm step). An offset correction was applied to eliminate a baseline shift between spectra. The data set contains two clusters due to the presence of a peak at 1690 nm in only some of the spectra [10]. The clustering can be seen on a PC1-PC2 score plot.

2.2.3 - GASOLINE data

This data set was studied for the determination of gasoline MON. The NIR spectra were recorded on a PIONIR 1024 spectrometer from 800 to 1080 nm (0.5 nm step). Spectra were pre-processed with first derivatives to eliminate a baseline shift and to separate overlapping peaks. This data set contains three clusters, due to gasolines of different grades, and it is non-linear.

2.2.4 - POLYMER data

This data set was used for the determination of the amount of a minor mineral component in a polymer. NIR spectra were recorded from 1100 to 2498 nm (2 nm step). An SNV transformation was applied to remove a curved baseline shift between spectra. This data set is clustered and strongly non-linear, both in the X-y relationship and in the X space.

2.2.5 - GAS OIL data

This data set was studied for modelling the viscosity of hydro-treated gas oil samples. The NIR spectra were recorded on a NIR interferometer between 4770 and 6300 cm-1 (1.9 cm-1 step). Spectra were converted from wavenumbers to wavelengths, and a linear baseline correction was performed to correct for a baseline drift. Clusters and zones of unequal density are present in the data set because the samples come from three different batches. This data set is non-linear, but the non-linearity is only apparent because of the presence of two extreme samples. These extreme samples could have been misinterpreted as outliers, but the people in charge of the data acquisition established through expert knowledge that this was not the case.

2.3 - Design of the method comparison study

Models were developed using calibration samples, and their predictive ability was evaluated on perturbation-free test samples, as was done in the first part of the comparative study [1]. Perturbations were then simulated on the spectra of the test samples. The following types of perturbation were studied:
• detector noise
• change in optical pathlength
• wavelength shift
• slope in baseline
• baseline offset
• stray light
For each calibration method, the prediction error on the perturbed test samples was evaluated and compared to the prediction error on perturbation-free samples. This study therefore provides information not only on the performance of the calibration methods in the presence of perturbations, but also on the relative degradation of performance compared to perturbation-free test samples. The perturbations were simulated as follows.

2.3.1 - Detector noise

Gaussian white noise can affect detectors in spectroscopy. Since the measured transmitted or reflected light is log-transformed to absorbance, the Gaussian white noise becomes heteroscedastic (Fig. 1). To simulate detector noise in each data set, the maximum peak height of the mean spectrum was first determined.
White noise was then simulated with a standard deviation equal to a fraction of the maximum peak height and added to the transmission or reflection spectra before they were log-transformed into absorbance. For the GASOLINE data, the raw spectra before application of the first derivative were used.

Fig. 1. POLYOL data. Standard deviation of simulated detector noise. Absorbance scale.

2.3.2 - Change in optical pathlength

In spectroscopy, scattering due to different particle sizes, the presence of water in a sample, or a change of the sample cell can modify the effective pathlength of the radiation. This multiplicative effect causes a modification of the absorbance (Fig. 2).

Fig. 2. GAS OIL 1 data. Influence of a 2.5% optical pathlength change.

Let x be the absorbance value at a given wavelength. After a change ∆l of the optical pathlength l, the absorbance for the same sample at the same wavelength becomes:

x_{path} = x \left( 1 + \frac{\Delta l}{l} \right)   (1)

2.3.3 - Wavelength shift

Imperfections in the optics or in the mechanical parts of spectrometers can cause wavelength shifts. To simulate wavelength shifts, a second-order polynomial was fitted to each spectrum using 3-point spectral windows. Once the polynomial coefficients were obtained for each window, the shifted absorbance values were interpolated at the position defined by the shift value ∆λ (Fig. 3).

Fig. 3. POLYMER data. Influence of a 2 nm wavelength shift.

2.3.4 - Baseline slope

A baseline slope is often related to multiplicative perturbations such as stray light or an optical pathlength change. A slope was determined as a fraction of the maximum signal of the mean spectrum and added to all spectra of the data set (Fig. 4).

Fig. 4. WHEAT data. Influence of a 3% baseline slope.

2.3.5 - Baseline offset

Baseline offsets can be due to imperfections in the optics, fouling of the sample cell, or even changes in the positioning of the fiber optic in the cell. The baseline offset was determined as a fraction of the maximum signal in the mean spectrum and added to all spectra (Fig. 5).

Fig. 5. GAS OIL 1 data. Influence of a 2% baseline offset.

2.3.6 - Stray light

Stray light is the fraction of detected light that was not transmitted through the sample. It is usually caused by imperfections in the optical parts of the instrument. At a given wavelength, the effect of stray light is simulated before log-transformation by adding a fraction s of the maximum signal in the mean spectrum (Fig. 6).

Fig. 6. GAS OIL 1 data. Influence of 1% stray light.

The absorbance for a sample at a given wavelength in the presence of stray light is therefore calculated as:

x_{stray} = -\log\left(10^{-x} + s\right)   (2)

Some instrumental perturbations were not applied to experimental data sets that had been pre-processed in order to remove instrumental effects of the same type. For each experimental data set, the perturbation levels were adjusted by visual evaluation of the perturbation effect on the spectra.
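A hedged sketch of how these six perturbations could be simulated on a matrix of absorbance spectra is given below. This is not the code used in the study: the wavelength shift is implemented here with simple linear interpolation instead of the windowed second-order polynomial fit described above, the "maximum signal" is taken from the mean absorbance spectrum, and all function names and levels are illustrative (the levels actually used are given in Table 3).

import numpy as np

def add_detector_noise(absorbance, fraction, rng):
    """Gaussian white noise added in transmission, then converted back to absorbance."""
    transmission = 10.0 ** (-absorbance)
    peak = np.abs(absorbance.mean(axis=0)).max()        # max peak of the mean spectrum (assumption)
    noisy = transmission + rng.normal(scale=fraction * peak, size=absorbance.shape)
    return -np.log10(np.clip(noisy, 1e-12, None))

def change_pathlength(absorbance, rel_change):
    """Multiplicative effect of a relative pathlength change dl/l, eq. (1)."""
    return absorbance * (1.0 + rel_change)

def wavelength_shift(absorbance, wavelengths, shift):
    """Shift each spectrum by interpolating absorbances at displaced wavelengths (linear interpolation)."""
    return np.array([np.interp(wavelengths + shift, wavelengths, row) for row in absorbance])

def baseline_slope(absorbance, fraction):
    peak = np.abs(absorbance.mean(axis=0)).max()
    ramp = np.linspace(0.0, fraction * peak, absorbance.shape[1])
    return absorbance + ramp

def baseline_offset(absorbance, fraction):
    return absorbance + fraction * np.abs(absorbance.mean(axis=0)).max()

def stray_light(absorbance, fraction):
    """Additive stray light applied before the log transform, eq. (2)."""
    s = fraction * np.abs(absorbance.mean(axis=0)).max()
    return -np.log10(10.0 ** (-absorbance) + s)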
Details on the simulated perturbations can be found in Table 3.

Table 3. Perturbations applied to the experimental data sets.
Data set : POLYMER, GAS OIL 1, WHEAT, POLYOL, GASOLINE
Wavelength shift : 0.5 nm, 0.5 cm-1, 0.5 nm, 0.5 nm, 0.5 nm
Pathlength and stray light : 1%, 2.50%, 2.50%, 1%, 0.50%, 0.50%, 0.20%, 1%
Detector noise : 0.03%, 0.20%, 0.20%, 0.20%, 0.08%
Baseline offset : 0.50%, -
Baseline slope : 1%, 1%, 1%

All calibration and test samples were selected with the Duplex design [20]; the prediction results on perturbation-free test samples therefore differ from the prediction results obtained in the first part of the comparative study [1]. For the GASOLINE data, due to the large imprecision and bias in the reference MON values of the 30 samples used as test set in the first part, only the 132 samples that were used as calibration set in the first part were retained. Details on the data splitting for each data set are provided in Table 4.

Table 4. Number of calibration and test samples for the different experimental data sets.
Data set     Calibration   Test
WHEAT             59         40
POLYOL            52         32
GASOLINE          88         44
POLYMER           36         18
GAS OIL           69         35

3 - Results and Discussion

3.1 - Results of the previous parts

The first part of the study (interpolation) showed that PCR, preferably with PC selection, yields prediction results similar to those of PLS. PLS is, however, sometimes more parsimonious. Variable selection/elimination can have a positive influence on the predictive ability of a calibration model. In particular, the MLR-step variable selection method yields prediction results on linear problems that are comparable to, and sometimes even better than, those of the full-spectrum calibration methods. The UVE-based methods can be applied with the aim of improving the precision of prediction, but also as a diagnostic tool: this screening step makes it possible to determine to what extent the variables in X are relevant for modelling y. For linear problems, the linear methods gave better predictions, and for non-linear problems, NN-based methods, and in some cases local calibration techniques, outperformed the linear methods. LWR performed particularly well in interpolation.

In the second part of the comparative study (extrapolation), it was seen that the relative performances of the different calibration methods change when predictions are performed outside the calibration domain. The degradation of prediction also depends on the nature of the extrapolation (X-space or y-space). In the case of a linear relationship between the X variables and y, linear models outperformed the non-linear ones in the prediction of new y values that constitute extrapolations in the X space. In all types of extrapolation, PLS and PCR always outperformed the MLR-based methods. Results obtained using PLS were again similar to those of PCR-sel. The performances of PCR and PLS degraded in situations of extrapolation, but this degradation was never catastrophic, which is an attractive feature compared to other methods. As expected, the performance of the linear methods degraded more for non-linear data. The performance of non-linear or local methods can also degrade significantly for such data, in particular when the data set is clustered. No particular improvement due to the use of variable selection/elimination methods was observed in situations of extrapolation. More generally, it cannot be said that some methods are bad performers in situations of extrapolation. It is impossible to find a method that systematically outperforms the others, but certain methods such as MLR-step can be less reliable.
3.2 - Prediction in the presence of instrumental perturbations

3.2.1 - WHEAT data

The prediction results and model parameters are reported in Table 5. The RMSEP values are reported for perturbation-free test samples ("clean") and for test samples after simulation of the different instrumental perturbations.

Table 5. WHEAT data. Model parameters and prediction results (relative change in RMSEP in parentheses).
PLS-rand, 3 latent variables (1-3): clean 0.209; noise 0.219 (+4.8%); pathlength 0.224 (+7.3%); shift 0.214 (+2.1%); slope 0.268 (+28.0%); stray light 0.270 (+29.1%)
PLS-cv, 3 latent variables (1-3): clean 0.209; noise 0.219 (+4.8%); pathlength 0.224 (+7.3%); shift 0.214 (+2.1%); slope 0.268 (+28.0%); stray light 0.270 (+29.1%)
PLS-pert, 3 latent variables (1-3): clean 0.209; noise 0.219 (+4.8%); pathlength 0.224 (+7.3%); shift 0.214 (+2.1%); slope 0.268 (+28.0%); stray light 0.270 (+29.1%)
PCR-sel, 3 principal components (1,3,7): clean 0.206; noise 0.213 (+3.2%); pathlength 0.219 (+5.8%); shift 0.219 (+6.0%); slope 0.229 (+10.9%); stray light 0.252 (+22.2%)
MLR-step, 2 wavelengths (441, 480): clean 0.224; noise 0.374 (+66.2%); pathlength 0.234 (+3.9%); shift 0.297 (+32.3%); slope 0.231 (+3.0%); stray light 0.225 (+0.1%)
FT-GA, 6 coefficients (3,5,7,17,26,37): clean 0.223; noise 0.269 (+20.5%); pathlength 0.231 (+3.6%); shift 0.213 (-4.2%); slope 0.233 (+4.5%); stray light 0.223 (+0.2%)
UVE-PLS, 567/701 wavelengths, 3 latent variables (1-3): clean 0.209; noise 0.218 (+4.7%); pathlength 0.219 (+5.0%); shift 0.215 (+2.9%); slope 0.265 (+27.0%); stray light 0.257 (+23.4%)
RCE-PLS, 170 coefficients, 3 latent variables (1-3): clean 0.210; noise 0.219 (+4.5%); pathlength 0.222 (+6.1%); shift 0.213 (+1.4%); slope 0.266 (+26.9%); stray light 0.265 (+26.2%)
LWR, 33 nearest neighbours, 3 latent variables (1-3): clean 0.227; noise 0.235 (+3.4%); pathlength 0.248 (+9.3%); shift 0.230 (+1.6%); slope 0.302 (+33.4%); stray light 0.295 (+30.1%)
RBF-PLS, 7 latent variables (1-7): clean 0.200; noise 0.205 (+2.4%); pathlength 0.223 (+11.3%); shift 0.206 (+2.8%); slope 0.216 (+8.0%); stray light 0.284 (+42.1%)
PC-NN, topology 3-2-1 (PCs 1,3,4): clean 0.215; noise 0.221 (+2.7%); pathlength 0.219 (+1.9%); shift 0.220 (+2.0%); slope 0.260 (+20.9%); stray light 0.241 (+12.2%)
FT-NN, topology 5-2-1 (coefficients 3,4,6,9,10): clean 0.217; noise 0.224 (+5.3%); pathlength 0.215 (-0.5%); shift 0.225 (+15.6%); slope 0.244 (+12.4%); stray light 0.234 (+8.1%)

In the absence of perturbations, all methods perform equally well and the models are relatively parsimonious. Being parsimonious, the models can be expected to be robust, which explains why the simulated perturbations have very little influence on most calibration methods. The MLR-step model uses only two variables, which is highly desirable from the point of view of model interpretation and of robustness towards some types of perturbations. This model is however the most sensitive to increasing detector noise and to the wavelength shift. The wavelength shift is the same as for the POLYMER data and the pathlength change is larger than for the GAS OIL data, but they have little influence on this data set. This illustrates that the effect of perturbations on NIR calibration models does not only depend on the calibration methods and on the magnitude of the perturbations, but also on the data themselves. Overall, pathlength change and wavelength shift have relatively little effect, and changes in slope and stray light are the most influential perturbations. They mainly affect calibration methods that use LVs or PCs for modelling or preprocessing, because the absorbance values at all wavelengths entering the linear combinations are modified, so that the impact of the perturbations is amplified. MLR-step and the Fourier transform-based methods (FT-GA, FT-NN) are more robust with respect to these perturbations.

3.2.2 - POLYOL data

The prediction results and model parameters are reported in Table 6.

Table 6. POLYOL data. Model parameters and prediction results.
Method MODEL COMPLEXITY Used variables Clean PLS-rand 6 latent variables 1-6 2.488 PLS-cv 8 latent variables 1-8 1.587 PLS-pert 6 latent variables 1-6 2.488 PCR-sel 7 principal components 1-6,8 2.039 MLR-step 6 wavelengths FT-GA 8 coefficients UVE-PLS RCE-PLS LWR 206/499 wavelengths 7 latent variables 139 coefficients 7 latent variables 37 nearest neighbors 8 latent variables RBF-PLS 26 latent variables 2.498 +0.4% 1.585 -0.1% 2.498 +0.4% 2.047 +0.4% Pathlength 2.522 +1.4% 1.771 +11.6% 2.522 +1.4% 2.154 +5.6% 2.644 +6.3% 1.704 +7.3% 2.644 +6.3% 2.118 +3.9% Stray light 2.630 +5.7% 1.891 +19.2% 2.630 +5.7% 2.190 +7.4% 3.768 +27.3% 3.037 +2.6% 3.310 +11.9% 2.816 -4.8% 3.336 +12.7% 1.517 2.743 +80.8% 2.387 +57.3% 1.704 +12.3% 1.604 +5.7% 2.594 +71.0% 1-7 1.741 2.047 +17.5% 1.739 -0.1% 1.836 +5.4% 1.819 +4.5% 1.922 +10.4% 1-7 1.679 1.887 +12.4% 1.677 -0.1% 1.753 +4.4% 1.764 +5.0% 1.797 +7.0% 1-8 1.568 1.779 +13.5% 1.392 -11.2% 2.274 +45.0% 1.778 +13.4% 2.873 +83.2% 1-26 1.820 1.951 +7.2% 3.355 +84.3% 1.865 +2.5% 2.160 +18.6% 2.203 +6.5% 2.866 -0.5% 3.579 +73.0% 3.206 +11.4% 2.197 +6.2% 2.882 +0.1% 2.342 +13.2% 2.909 +1.0% 489,144,377,449,403,350 2,3,7,9,15,20,34,49 Noise Slope 2.556 +2.7% 1.766 +11.2% 2.556 +2.7% 2.363 +15.9% 2.959 PC-NN Topology : 6-3-1 1,2,4,6,8,9 2.069 FT-NN Topology : 6-2-1 1-4,7,15 2.879 3.664 +101.3% 3.764 +81.9% 3.143 +9.2% Shift

In the absence of perturbations, the best results are obtained with PLS-cv, LWR and FT-GA, which use 8 LVs or coefficients. However, they are more affected by detector noise (in particular FT-GA) than more parsimonious methods like PLS-rand, PLS-pert, FT-NN or PC-NN, which use only 6 LVs or coefficients. PCR-sel is more robust than PC-NN with respect to slope and pathlength change. This difference in robustness is not due to the intrinsic non-linear nature of NN applied to a linear model, but to the large sensitivity of PC 9 (retained in PC-NN and not in PCR-sel) with respect to these perturbations. Overall, PLS-rand, PLS-pert, FT-NN and RCE-PLS are robust with respect to all perturbations. PCR-sel, PLS-cv and UVE-PLS are also relatively robust. RCE-PLS, PLS-cv and UVE-PLS offer the best compromise in terms of performance both in the presence and in the absence of perturbations. The performances of FT-NN and MLR-step are relatively similar. It seems that for this data set the most parsimonious models (MLR-step, PCR-sel, PLS-rand, PLS-pert, PC-NN) lack some explanatory power, and that this loss is not compensated by a better robustness with respect to perturbations, which is unusual.

3.2.3 - GASOLINE 1 data

The prediction results and model parameters are reported in Table 7.

Table 7. GASOLINE 1 data. Model parameters and prediction results.
Method MODEL COMPLEXITY PLS-rand 10 latent variables Used variables 1-10 Clean Noise 0.198 0.250 +26.3% Pathlength 0.2398 +21.0% PLS-cv 12 latent variables 1-12 0.196 0.275 +39.9% 0.292 +48.9% PLS-pert 9 latent variables 1-9 0.179 0.218 +21.6% 0.184 +2.8% PCR-sel 9 principal components 0.278 0.306 +9.9% 0.342 +22.8% MLR-step 11 wavelengths 0.237 1.798 +657.2% FT-GA 9 coefficients 0.220 UVE-PLS 141/561 wavelengths 8 latent variables 59 coefficients 6 latent variables 1-5,7,10,13,15 460,348,352,307,552,295,524,166 7,9,15,16,21,25,29,33,35 1-8 1-6 RCE-PLS LWR 87 nearest neighbors 12 latent variables RBF-PLS 16 latent variables Shift Slope 0.218 +9.8% 0.196 +0.1% 0.390 +98.8% 0.176 -2.1% 0.240 +34.0% 0.257 -7.7% 0.461 +65.8% 0.259 +9.2% 1.135 +472.7% 1.066 +443.3% 1.275 +611.4% 1.472 +429.5% 0.378 +59.2% Stray light 0.324 +63.6% 0.275 +15.8% 0.355 +49.5% 0.453 +105.6% 0.283 +44.6% 0.363 +64.7% 0.350 +59.0% 0.220 +0.0% 0.414 +87.8% 0.290 +48.1% 0.2143 +9.6% 0.379 +93.8% 0.185 0.267 +44.3% 0.228 +22.9% 1-12 0.196 0.274 +39.9% 0.396 +113.5% 0.292 +48.9% 0.479 +158.6% 0.390 +98.8% 1-16 0.184 0.2513 +36.7% 0.294 +60.1% 0.492 +151.7% 0.916 +394.3% 1.066 +443.3% 0.872 +374.1% 1.138 +503.5% 0.233 +1.7% 0.196 PC-NN Topology : 8-1-1 1-5,7,8,10 0.188 0.197 +4.3% 0.227 +20.4% FT-NN Topology : 7-1-1 1,2,6,7,11,16,18 0.229 0.236 +3.2% 0.252 +10.3% 0.196 +0.1% 0.188 +2.4% 0.189 +0.3% 0.876 +283.1% 0.392 +113.1% 0.282 +49.7% 0.227 -0.7%

In the absence of perturbations, the best results are obtained with PLS-pert, RCE-PLS, RBF-PLS and PC-NN, with RMSEP values on perturbation-free samples around 0.180. Most calibration methods are very sensitive to the detector noise, which is magnified after pre-processing with the first derivative. In particular, the performance of MLR-step degrades significantly. The most robust methods with respect to noise are PCR-sel, PLS-pert and PC-NN; for the three variants of PLS, robustness decreases as more LVs are retained. All methods are sensitive to the pathlength change except PLS-pert; MLR-step, PLS-rand, FT-NN and PC-NN are slightly more robust than the other ones with respect to this perturbation. The spectral differences due to the shift are amplified by the first-derivative pre-processing, so large prediction errors must be expected with this perturbation. Indeed, the calibration methods are very sensitive to the wavelength shift, except FT-NN and, to a lesser extent, MLR-step and FT-GA. The better robustness of the Fourier transform-based methods is due to the fact that the shape of the spectra is not affected by the shift. Compared to the other perturbations, the slope change has only a limited influence on all methods, except FT-NN. Unlike the optical pathlength change or stray light, whose effect is wavelength-dependent, the slope effect is similar at all wavelengths, hence the first Fourier coefficient is very sensitive to this perturbation. Overall, the most robust method is FT-NN, except after addition of the baseline slope, which affects the first Fourier coefficient (the sum of absorbances) much more than the other coefficients. FT-GA always performs worse than FT-NN, except for the baseline slope effect, which does not affect the coefficients retained by GA. However, one must keep in mind that with FT-GA the selection is performed by GA on the first 50 Fourier coefficients (which contain some high-frequency coefficients likely to be contaminated with noise).
In FT-NN, the selection of coefficients is performed by trial-and-error on the first 20 coefficients. The stray light effect has a strong influence on all methods, except FT-NN. Again, it is likely that this effect has more impact on the higher-order (higher-frequency) coefficients retained by FT-GA than on the coefficients used for modelling with NN.

3.2.4 - POLYMER data

The prediction results and model parameters are reported in Table 8.

Table 8. POLYMER data. Model parameters and prediction results (relative change in RMSEP in parentheses).
PLS-rand, 5 latent variables (1-5): clean 0.055; noise 0.058 (+4.2%); shift 0.076 (+37.9%)
PLS-cv, 6 latent variables (1-6): clean 0.051; noise 0.054 (+6.1%); shift 0.066 (+29.7%)
PLS-pert, 5 latent variables (1-5): clean 0.055; noise 0.058 (+4.2%); shift 0.076 (+37.9%)
PCR-sel, 4 principal components (1-3,7): clean 0.086; noise 0.091 (+6.9%); shift 0.091 (+7.2%)
MLR-step, 2 wavelengths (458, 37): clean 0.086; noise 0.087 (+1.4%); shift 0.086 (+0.1%)
FT-GA, 5 coefficients (3,12,13,15,47): clean 0.052; noise 0.063 (+19.4%); shift 0.092 (+75.8%)
UVE-PLS, 411/700 wavelengths, 6 latent variables (1-6): clean 0.052; noise 0.056 (+6.3%); shift 0.075 (+42.9%)
RCE-PLS, 167 coefficients, 6 latent variables (1-6): clean 0.052; noise 0.053 (+2.3%); shift 0.048 (-6.6%)
LWR, 4 nearest neighbours, 2 latent variables (1-2): clean 0.008; noise 0.008 (+0.0%); shift 0.008 (+0.0%)
RBF-PLS, 18 latent variables (1-18): clean 0.040; noise 0.045 (+11.5%); shift 0.044 (+11.2%)
PC-NN, topology 1-3-1 (PC 1): clean 0.017; noise 0.017 (+0.0%); shift 0.017 (+0.0%)
FT-NN, topology 3-3-1 (coefficients 6,13,14): clean 0.015; noise 0.017 (+13.5%); shift 0.015 (+4.7%)

In the absence of perturbations, the best results are obtained with the two non-linear methods (FT-NN, PC-NN) and a locally linear method (LWR). PC-NN and LWR are also the most robust with respect to perturbations. This robustness is due to the parsimony of the models built (only 2 LVs for LWR, only 1 PC for the NN): the variables in both models are not affected by the simulated perturbations. The MLR and PCR models are parsimonious and robust, but they are outperformed by all other models. The PLS-based methods (PLS-cv, PLS-rand, PLS-pert, UVE-PLS) use more factors to accommodate the non-linearity, but the higher-order factors are affected by the wavelength shifts, which leads to a degradation of the RMSEP values. The wavelet coefficients used in RCE-PLS and the Fourier coefficients retained by FT-NN seem robust, whereas the Fourier coefficients retained by FT-GA are particularly sensitive to the wavelength shift.

3.2.5 - GAS OIL data

The prediction results and model parameters are reported in Table 9.

Table 9. GAS OIL 1 data. Model parameters and prediction results.
Method MODEL COMPLEXITY Used variables Clean PLS-rand 4 latent variables 1-4 0.452 PLS-cv 7 latent variables 1-7 0.338 PLS-pert 5 latent variables 1-5 0.414 PCR-sel 5 principal components 1-5 0.501 MLR-step 6 wavelengths FT-GA 9 coefficients UVE-PLS RCE-PLS LWR 348/795 wavelengths 4 latent variables 256 coefficients 4 latent variables 15 nearest neighbors 3 latent variables RBF-PLS 19 latent variables 495,283,496,364,755,226 2,7,12,14,19,23,27,41,43 0.494 0.327 1-4 0.435 1-4 0.421 1-3 0.478 1-19 0.227 PC-NN Topology : 8-2-1 1-8 0.251 FT-NN Topology : 11-2-1 1-3,8,12-15,17,19,20 0.281 0.497 1.160 Pathlength 1.097 0.494 Stray light 1.014 +10.0% +156.7% +142.8% +9.4% +124.4% 0.495 0.614 1.128 0.359 1.407 +46.3% +81.6% +233.2% +6.1% +315.8% Noise Offset Shift 0.481 1.189 1.007 0.509 1.025 +16.2% +187.3% +143.5% +23.0% +147.7% 0.546 1.267 1.261 0.596 1.158 +9.0% +152.9% +151.9% +19.1% +131.2% 1.362 0.568 1.019 0.879 2.419 +176.0% +15.0% +106.4% +78.1% +390.0% 0.733 0.327 0.944 0.343 1.385 +124.1% +0.0% +188.6% +4.8% +323.6% 0.466 1.081 1.071 0.560 1.007 +7.1% +148.8% +146.3% +28.8% +131.6% 0.481 1.111 1.011 0.481 0.912 +14.2% +163.6% +140.1% +14.3% +116.6% 0.444 1.096 1.923 0.742 2.671 -7.1% +129.1% +302.0% +55.1% +458.2% 0.422 0.435 0.887 0.283 2.328 +85.8% +91.4% +290.6% +24.4% +924.9% 0.401 0.797 1.026 0.765 1.669 +59.9% +217.4% +308.6% +204.7% +564.7% 0.858 11.702 0.556 0.281 1.051 +205.1% +4062% +97.8% -0.1% +273.9%

In the absence of perturbations, the best results are obtained with a locally linear method (RBF-PLS) and the two non-linear methods (FT-NN, PC-NN). PC-NN and FT-NN use an unusually large number of input variables (8 and 11 respectively) and are therefore very sensitive to perturbations, except for the wavelength shift, which has very little influence on the Fourier coefficients used by FT-NN. PLS-cv performs well in the absence of perturbations, but it is less parsimonious than the models developed with PCR-sel, PLS-rand, PLS-pert or UVE-PLS. As a consequence, its performance degrades when noise is added to the test spectra and it then performs similarly to the other LV-based methods. The absorbance offset has a strong influence on all methods except MLR-step (because it retains only 6 original variables) and FT-GA, since the Fourier coefficients describe the shape of the spectra and this shape is not affected by the offset. However, the performance of FT-NN degrades significantly after addition of this offset because, contrary to FT-GA, the first Fourier coefficient is retained. This coefficient is the sum of all absorbance values in the spectrum, and is therefore sensitive to an absorbance offset. All methods are affected by the multiplicative effects (change in optical pathlength and stray light). Most methods are relatively robust with respect to the wavelength shift, except MLR-step, LWR and PC-NN.

4 - Conclusions

The study of the prediction results in this third part of the comparative study provided information on the robustness of the different calibration methods with respect to unmodelled instrumental perturbations. In most cases, the influence of instrumental perturbations is difficult to predict because it depends on a large number of factors:
• the nature of the perturbation,
• the level of the perturbation,
• the preprocessing of the data,
• the nature of the calibration method.
Some general conclusions can nevertheless be drawn.
It can be observed that complex models (in particular those developed on the GASOLINE or GAS OIL data) are very sensitive to any type of perturbation, whereas models of smaller complexity are more robust. The wavelength shift has a catastrophic effect on models developed with first-derivative data. In order to obtain a better overview of method performances, the methods were ranked according to the arbitrary scoring criterion displayed in Table 10:

- Column "Error < 15%" : 1 point was given to a method when the relative change in RMSEP after addition of a perturbation was lower than 15%. This column evaluates how many times (out of 22) a method was able to deal efficiently with instrumental perturbations.
- Column "Error > 30%" : 1 point was given to a method when the relative change in RMSEP after addition of a perturbation was higher than 30%. This column evaluates how many times (out of 22) a method behaved particularly badly after inclusion of instrumental perturbations.
- Column "Mean Error" : this column gives the mean relative error obtained by each method over the 22 different combinations of data sets and perturbations.
- Column "Error ranking" : methods were ranked according to their mean relative error. The best method according to this criterion received 12 points, decreasing down to one point for the worst-performing method.

Table 10. Evaluation of robustness with respect to instrumental perturbations.
Method      Error < 15%   Error > 30%   Mean Error   Error ranking
PLS-rand         12             6           52.9            10
PLS-cv            7             8           66.8             6
PLS-pert         11             6           59.9             7
PCR-sel          13             5           47.6            11
MLR-step         11             7           78.3             5
FT-GA             7            11           59.7             8
UVE-PLS          10             8           43.6            12
RCE-PLS           9             7           58.4             9
LWR               7            12           83.0             4
RBF-PLS           9            11          104.8             2
PC-NN            11             9           97.9             3
FT-NN            16             5          228.2             1

In order to further summarise the information in Table 10, a global ranking was built by adding the points obtained for "Error < 15%" to those obtained for "Error ranking", and subtracting the points obtained for "Error > 30%". The results of this ranking are displayed in Table 11.

Table 11. Score for robustness with respect to instrumental perturbations.
Method           Score
PCR-sel            19
PLS-rand           16
UVE-PLS            14
FT-NN              12
RCE-PLS            11
MLR-step            9
PLS-cv, PC-NN       5
FT-GA               4
RBF-PLS             0
LWR                -1
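The global score of Table 11 can be recomputed directly from the columns of Table 10, as the following short sketch shows (the numbers are those of Table 10; only the presentation of the computation is new).

methods = {  # method: (error < 15%, error > 30%, mean relative error in %)
    "PLS-rand": (12, 6, 52.9),  "PLS-cv":   (7, 8, 66.8),   "PLS-pert": (11, 6, 59.9),
    "PCR-sel":  (13, 5, 47.6),  "MLR-step": (11, 7, 78.3),  "FT-GA":    (7, 11, 59.7),
    "UVE-PLS":  (10, 8, 43.6),  "RCE-PLS":  (9, 7, 58.4),   "LWR":      (7, 12, 83.0),
    "RBF-PLS":  (9, 11, 104.8), "PC-NN":    (11, 9, 97.9),  "FT-NN":    (16, 5, 228.2),
}
# Rank points: 12 for the lowest mean error, down to 1 for the highest.
by_mean_error = sorted(methods, key=lambda m: methods[m][2])
rank_points = {m: len(methods) - i for i, m in enumerate(by_mean_error)}
# Global score: "Error < 15%" points plus rank points minus "Error > 30%" points.
score = {m: methods[m][0] + rank_points[m] - methods[m][1] for m in methods}
for m, s in sorted(score.items(), key=lambda kv: -kv[1]):
    print(f"{m:14s} {s:3d}")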
According to our ranking, FT-NN is the method that is most often able to achieve relative errors lower than 15%. However, it sometimes leads to catastrophic errors (it has the highest mean relative error). It seems that NIR spectra described with only low-order (low-frequency) Fourier coefficients lead to models that are more robust with respect to multiplicative effects such as stray light or an optical pathlength change. In most cases, LV-based methods are relatively robust with respect to detector noise, provided that the number of factors retained is not too large. Contrary to usual statements, it was not observed that NN-based models were systematically sensitive to perturbations; in most cases where a performance degradation was observed, it was due to the sensitivity of the input variables (FT coefficients or PC scores) to the perturbations, not to the NN algorithm itself. LWR usually performs well for prediction in the absence of perturbations (see also the results of the first part of the study), but it is not particularly robust: for the local models developed in LWR, if the displacement caused by the perturbation in the multivariate space is too large, the nearest neighbours change and the local model is built with a different subset of calibration samples, which may not be appropriate.

The overall best performing methods according to our ranking are PCR-sel and PLS-rand. Although not performing spectacularly well, these two methods rarely fail badly. Globally, it can be concluded that no single method can be considered as generally more robust than the others.

REFERENCES

[1] V. Centner, G. Verdú-Andrés, B. Walczak, D. Jouan-Rimbaud, F. Despagne, L. Pasti, R. Poppi, D.L. Massart, O.E. de Noord, Appl. Spectrosc. 54 (2000) 608-623.
[2] F. Estienne, L. Pasti, V. Centner, B. Walczak, F. Despagne, D. Jouan-Rimbaud, O.E. de Noord, D.L. Massart, Chemom. Intell. Lab. Syst. 58 (2001) 195-211.
[3] T. Naes, H. Martens, J. Chemom. 2 (1988) 155-167.
[4] H. van der Voet, Chemom. Intell. Lab. Syst. 25 (1994) 313-323.
[5] H. Martens, T. Naes, Multivariate Calibration, Wiley, Chichester (1989).
[6] S. de Jong, Chemom. Intell. Lab. Syst. 18 (1993) 251-263.
[7] F. Despagne, D.L. Massart, O.E. de Noord, Anal. Chem. 69 (1997) 3391-3399.
[8] N.R. Draper, H. Smith, Applied Regression Analysis (2nd edition), Wiley, New York (1981).
[9] D.E. Goldberg, Genetic Algorithms in Search, Optimisation and Machine Learning, Addison-Wesley, Reading, MA (1989).
[10] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem. 67 (1995) 4295-4301.
[11] R. Leardi, R. Boggia, M. Terrile, J. Chemom. 6 (1992) 267-281.
[12] L. Pasti, D. Jouan-Rimbaud, D.L. Massart, O.E. de Noord, Anal. Chim. Acta 364 (1998) 253-263.
[13] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.G.M. Vandeginste, C. Sterna, Anal. Chem. 68 (1996) 3851-3858.
[14] T. Naes, T. Isaksson, NIR News 5 (4) (1994) 7-8.
[15] V. Centner, D.L. Massart, Anal. Chem. 70 (1998) 4206-4211.
[16] B. Walczak, D.L. Massart, Anal. Chim. Acta 331 (1996) 187-193.
[17] F. Despagne, D.L. Massart, Chemom. Intell. Lab. Syst. 40 (1998) 145-163.
[18] R. Fletcher, Practical Methods of Optimization, Wiley, New York (1987).
[19] J. Kalivas, Chemom. Intell. Lab. Syst. 37 (1997) 255-259.
[20] R.D. Snee, Technometrics 19 (1977) 415-428.

THE DEVELOPMENT OF CALIBRATION MODELS FOR SPECTROSCOPIC DATA USING MULTIPLE LINEAR REGRESSION

Based on : THE DEVELOPMENT OF CALIBRATION MODELS FOR SPECTROSCOPIC DATA USING PRINCIPAL COMPONENT REGRESSION, Internet Journal of Chemistry 2 (1999) 19, URL: http://www.ijc.com/articles/1999v2/19/

R. De Maesschalck, F. Estienne, J. Verdú-Andrés, A. Candolfi, V. Centner, F. Despagne, D. Jouan-Rimbaud, B. Walczak+, D.L. Massart*, S. de Jong1, O.E. de Noord2, C. Puel3, B.M.G. Vandeginste1

ChemoAC, Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium. fabi@fabi.vub.ac.be
+ on leave from : Silesian University, Katowice, Poland
1 Unilever Research Laboratorium Vlaardingen, P.O. Box 114, 3130 AC Vlaardingen, The Netherlands
2 Shell International Chemicals B.V., Shell Research and Technology Centre Amsterdam, P.O. Box 38000, 1030 BN Amsterdam, The Netherlands
3 Centre de Recherches Elf-Antar, Centre Automatisme et Informatique, BP 22, 69360 Solaize, France
* Corresponding author

ABSTRACT

This article aims at explaining how to develop a calibration model for spectroscopic data analysis by Multiple Linear Regression (MLR).
Building an MLR model on spectroscopic data implies selecting variables. Variable selection methods are therefore studied in this article. Before applying the method, the data have to be investigated in order to detect, for instance, outliers, a clustering tendency or non-linearities. How to handle replicates and how to perform different data preprocessings and/or pretreatments is also explained in this tutorial.

KEYWORDS : Multivariate calibration, method comparison, extrapolation, non-linearity, clustering.

1. Introduction

The development of a calibration model for spectroscopic data analysis by Multiple Linear Regression (MLR) consists of many steps, from the pre-treatment of the data to the use of the calibration model. This process includes, for instance, outlier detection (and possible rejection), validation, and many other topics of chemometrics. Apart from general chemometrics publications [1], many books and papers are devoted to regression in general and to Multiple Linear Regression in particular. The method can be approached from a general statistical point of view [2,3], or with direct application to analytical chemistry [4,5]. Readers might get confused, since the literature often describes several alternative approaches for each step of the calibration process; for example, several tests have been described for the detection of outliers. Our aim is therefore not to present the general theory of the methods involved, but rather to present some of the main alternatives, to help the reader understand them and to decide which ones to apply. Thus, a complete strategy for calibration development is presented. Much of this strategy is equally applicable to other methods such as Principal Component Regression [6], partial least squares, or, to some extent, neural networks [7], and can be found in the related tutorials [8,9].

A specificity of MLR is that the mathematical background of the method is very simple and easy to understand. Since original variables are used, interpretation can be very straightforward. Moreover, experience shows that MLR can perform very well, even outperforming latent variable methods on certain types of spectroscopic data for which it is particularly suited. However, some specific problems arise when using MLR, e.g. the necessity to perform variable selection before calibration, or the problem of random correlation. It was therefore decided to develop a dedicated tutorial for MLR. Even though the tutorial was written specifically with spectroscopic data in mind, some guidelines also apply to other types of data, in particular those concerning the specific aspects of MLR described above.

MLR, also often called multivariate regression or multiple regression, is used to obtain values for the b-coefficients in an equation of the type:

y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_m x_m   (1)

where x_1, x_2, ..., x_m are different variables. In analytical spectroscopic applications, these variables could be the absorbances obtained at different wavelengths, y being a concentration or another characteristic of the samples that has to be predicted. The b-values are estimates of the true b-parameters, and the estimation is done by minimising a sum of squares. It can be shown that:

b = (X'X)^{-1} X'y   (2)

where b is the vector containing the b-values from eq. (1), X is an n×m matrix containing the x-values for the n samples (or objects, as they are often called) and m variables, and y is the vector containing the measurements for the n samples.
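As a small numerical illustration of eqs. (1) and (2), with synthetic data, an explicit column of ones for the intercept b_0, and illustrative names:

import numpy as np

rng = np.random.default_rng(0)
n, m = 25, 4                                   # n samples, m selected wavelengths
X = rng.normal(size=(n, m))
true_b = np.array([0.8, -0.3, 0.5, 0.1])
y = 2.0 + X @ true_b + 0.05 * rng.normal(size=n)

X1 = np.hstack([np.ones((n, 1)), X])           # column of ones for the intercept b0
b = np.linalg.inv(X1.T @ X1) @ X1.T @ y        # eq. (2): b = (X'X)^-1 X'y
# In practice a least-squares solver is numerically preferable to an explicit inverse:
b_ls, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_pred = X1 @ b                                # predicted property values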
A difficulty is that the inversion of the X'X matrix leads to unstable results when the x-variables are highly correlated, which is most of the time the case with spectroscopic data. There are two ways to avoid this problem. One approach consists in combining the variables in such a way that the resulting summarising variables are not correlated (feature reduction). For instance, PCR consists in relating the scores of a Principal Component Analysis (PCA) model to the property of interest y through an MLR model. This method is not described here but is covered by a specific tutorial [8]. Another way is to select specific variables such that the correlation is reduced. This approach is called variable selection or feature selection, and is developed in the rest of this tutorial.

As can be seen from eq. (1), MLR is an inverse calibration method. In classical calibration the basic equation is:

signal = f(concentration)   (3)

The measured signal is subject to noise. In the calibration step we assume that the concentration is known exactly. In multivariate calibration one often does not know the concentrations of all the compounds that influence the absorbance at the wavelengths of interest, so that this model cannot be applied. The calibration model is then written as the inverse:

concentration = f(signal)   (4)

In inverse calibration the regression parameters b are biased, and so are the concentrations predicted using the biased model. However, the predictions are more precise than in classical calibration. This can be explained by considering that the least-squares step in inverse calibration involves a minimisation of a sum of squares in the direction of the concentration, and that the determination of the concentrations is precisely the aim of the calibration. It is found that for univariate calibration the gain in precision is more important than the increase in bias. The accuracy of the calibration, defined as the deviation between the experimental and the true result, and therefore comprising both random errors (precision) and systematic errors (bias), is better for inverse than for classical calibration [10].

Having to use inverse calibration is in no way a disadvantage. In fact, the concentrations in the calibration samples are usually not known exactly but are determined with a reference method. This means that both the y- and the x-values are subject to random error, so that least-squares regression is not the optimal method to use. A comparison of predictions made with regression methods that consider random errors in both the y- and the x-direction (total least squares) with those using ordinary least squares (OLS) in the y or concentration direction (inverse calibration) shows that the results obtained by total least squares (TLS) [11,12] are no better than those obtained by inverse calibration.

Each step needed to develop a calibration model is discussed in detail in this paper. We have considered a situation in which a minimum of a priori knowledge is available and where virtually no decision has been made before beginning the measurements and the method development. In many cases information is available or decisions have been taken which will have an influence on the strategy to adopt.
For instance, it is possible to decide before the measurement campaign that the initial samples will be collected for developing the model and that validation samples will be collected later, so that no splitting is considered (chapters 8 and 11), or to be aware that there are two types of samples but that a single model is required. In the latter case, the person responsible for the model development knows, or at least suspects, that there are two clusters of samples and will probably not determine a clustering tendency (chapter 6), but will verify visually that there are two clusters as expected. Whatever the situation and the scheme applied in practice, the following steps are usually present:

• visual evaluation of the spectra before and after pre-treatment: do replicate spectra largely overlap, is there a baseline offset, etc.
• visual evaluation of the X-space, usually by looking at score plots resulting from a PCA, to look for gross outliers, clusters, etc. In what follows, it will be assumed that gross outliers have been eliminated.
• visual evaluation of the y-values, to verify that the expected calibration range is properly covered and to note possible inhomogeneities, which might be remedied by measuring additional samples.
• selection of the samples that will be used to train, optimise and validate the model, and of the scheme that will be followed.
• a first modelling trial, to decide whether it is possible to reach the expected quality of model and to detect gross non-linearity if it is present.
• refinement of the model, e.g. by considering the elimination of possible outliers, selecting the optimal number of variables, etc.
• final validation of the model.
• routine use and updating of the model.

2. Replicates

Different types of replicates should be considered. Replicates in X are defined as replicate spectroscopic measurements of the same sample. The replicate measurement should preferably include the whole measurement process, for instance including the filling of the sample holders. Replicates of the reference measurements are called replicates in y. Since the quality of the prediction does not only depend on the spectroscopic measurement but also on the reference method, the acquisition of replicates both in X and in y, i.e. both in the spectroscopic measurement and in the reference analysis, is recommended. However, since the spectroscopic measurement, e.g. NIR, is usually much easier to carry out, it is more common to have replicates in X than in y.

Replicates in X increase the precision of the predictions which are obtained. Precision is used here as a general term: depending on the way in which the precision is determined, a repeatability, an intermediate precision or a reproducibility will be obtained [13,14]. For instance, if all replicates are measured by the same person on the same day and on the same instrument, a repeatability is obtained. Replicates in X can be used to select the best pre-processing method (see chapter 3) and to compute the precision of the values predicted with the multivariate calibration method. The predicted y-values for replicate calibration samples can be computed; the standard deviation of these values includes information about the experimental procedure followed, the variation between days and/or operators, etc. The mean spectrum of each set of replicates is used to build the model.
If the model does not use the mean spectra, then in the validation step (chapter 11) the replicates cannot be split between the calibration and the test set. It should be noted that if the means of the replicates were used in the development of the model, means should also be used in the prediction phase, and vice versa; otherwise the estimates of precision derived during the modelling phase may be wrong.

Outlying replicates must first be eliminated by using the Cochran test [15], a univariate test for comparing variances that is described in many statistics books. This is done by comparing the variance between replicates for each sample with the sum of these variances. The absorbance values constituting the spectrum of a replicate are summed after applying the pre-processing method (see chapter 3) that will be used in the modelling stage, and the variance of these sums over the replicates is calculated for each sample. The highest of these variances is selected. Calling the object yielding this variance i, this variance is divided by the sum of the variances of all samples. The result is compared to a tabulated critical value at the selected level of confidence. When the value for object i is higher than the critical one, it is concluded that i probably contains at least one outlying replicate. The outlying replicate is detected visually by plotting all replicates of object i, and is removed from the data set. Due to the elimination of one or more replicates, the number of replicates can become unequal from sample to sample. This number is not equalised, because eliminating replicates of the other samples would mean losing information.
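A sketch of this screening step, following the summed-absorbance recipe just described, could look as follows. The critical value itself must be looked up in a Cochran table for the appropriate number of samples and replicates and is not reproduced here; the data and names are illustrative.

import numpy as np

def cochran_statistic(spectra_per_sample):
    """spectra_per_sample: list of 2-D arrays (replicates x wavelengths),
    one array per sample, already pre-processed."""
    sums = [replicates.sum(axis=1) for replicates in spectra_per_sample]  # one sum per replicate
    variances = np.array([s.var(ddof=1) for s in sums])                   # variance over the replicates
    suspect = int(np.argmax(variances))
    C = variances[suspect] / variances.sum()                              # Cochran statistic
    return C, suspect

# Synthetic demonstration: 10 samples, 3 replicates each.
rng = np.random.default_rng(0)
demo = [rng.normal(size=(3, 100)) for _ in range(10)]
C, suspect = cochran_statistic(demo)
# If C exceeds the tabulated critical value, plot the replicates of the suspect
# sample, remove the outlying replicate and repeat the test.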
3. Signal pre-processing

3.1. Reduction of non-linearity

A very different type of pre-processing is applied to correct for the non-linearity due to measuring transmittance or reflectance [16]. To decrease non-linearity problems, reflectance (R) or transmittance (T) values are transformed into absorbance (A):

A = \log_{10}(1/R) = -\log_{10} R   (5)

The equipment normally provides these values directly. For solid samples another approach is the Kubelka-Munk transformation [17]. In this case, the reflectance values are transformed into Kubelka-Munk units (K/S), using the equation:

K/S = (1 - R)^2 / (2R)   (6)

where K is the absorption coefficient and S the scatter coefficient of the sample at a given wavelength.

3.2. Noise reduction and differentiation

When applying signal processing, the main aim is to remove part of the noise present in the signal or to eliminate some sources of variation (e.g. background) not related to the measured y-variable. It is also possible to try to increase the differences in the contribution of each component to the total signal and in this way make certain wavelengths more selective. The type of pre-processing depends on the nature of the signal. General-purpose methodologies are smoothing and differentiation.

By smoothing one tries to reduce the random noise in the instrumental signal. The most used chemometric methodology is the one proposed by Savitzky and Golay [18]. It is a moving-window averaging method. The principle of the method is that, for small wavelength intervals, the data can be fitted by a polynomial of adequate degree, and that the fitted values are a better estimate than the measured ones, because some noise has been removed. For the initial window the method takes the first 2m+1 points and fits, by least squares, the corresponding polynomial of order O. The fitted value for the point in position m replaces the measured value. After this operation, the window is shifted by one point and the process is repeated until the last window is reached. Instead of calculating the corresponding polynomial each time, if the data have been obtained at equally spaced intervals, the method uses tabulated coefficients, in such a way that the fitted value for the centre point of the window is computed as:

x^*_{ij} = \frac{1}{\mathrm{Norm}} \sum_{k=-m}^{m} c_k \, x_{i,j+k}   (7)

where x^*_{ij} represents the fitted value for the centre point of the window, the x_{i,j+k} represent the 2m+1 original values in the window, c_k is the appropriate coefficient value for each point and Norm is a normalising constant (Fig. 1a-b). Because the values of c_k are the same for all windows, provided the window size and the polynomial degree are kept constant, the use of the tabulated coefficients simplifies and accelerates the computations. For computational use, the coefficients for every window size and polynomial degree can be obtained from [19,20]. The user must decide the size of the window, 2m+1, and the order of the polynomial to be used. Errors in the original tables were corrected later [21]; these corrected coefficients also allow the smoothing of the extreme points, which in the original method of Savitzky and Golay had to be removed. Recently, a methodology based on the same technique has been proposed [22], in which the degree of the polynomial used is optimised in each window. This methodology has been called the Adaptive-Degree Polynomial Filter (ADPF).

Another way of carrying out smoothing is by repeated measurement of the spectrum, i.e. by obtaining several scans and averaging them. In this way, the signal-to-noise ratio (SNR) increases with \sqrt{n_s}, n_s being the number of scans.

Fig. 1. a) Application of the Savitzky-Golay method (window size 7, m=3; cubic polynomial, n=3): o measured data, * smoothed data. b) Smoothed results for the data set in a): … original data, o measured data, * smoothed data. c) 1st derivative of the cubic polynomial in the different windows of a): * estimated 1st-derivative data. d) 1st derivative of the data set in a): … true 1st derivative, * estimated values (window size 13, m=6; cubic polynomial, n=3).

It should be noted that in many cases the instrument software will perform, if desired, smoothing by averaging of scans, so that the user does not have to worry about how exactly to proceed. Often this is then followed by applying Savitzky-Golay smoothing, which is also usually present in the software of the instrument. If the analyst decides to carry out the smoothing with other software, then care must be taken not to distort the signal.

Differentiation can be used to enhance spectral differences. Second derivatives remove constant and linear background at the same time. An example is shown in figure 2-b,c. Both first and second derivatives are used, but second derivatives seem to be applied more frequently. A possible reason for their popularity is that they have troughs (inverse peaks) at the location of the original peaks; this is not the case for first derivatives.
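For routine use, an implementation based on the tabulated coefficients, such as SciPy's savgol_filter, carries out the same moving-window polynomial fit as eq. (7) and also provides the derivatives discussed next. The synthetic spectrum and the window sizes below (which echo those of Fig. 1) are purely illustrative.

import numpy as np
from scipy.signal import savgol_filter

wavelengths = np.arange(1100, 2500, 2.0)      # illustrative 2 nm grid
rng = np.random.default_rng(0)
spectrum = np.exp(-((wavelengths - 1900) / 120.0) ** 2) + 0.01 * rng.normal(size=wavelengths.size)

smoothed = savgol_filter(spectrum, window_length=7, polyorder=3)                    # 2m+1 = 7, cubic
first_d  = savgol_filter(spectrum, window_length=13, polyorder=3, deriv=1, delta=2.0)
second_d = savgol_filter(spectrum, window_length=13, polyorder=3, deriv=2, delta=2.0)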
In principle, the differentiation of the data is obtained by using the appropriate derivative of the polynomial used to fit the data in each window (Fig. 1-c,d). In practice, tables [18,21] or computer algorithms [19,20] are used to obtain the coefficients c_k, which are used in the same way as for eq. (7). Alternatively, the differentials can be calculated from the differences in absorbance between two wavelengths separated by a small fixed distance known as the gap. One drawback of the use of derivatives is that they decrease the SNR by enhancing the noise; for that reason smoothing is needed before differentiation. The higher the degree of differentiation used, the higher the degradation of the SNR. In addition, and this is also true for smoothing data with the Savitzky-Golay method, it is assumed that the points are obtained at uniform intervals, which is not always necessarily true. Another drawback [23] is that calibration models obtained with spectra pre-treated by differentiation are sometimes less robust to instrumental changes, such as the wavelength shifts which may occur over time, and such changes are less easily corrected for.

Constant background differences can be eliminated by using offset correction. Each spectrum is corrected by subtracting either its absorbance at the first wavelength (or at another arbitrary wavelength) or the mean value in a selected range (Fig. 2-d).

Fig. 2. NIR spectra of different wheat samples and several pre-processing methods applied to them: a) original data, b) 1st derivative, c) 2nd derivative, d) offset corrected, e) SNV corrected, f) de-trended, g) de-trended + SNV corrected, h) MSC corrected.

An interesting method is the one based on contrasts, as proposed by Spiegelman [24,25]. A contrast is the difference between the absorbances at two wavelengths. The differences between the absorbances at all pairs of wavelengths are computed and used as variables. In this way offset-corrected wavelengths and derivatives (differences between wavelengths close to each other) are included, as well as differences between two peak wavelengths, etc. A difficulty is that the number of contrasts equals p(p-1)/2, which soon becomes very large: 1000 wavelengths, for example, give almost 500,000 contrasts. At the moment there is insufficient experience to evaluate this method.

Other methods that can be used are based on transforms such as the Fourier transform or the wavelet transform. Multivariate calibration using MLR on Fourier coefficients was compared with PCR (MLR applied to the scores on principal components) [26]. Methods based on the use of wavelet coefficients have also been described [27]. One can first smooth the signal by applying Fourier or wavelet transforms to the signal [28] and then apply MLR to the smoothed signal. MLR can also be applied directly to the Fourier or the wavelet coefficients, which is probably a preferable approach. For NIR this does not seem useful because the signal contains little random (white) noise, so that the simpler techniques described above are usually considered sufficient.

3.3. Methods specific for NIR

The following methods are applied specifically to NIR data of solid samples. Variation between individual NIR diffuse reflectance spectra is the result of three main sources:
• non-specific scatter of radiation at the surface of particles;
• variable spectral path length through the sample,
• chemical composition of the sample.

In calibration we are interested only in the last source of variance. One of the major reasons for carrying out pre-processing of such data is to eliminate or minimise the effects of the other two sources. For this purpose, several approaches are possible.

Multiplicative Scatter (or Signal) Correction (MSC) has been proposed [29-31]. The light scattering or change in path length for each sample is estimated relative to that of an ideal sample. In principle this estimation should be done on a part of the spectrum which does not contain chemical information, i.e. which is influenced only by the light scattering. However, the areas in the spectrum that hold no chemical information often correspond to the spectral background, where the SNR may be poor. In practice the whole spectrum is therefore sometimes used. This can be done provided that the chemical differences between the samples are small. Each spectrum is then corrected so that all samples appear to have the same scatter level as the ideal. As an estimate of the ideal sample, we can use for instance the average spectrum of the calibration set. MSC performs best if an offset correction is carried out first. For each sample:

x_i = a + b \bar{x} + e   (8)

where x_i is the NIR spectrum of the sample and \bar{x} symbolises the spectrum of the ideal sample (the mean spectrum of the calibration set). For each sample, a and b are estimated by ordinary least-squares regression of spectrum x_i vs. spectrum \bar{x} over the available wavelengths. Each value x_{ij} of the corrected spectrum x_i(MSC) is calculated as:

x_{ij}(MSC) = (x_{ij} - a) / b ;  j = 1, 2, ..., p   (9)

The mean spectrum must be stored in order to transform future spectra in the same way (Fig. 2-h).

Standard Normal Variate (SNV) transformation has also been proposed for removing the multiplicative interference of scatter and particle size [32,33]. An example data set is given in figure 2-a, where several samples of wheat were measured. SNV is designed to operate on individual sample spectra. The SNV transformation centres each spectrum and then scales it by its own standard deviation:

x_{ij}(SNV) = (x_{ij} - \bar{x}_i) / SD ;  j = 1, 2, ..., p   (10)

where x_{ij} is the absorbance value of spectrum i measured at wavelength j, \bar{x}_i is the mean absorbance value of the uncorrected i-th spectrum and SD is the standard deviation of the p absorbance values, SD = \sqrt{\sum_{j=1}^{p} (x_{ij} - \bar{x}_i)^2 / (p - 1)}. Spectra treated in this manner (Fig. 2-e) always have zero mean and variance equal to one, and are thus independent of the original absorbance values.

De-trending of spectra accounts for the variation in baseline shift and curvilinearity of powdered or densely packed samples by using a second-degree polynomial to correct the data [32]. De-trending operates on individual spectra. The global absorbance of NIR spectra generally increases linearly with wavelength, but it increases curvilinearly for the spectra of densely packed samples. A second-degree polynomial can be used to standardise the variation in curvilinearity:

x_i = a \lambda^{*2} + b \lambda^{*} + c + e_i   (11)

where x_i symbolises the individual NIR spectrum and \lambda^{*} the wavelength. For each sample, a, b and c are estimated by ordinary least-squares regression of spectrum x_i vs. wavelength over the range of wavelengths. The corrected spectrum x_i(DTR) is calculated by:

x_i(DTR) = x_i - a \lambda^{*2} - b \lambda^{*} - c = e_i   (12)

Normally de-trending is used after SNV transformation (Fig. 2-f,g). Second derivatives can also be employed to decrease baseline shifts and curvilinearity, but in this case the noise and the complexity of the spectra increase. It has been demonstrated that MSC- and SNV-transformed spectra are closely related and that the difference in prediction ability between these methods seems to be fairly small [34,35].
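The following sketch illustrates MSC (eqns 8-9), SNV (eqn 10) and de-trending (eqns 11-12) for a matrix of spectra, assuming only NumPy; the function and array names are illustrative, and the reference spectrum is simply the calibration-set mean, as suggested above.

```python
# A minimal sketch of MSC, SNV and de-trending for a matrix of spectra
# (rows = samples, columns = wavelengths); not the thesis' own code.
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum
    (by default the mean spectrum of the calibration set, eqns 8-9)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, x in enumerate(spectra):
        b, a = np.polyfit(ref, x, deg=1)      # x ~ a + b * ref
        corrected[i] = (x - a) / b
    return corrected, ref                      # keep ref to correct future spectra

def snv(spectra):
    """Standard normal variate: centre each spectrum and scale it by its
    own standard deviation (eqn 10, ddof=1 for the (p-1) denominator)."""
    mean = spectra.mean(axis=1, keepdims=True)
    sd = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / sd

def detrend(spectra, wavelengths):
    """Second-degree polynomial de-trending of each spectrum (eqns 11-12)."""
    corrected = np.empty_like(spectra, dtype=float)
    for i, x in enumerate(spectra):
        coeffs = np.polyfit(wavelengths, x, deg=2)
        corrected[i] = x - np.polyval(coeffs, wavelengths)
    return corrected
```

In line with Fig. 2-g, applying `snv` followed by `detrend` reproduces the SNV+detrend treatment, and the reference spectrum returned by `msc` is what must be stored to correct future samples.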
3.4. Selection of pre-processing methods in NIR

The best pre-processing method will be the one that finally produces a robust model with the best predictive ability. Unfortunately there seem to be no hard rules to decide which pre-processing to use, and often the only approach is trial and error. The development of a methodology that would allow a systematic approach would be very useful. It is, however, possible to obtain some indication during pre-processing. For instance, if replicate spectra have been measured, a good pre-processing method will produce minimal differences between replicates [36], though this does not necessarily lead to optimal predictive ability. If only one measurement per sample is available, it can be useful to compute the correlation between each of the original variables and the property of interest and to do the same for the transformed variables (Fig. 3). It is likely that good correlations will lead to good prediction. However, this approach is univariate and therefore does not give a complete picture of predictive ability.

Depending on the physical state of the samples and the trend of the spectra, a background and/or a scatter correction can be applied. If only background correction is required, offset correction is usually preferable over differentiation, because with the former the SNR is not degraded and because differentiation may lead to less robust models over time. If a scatter correction is additionally required, SNV and MSC yield very similar results. An advantage of SNV is that spectra are treated individually, while in MSC one needs to refer to other spectra. When a change is made in the model, e.g. if, because of clustering, it is decided to make two local models instead of one global one, it may be necessary to repeat the MSC pre-processing. Non-linear behaviour between X and y appears (or increases) after some of the pre-processing methods. This is the case for instance for SNV. However, this does not cause problems provided the differences between spectra are relatively small.

Fig. 3. Correlation coefficients between (corrected) absorbance and moisture content for the spectra in fig. 2: a) original data, b) 1st derivative, c) 2nd derivative, d) offset corrected, e) SNV corrected, f) detrend corrected, g) detrend+SNV corrected, h) MSC corrected.

4. Data matrix pre-treatment

Before MLR is performed, some scaling techniques can be used. The most popular pre-treatment, which is nearly always used for spectroscopic data sets, is column-centering. In the x-matrix, by convention, each column represents a wavelength, and column-centering is thus an operation which is carried out for each wavelength over all objects in the calibration set.
It consists of subtracting, for each column, the mean of the column from the individual elements of that column, resulting in a zero mean for the transformed variables and eliminating the need for a constant term in the regression model. The effect of column-centering on prediction in multivariate calibration was studied in [37]. It was concluded that if the optimal number of variables/factors decreases upon centering, a model should be made with mean-centered data. Otherwise, a model should be made with the raw data. Because this cannot be known in advance, it seems reasonable to consider column-centering as a standard operation. For spectroscopic data it is usually the only pre-treatment performed, although sometimes autoscaling (also known as column standardisation) is also employed. In this case, each element of a column-centered table is divided by its corresponding column standard deviation, so that all columns have a variance of one. This type of scaling can be applied in order to obtain an idea about the relative importance of the variables [38], but it is not recommended for general use in spectroscopic multivariate calibration since it unduly inflates the noise in baseline regions. After pre-treatment, the mean (and the standard deviation for autoscaled data) of the calibration set must be stored in order to transform future samples, for which the concentration or other characteristic must be predicted, using the same values.

5. Graphical information

Certain plots should always be made. One of these is simply a plot of all spectra on the same graph (Fig. 2). Evident outliers will become apparent. It is also possible to identify noisy regions and perhaps to exclude them from the model. Another plot that one should always make is the Principal Component Analysis (PCA) score plot. Many books and papers are devoted to PCA [39-41]. PCA is not a new method; it was first described by Pearson in 1901 [42] and by Hotelling in 1933 [43]. Let us suppose that n samples (objects) have been spectroscopically measured at p wavelengths (variables). This information can be written in matrix form as:

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}   (13)

where x_1 = [x_{11} x_{12} ... x_{1p}] is the row vector containing the absorbances measured at the p wavelengths (the spectrum) for the first sample, x_2 is the row vector containing the spectrum for the second sample, and so on. We will assume that the reader is more or less familiar with PCA and that, as is usual for PCA in the context of multivariate calibration, the x-matrix was column-centered (see chapter 4). PCA creates new orthogonal variables (latent variables) that are linear combinations of the original x-variables. This can be achieved by the method known as singular value decomposition (SVD) of X:

X_{n×p} = U_{n×p} Λ_{p×p} P'_{p×p} = T_{n×p} P'_{p×p}   (14)

U is the unweighted (normalised) score matrix and T is the weighted (unnormalised) score matrix. They contain the new variables for the n objects. We can say that they represent the new co-ordinates of the n objects in the new co-ordinate system. P is the loading matrix, and the column vectors of P are called eigenvectors or loading-PCs. The elements of P are the loadings (weights) of the original variables on each eigenvector.
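A minimal sketch of column-centering and of the SVD-based decomposition of eqn (14) is given below, assuming NumPy and an illustrative data matrix; the stored column means are what must be re-used to centre future samples.

```python
# Column-centering and PCA via SVD (eqn 14), with an illustrative matrix X (n x p).
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((30, 200))                    # hypothetical calibration spectra

x_mean = X.mean(axis=0)                      # store to centre future samples
Xc = X - x_mean                              # column-centering

U, s, Pt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                                    # weighted (unnormalised) scores, T = U @ diag(s)
P = Pt.T                                     # loadings: columns are the eigenvectors

explained = s**2 / np.sum(s**2)              # fraction of variance per PC
# Score plots such as t1-t2 can then be drawn from T[:, 0] and T[:, 1].
```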
High loadings of certain original variables on a particular eigenvector mean that these variables are important in the construction of the new variable, or score, on that principal component (PC). Two main advantages arise from this decomposition. The first is that the new variables are orthogonal (U'U = I). This has very important implications in PCR, in particular in the MLR step of the method [6], when variables are correlated. The second is that we can assume that the first new variables or PCs, which account for the majority of the variance of the original data, contain meaningful information, while the last ones, which account for only a small amount of variance, contain mainly noise and can be deleted.

Since PCA produces new variables such that the largest amount of variance is explained by the first eigenvectors, the score plots can be used to give a good representation of the data. By using a small number of score plots (e.g. t1-t2, t1-t3, t2-t3), useful visual information can be obtained about the data distribution, inhomogeneities, presence of clusters or outliers, etc. We recommend that this is done both for the centered raw data and for the data after the signal pre-processing chosen in step 3. Plots of the loadings (the contributions of the original variables to the new ones) identify spectral regions that are important in describing the data and those which contain mainly noise, etc. However, the loadings plots should be used only as an indication when it comes to selecting useful variables.

6. Clustering tendency

Clusters are groups of similar objects inside a population. When the population of objects is separated into several clusters, it is not homogeneous. To perform multivariate calibration modelling, the calibration objects should preferably belong to the same population. Often this is not possible, e.g. in the analysis of industrial samples, when these samples belong to different quality grades. The occurrence of clusters may indicate that the objects belong to different populations. This suggests there is a fundamental difference between two or more groups of samples, e.g. two different products are included in the analysis, or a shift or drift has occurred in the measurement technique. When clustering occurs, the reason must be investigated and appropriate action should be taken. If the clustering is not due to instrumental reasons that may be corrected (e.g. two sets of samples were measured at different times and instrumental changes have occurred), then there are two possibilities: to split the data into groups and make a separate model for each cluster, or to keep all objects in the same calibration model. The advantage of splitting the data is that one obtains more homogeneous populations and therefore, one hopes, better models. However, it also has disadvantages. There will be fewer calibration objects for each model, and it is also considerably less practical, since it is necessary to optimise and validate two or more models instead of one. When a new sample is predicted, one must first determine to which cluster it belongs before one can start the actual prediction. Another disadvantage is that the range of y-values can be reduced, leading to less stable models. For these reasons, it is usually preferable to make a single model. The price one pays in doing this is a more complex and therefore potentially less robust model.
Indeed, the model will contain two types of variables: variables that contain information common to the two clusters and therefore have similar importance for both, and variables that correct for the bias between the two clusters. Variables belonging to the second type are often due to peaks in the spectrum that are present in the objects belonging to one cluster and absent, or much weaker, in the other objects. An example where two clusters occur is presented in [44]. Some of the variables selected are directly related to the property to be measured in both clusters, whereas others are related to the presence or absence of one peak. This peak is due to a difference in chemical structure and is responsible for the clustering. The inclusion of the latter variables takes this difference into account and improves the predictive ability of the model, but also increases its complexity.

Clustering techniques have been studied exhaustively (see a review of methods in [45]). Their results can, for example, be presented as dendrograms. However, in multivariate calibration model development, we are less interested in the actual detailed clustering than in deciding whether significant clusters occur at all. For this reason there is little value in carrying out a full clustering: we merely want to be sure that we will be aware of significant clustering if it occurs.

The presence of clusters may be due to the y-variable. If the y-values are available at this step, they can be assessed with a simple plot of the y-values. If the distribution is distinctly bimodal, then there are two clusters in y, which should be reflected by two clusters in X. If y-clustering occurs, one should investigate the reason for it. If objects with y-values intermediate between the two clusters are available, they should be added to the calibration and test sets. If this is not the case, and the clustering is very strong (Fig. 4), one should realise that the model will be dominated by the differences between the clusters rather than by the differences within clusters. It might then be better to make a model for each cluster or, instead of MLR, to use a method that is designed to work with very heterogeneous data, such as locally weighted regression (LWR) [31,46].

Fig. 4. An example of strongly clustered data.

The simplest way to detect clustering in the x-data is to apply PCA and to look at the score plots. In some cases, the clustering will become apparent only in plots of higher PCs, so that one must always look at several score plots. For this reason, a method such as the one proposed by Szcubialka et al. [47] may have advantages. In this method, the distances between an object and all other objects are computed, ranked and plotted. This is done for each of the objects. The graph obtained is then compared with the distances computed in the same way for objects belonging to a normal or to a homogeneous distribution. A simple example is shown in figure 5, where the distance curves for a clustered situation are compared with those for a homogeneous distribution of the samples.

Fig. 5. a) Plot of two hundred objects normally distributed in two variables x1 and x2; b) the distance curves of the two hundred normally distributed objects; c) clustered data, normally distributed within each cluster; d) the distance curves of the clustered data.
If a numerical indicator is preferred, the Hopkins index for clustering tendency (H_ind) can be applied. This statistic examines whether the objects in a data set differ significantly from the assumption that they are uniformly distributed in the multidimensional space [15,48,49]. It compares the distances w_i between the real objects and their nearest neighbours to the distances q_i between artificial objects, uniformly generated over the data space, and their nearest real neighbours. The process is repeated several times for a fraction of the total population. After that, the H_ind statistic is computed as:

H_ind = \frac{\sum_{i=1}^{n} q_i}{\sum_{i=1}^{n} q_i + \sum_{i=1}^{n} w_i}   (15)

If the objects are uniformly distributed, q_i and w_i will be similar, and the statistic will be close to 1/2. If clusters are present, the distances for the artificial objects will be larger than for the real ones, because the artificial objects are homogeneously distributed whereas the real ones are grouped together, and the value of H_ind will increase. A value of H_ind higher than 3/4 indicates a clustering tendency at the 90% confidence level [49]. Figures 6-a and 6-b show the application of the Hopkins statistic, i.e. how the q_i- and w_i-values are computed, for two different data sets, the first unclustered and the second clustered. Because the artificial data set is homogeneously generated inside a square box that covers all the real objects, with co-ordinates determined by the most extreme points, an unclustered data set lying along the diagonal of the reference axes (Fig. 6-c) might lead to a false detection of clustering [50]. For this reason, the statistic should be determined on the PCA scores. After PCA of the data, the new axes will lie in the direction of maximum variance, in this case coincident with the main diagonal (Fig. 6-d). Since an outlier in the X-space is effectively a cluster, the Hopkins statistic could also detect a false clustering tendency in such a case. A modification of the original statistic has been proposed in [49] to minimise false positives. Further modifications were proposed by Forina et al. [50].

Fig. 6. Hopkins statistic applied to different data sets. Open circles represent real objects, closed circles selected real objects, and asterisks artificial objects generated over the data space. a) H = 0.49; b) H = 0.73; c) H = 0.69; d) H = 0.56 (the same data set as in c, after PCA rotation).

Clusters can become more obvious upon data pre-treatment. For instance, a cluster which is not visible in the raw data may become more apparent when applying SNV. Consequently it is better to carry out the investigations concerning clustering on the data pre-treated as intended prior to modelling.
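The sketch below is one possible implementation of the Hopkins statistic of eqn (15), assuming NumPy; the sampling fraction, the bounding-box generation of artificial objects and the variable names are illustrative choices, not prescriptions from this work.

```python
# A minimal sketch of the Hopkins clustering-tendency statistic (eqn 15).
import numpy as np

def hopkins(X, frac=0.1, rng=None):
    rng = np.random.default_rng(rng)
    n, p = X.shape
    m = max(1, int(frac * n))                       # number of probe points
    # Artificial objects, uniform inside the bounding box of the real data.
    lo, hi = X.min(axis=0), X.max(axis=0)
    art = rng.uniform(lo, hi, size=(m, p))
    # Randomly chosen real objects used as probes.
    idx = rng.choice(n, size=m, replace=False)
    def nn_dist(points, exclude_self):
        d = np.linalg.norm(points[:, None, :] - X[None, :, :], axis=2)
        if exclude_self is not None:
            d[np.arange(m), exclude_self] = np.inf  # ignore the distance to itself
        return d.min(axis=1)
    q = nn_dist(art, None)                          # artificial -> nearest real object
    w = nn_dist(X[idx], idx)                        # real -> nearest other real object
    return q.sum() / (q.sum() + w.sum())            # ~0.5 uniform, -> 1 when clustered

rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])  # two clusters
H = hopkins(data, rng=rng)                          # H > 0.75 suggests clustering
```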
7. Detection of extreme samples

MLR is a least-squares based method and is for this reason sensitive to the presence of outliers. We distinguish between two types of outliers: outliers in the x-space and outliers towards the model. Moreover, we can consider outliers in the y-space. The difference is shown in figure 7. Outliers in the x-space are points lying far away from the rest when looking at the x-values only. This means we do not use knowledge about the relationship between X and y. Outliers towards the model are those that present a different relationship between X and y, or in other words, samples that do not fit the model. An object can also be an outlier in y, i.e. it can present extreme values of the concentration to be modelled. If an object is extreme in y, it is probably also extreme in X.

Fig. 7. Illustration of the different kinds of outliers: (*1) outlier in X and outlier towards the model, (*2) outlier in y and towards the model, (*3) outlier towards the model, (*4) outlier in X and y.

At this stage of the process, we have not yet developed the model and therefore cannot identify outliers towards the model. However, we can already look for outliers in X and in y separately. Detection of outliers in y is a univariate problem that can be handled with the usual univariate tests, such as the Grubbs [51,52,15] or the Dixon [5,15] test. Outliers in X are multivariate and therefore represent a more challenging problem. Our strategy will be to identify the extreme objects in X, i.e. to identify objects with extreme characteristics, and to apply a test to decide whether they should be considered outliers or not. Once the outliers have been identified, we must decide whether to eliminate them or simply flag them for examination after the model is developed, so that we can also look at outliers towards the model. In taking this decision, it may be useful to investigate whether the same object is an outlier in both y and X. If an object is outlying in concentration (y) but is not extreme in its spectral characteristics (X), then it will probably prove to be an outlier towards the model at a later stage (chapter 13), and it will be necessary at the minimum to make models with and without the object. A decision to eliminate the object at this stage may save work.

Extreme samples in the x-space can be due to measurement or handling errors, in which case they should be eliminated. They can also be due to the presence of samples that belong to another population, to impurities in one sample that are not present in the other samples, or to a sample with extreme amounts of constituents (i.e. with a very high or low quantity of analyte). In these cases it may be appropriate to include the sample in the model, as it represents a composition that could be encountered during the prediction stage. We therefore have to investigate why the outlier presents extreme behaviour, and at this stage it can be discarded only if it can be shown to be of no value to the model or detrimental to it. We should be aware, however, that extreme samples will always have a larger influence on the model than other samples. Extreme samples in the x-space will probably have extreme values for some variables, which will have an extreme (and possibly deleterious) effect in the regression.

The extreme behaviour of an object i in the x-space can be measured by using the leverage value. This measure is closely related to the Mahalanobis distance (MD) [53,54] and can be seen as a measure of the distance of the object to the centroid of the data. Points close to the centre provide less information for building the model than extreme points.
However, outliers at the extremes are more dangerous than those close to the centre. High-leverage points are called bad high-leverage points if they are outliers towards the model. If they fit the true model, they stabilise it and make it more precise; they are then called good high-leverage points. At this stage, however, we will rarely be able to distinguish between good and bad leverage. In the original space, leverage values are computed as:

H = X (X'X)^{-1} X'   (16)

H is called the hat matrix. The diagonal elements of H, h_{ii}, are the leverage values for the different objects i. If there are more variables than objects, as is probable for spectroscopic data, X'X cannot be inverted. The leverage can then be computed in the PC space. There are two ways to compute the leverage of an object i in the PC space. The first is given by:

h_i = \sum_{k=1}^{a} \frac{t_{ik}^2}{\lambda_k^2}   (17)

h_i = \frac{1}{n} + \sum_{k=1}^{a} \frac{t_{ik}^2}{\lambda_k^2}   (18)

a being the minimum of n and p, t_{ik} the weighted score of object i on PC k and \lambda_k^2 the eigenvalue of PC k. The correction by the value 1/n in eqn (18) is used when column-centered data are employed, as is usual in PCA. In that case a = min(n-1, p). The leverage values can also be obtained by applying an equation equivalent to eqn (16):

H = T (T'T)^{-1} T'   (19)

where T is the matrix of the weighted (unnormalised) scores obtained after PCA of X. Instead of using all the PCs, one can use only the significant ones. Suppose that r PCs have been selected as significant, for instance based on the total percentage of variance they explain [8]. The total leverage can then be decomposed into contributions due to the significant eigenvectors and the non-significant ones [53]:

h_i = \sum_{k=1}^{a} \frac{t_{ik}^2}{\lambda_k^2} = \sum_{k=1}^{r} \frac{t_{ik}^2}{\lambda_k^2} + \sum_{k=r+1}^{a} \frac{t_{ik}^2}{\lambda_k^2} = h_i^1 + h_i^2   (20)

For centered data the same correction with 1/n as in eqn (18) is applied. h_i^1 can also be obtained by using eqn (19) with T being the matrix of the weighted scores from PC1 to PCr. Because we are only interested in the first r PCs, h_i^1 seems a more natural leverage concept than h_i, and complications derived from including noisy PCs are avoided. The value r/n ((r+1)/n for centered data) is called the average partial leverage. If the leverage of an extreme object exceeds it by a certain factor, the object is considered to be an outlier. As outlier detection limit one can then set, for example, h_i^1 > constant × r/n, where the constant often equals 2.

The leverage is related to the squared Mahalanobis distance of object i to the centre of the calibration data. One can compute the squared Mahalanobis distance from the covariance matrix C:

MD_i^2 = (x_i - \bar{x}) C^{-1} (x_i - \bar{x})' = (n-1) (h_i - \frac{1}{n})   (21)

where C is computed as

C = \frac{1}{n-1} X'X   (22)

X being, as usual, the mean-centered data matrix. As for the leverage, when the number of variables exceeds the number of objects, C becomes singular and cannot be inverted. There are then two ways to calculate the Mahalanobis distance in the PC space, using either all a PCs or only the r significant ones:

MD_i^2 = (n-1) \sum_{k=1}^{a} \frac{t_{ik}^2}{\lambda_k^2} = (n-1) (h_i - \frac{1}{n})   (23)

MD_i^2 = (n-1) \sum_{k=1}^{r} \frac{t_{ik}^2}{\lambda_k^2} = (n-1) (h_i^1 - \frac{1}{n})   (24)

where h_i and h_i^1 are computed using the centered data.
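As an illustration of eqns (18), (19) and (24), the following sketch computes partial leverages and squared Mahalanobis distances in the PC space with NumPy; the number of significant PCs and the detection threshold (twice the average partial leverage) are assumptions made for the example, not values from this work.

```python
# A minimal sketch of leverage and Mahalanobis distance in the PC space,
# for a column-centered matrix Xc with illustrative data.
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((30, 200))
Xc = X - X.mean(axis=0)                       # column-centering
n = Xc.shape[0]

U, s, Pt = np.linalg.svd(Xc, full_matrices=False)
r = 3                                         # number of significant PCs (assumed)
T_r = (U * s)[:, :r]                          # weighted scores on the first r PCs

# Partial leverage h_i^1 via the hat matrix of the scores (eqn 19) plus 1/n.
H = T_r @ np.linalg.inv(T_r.T @ T_r) @ T_r.T
h1 = np.diag(H) + 1.0 / n

md2 = (n - 1) * (h1 - 1.0 / n)                # squared Mahalanobis distance (eqn 24)

suspect = np.where(h1 > 2 * (r + 1) / n)[0]   # candidate x-space outliers
```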
X-space outlier detection can also be performed in the PC space with Rao's statistic [55]. Rao's statistic sums all the variation from a certain PC onwards. If there are a PCs and we consider the variation beyond the first r PCs, then:

D_i^2 = \sum_{k=r+1}^{a} t_{ik}^2   (25)

A high value of D_i^2 means that object i shows a high score on some of the PCs that were not included and therefore cannot be explained completely by the first r PCs. For this reason it is then suspected to be an outlier. The method is presented here because it uses only information about X. The way in which Rao's statistic is normally used requires the number of PCs entered in the model; this number is put equal to r. To estimate this number of PCs, one can follow the D value as a function of r, starting from r = 0. A high value of D up to large r indicates that the object is modelled correctly only when higher PCs are included. If the number of necessary PCs is higher for this object than for the others, it will be an outlier. A test for the significance of high values of Rao's statistic can be applied by using these values as input data for the single-outlier Grubbs' test [15]:

z = \frac{D_{test}^2 - \overline{D^2}}{s_{D^2}}   (26)

where \overline{D^2} is the mean and s_{D^2} the standard deviation of the n D_i^2 values.

Because the information provided by each of these methods is not necessarily the same, we recommend that more than one is used, for example by studying both the leverage values and Rao's statistic with Grubbs' test, in order to check whether the same objects are detected. Unfortunately, outlier detection is not easy. This is certainly the case if more than one outlier is present. In that case all the above methods are subject to what is called masking and swamping. Masking occurs when an outlier goes undetected because of the presence of another, usually adjacent, one. Swamping occurs when good observations are incorrectly identified as outliers because of the presence of another, usually remote, subset of outliers (Fig. 8). Masking and swamping occur because the mean and the covariance matrix are not robust to outliers.

Fig. 8. Due to the remote set of outliers (the 4 upper objects), there is a swamping effect on the object marked (*).

Robust methods have been described [56]. Probably the best way to avoid the lack of robustness of the leverage measures is to use the Minimum Volume Ellipsoid (MVE) estimator, based on the minimum-volume ellipsoid covering at least (n/2)+1 points of X. It can be understood as the selection of a subset of objects without outliers in it: a clean subset. In this way, one avoids the measured leverage being affected by the outliers. In fact, in eqn (21) all objects, the outliers included, are used, so that the outliers influence the very criterion that will be used to determine whether an object is an outlier. For instance, when an outlier is included in a set of data, it influences the mean values of the variables characterising that set. With the MVE, the densest domain in the x-space containing a given number of samples is selected. This domain does not include the possible outliers, so that they do not influence the criterion. An algorithm to find the MVE is given in [57-60]. Leverage measures based on this subset are not affected by the masking and swamping effects. A simulation study showed that in more than 90% of the cases the proposed algorithm led to the correct identification of x-space outliers, without masked or swamped observations [60].
For this reason, the MVE is probably the best methodology to use, but it should be noted that there is little practical experience with its application. To apply the algorithm, the number of objects in the data set must be at least three times higher than the number of selected latent variables.

A method of an entirely different type is the potential method proposed by Jouan-Rimbaud et al. [61]. Potential methods first create so-called potential functions around each individual object. These functions are then summed (Fig. 9). In dense zones, large potentials are created, while the potential of outliers does not add to that of other objects, so that they can be detected in this way. An advantage is that special objects within the x-domain are also detected, for instance an isolated object between two clusters. Such objects (we call them inliers) can in certain circumstances have the same effect as outliers. A disadvantage is that the width of the potential functions around each object has to be adjusted. It cannot be too small, because many objects would then be isolated; it cannot be too large, because all objects would then be part of one global potential function. Moreover, while the method does very well in flagging the more extreme objects, a decision on their rejection cannot be taken easily.

Fig. 9. Adapted from D. Bouveresse, doctoral thesis (1997), Vrije Universiteit Brussel: contour plot corresponding to k=4 with the 10% percentile method and with (*) the identified inlier.

8. Selection and representativity of the calibration sample subset

Because the model has to be used for the prediction of new samples, all possible sources of variation that may be encountered later must be included in the calibration set. This means that the chemical components present in the samples must be included in the calibration set, with a range of variation in concentration at least as wide as, and preferably wider than, the one expected for the samples to be analysed; that sources of variation such as different origins or different batches are included; and that possible physical variations among samples (e.g. different temperatures, different densities) are also covered. In addition, it is evident that the higher the number of samples in the calibration set, the lower the prediction error [62]. In this sense, a selection of samples from a larger set is contra-indicated. However, while a random selection of samples may approach a normal distribution, a selection procedure that selects samples more or less equally distributed over the calibration space will lead to a flat distribution. For an equal number of samples, such a distribution is more favourable from a regression point of view than the normal distribution, so that the loss of predictive quality may be smaller than expected by looking only at the reduction of the number of samples [63]. Also, from an experimental point of view, there is a practical limit to what is possible. While the NIR analysis is often simple and not costly, this cannot usually be said of the reference method. It is therefore necessary to achieve a compromise between the number of samples to be analysed and the prediction error that can be reached. It is advisable to spend some of the available resources on obtaining at least some replicates, in order to provide information about the precision of the model (chapter 2).
When it is possible to artificially generate a number of samples, experimental design can and should be used to decide on the composition of the calibration samples [1]. When analysing tablets, for instance, one can make tablets with varying concentrations of the components and varying compression forces, according to an experimental design. Even then, it is advisable to include samples from the process itself to make sure that unexpected sources of variation are included. In the tablet example, it is for instance unlikely that the tablets for the experimental design would be made with the same tablet press as those from the production process, and this can have an effect on the NIR spectrum [64]. In most cases only real samples are available, so that an experimental design is not possible. This is the case for the analysis of natural products and for most samples coming from an industrial production process. One question then arises: how to select the calibration samples so that they are representative of the group. When many samples are available, we can first measure their spectra and select a representative set that covers the calibration space (x-space) as well as possible. Normally such a set should also represent the y-space well; this should preferably be verified. The chemical analysis with the reference method, which is often the most expensive step, can then be restricted to the selected samples.

Several approaches are available for selecting representative calibration samples. The simplest is random selection, but it is open to the possibility that some source of variation will be lost. Such sources are often represented by samples that are less common and therefore have little probability of being selected. A second possibility is based on knowledge about the problem. If one is confident that all the sources of variation are known, samples can be selected on the basis of that knowledge. However, this situation is rare, and it is very possible that some source of variation will be forgotten.

One algorithm that can be used for the selection is based on the D-optimal concept [65,66]. The D-optimal criterion minimises the variance of the regression coefficients. It can be shown that this is equivalent to maximising the determinant of the variance-covariance matrix, selecting samples such that the variance is maximised and the correlation minimised. The criterion comes from multivariate regression and experimental design. In our context, the variance maximisation leads to the selection of samples with relatively extreme characteristics, located on the borders of the calibration domain.

Kennard and Stone proposed a sequential method that should cover the experimental region uniformly and that was meant for use in experimental design [67]. The procedure consists of selecting as the next sample (candidate object) the one that is most distant from the already selected objects (calibration objects). The distance is usually the Euclidean distance, although it is possible, and probably better, to use the Mahalanobis distance. The distances are usually calculated in the PC space, since spectroscopic data tend to generate a high number of variables. As starting points we either select the two objects that are most distant from each other or, preferably, the one closest to the mean. From all the candidate points, the one that is furthest from those already selected is chosen and added to the set of calibration points.
To do this, we measure the distance from each candidate point i_0 to each point i that has already been selected and determine the smallest of these distances, \min_i(d_{i,i_0}). From all candidate points we then select the one for which this smallest distance is maximal, d_{selected} = \max_{i_0}(\min_i(d_{i,i_0})). In the absence of strong irregularities in the factor space, the procedure first selects a set of points close to those selected by the D-optimality method, i.e. on the borderline of the data set (plus the centre point, if this is chosen as the starting point), and then proceeds to fill up the calibration space. Kennard and Stone called their procedure a uniform mapping algorithm; it yields a flat distribution of the data, which, as explained earlier, is preferable for a regression model.

Næs proposed a procedure based on cluster analysis. The clustering is continued until the number of clusters matches the number of calibration samples desired [68]. From each cluster, the object that is furthest away from the mean is selected. In this way the extremes are covered, but not necessarily the centre of the data.

In the method proposed by Puchwein [69], the first step consists in sorting the samples according to their Mahalanobis distance to the centre of the set and selecting the most extreme point. A limiting distance is then chosen, and all the samples that are closer to the selected point than this distance are excluded. The sample that is most extreme among the remaining points is selected and the procedure repeated, choosing each time the most distant remaining point, until there are no data points left. The number of selected points depends on the size of the limiting distance: if it is small, many points will be included; if it is large, very few. The procedure must therefore be repeated several times with different limiting distances until the limiting distance is found for which the desired number of samples is selected.

Figure 10 shows the results of applying these four algorithms to a 2-dimensional data set of 250 objects, designed not to be homogeneous. Clearly, the D-optimal design selects points in a completely different way from the other algorithms. The Kennard-Stone and Puchwein algorithms provide similar results. The Næs method does not cover the centre. Other methods have been proposed, such as "unique-sample selection" [70]. The results obtained seem similar to those obtained with the previously cited methods.

An important question is how many samples must be included in the calibration set. This number must be selected by the analyst and is related to the final complexity of the model. The term complexity should be understood as the number of variables or PCs included, plus the number of quadratic and interaction terms. An ASTM standard states that, if the complexity is smaller than three, at least 24 samples must be used; if it is equal to or greater than four, at least 6 objects per degree of complexity are needed [58,71].

Fig. 10. The first 24 points selected using different algorithms: a) D-optimal design (optimal design with the three points denoted by closed circles), b) Puchwein method, c) Kennard & Stone method (closest point to the mean included), d) Næs clustering method, e) DUPLEX method with (o) the calibration set and (*) the test set.
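A compact sketch of the Kennard-Stone max-min rule described above is given below, assuming NumPy, Euclidean distances and the object closest to the mean as starting point; it is an illustration, not the original authors' implementation.

```python
# A minimal sketch of Kennard-Stone selection of calibration samples.
import numpy as np

def kennard_stone(X, n_select):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # all pairwise distances
    selected = [int(np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    candidates = set(range(X.shape[0])) - set(selected)
    while len(selected) < n_select:
        # For each candidate, distance to its nearest already-selected point ...
        cand = np.array(sorted(candidates))
        nearest = d[np.ix_(cand, selected)].min(axis=1)
        # ... and keep the candidate for which this distance is maximal.
        chosen = int(cand[np.argmax(nearest)])
        selected.append(chosen)
        candidates.remove(chosen)
    return selected

rng = np.random.default_rng(3)
scores = rng.random((250, 2))                # e.g. scores on the first two PCs
calibration_idx = kennard_stone(scores, 24)
```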
In Chapter 11 we state that the model optimisation (validation) step requires that different independent sub-sets are created. Two sub-sets are often needed. At first sight, we might use one of the selection algorithms described above to split up the calibration set for this purpose. However, because of the sample selection step, the sub-sets would no longer be independent unless random selection is applied. Validation in such circumstances might lead us to underestimate prediction errors [72]. A selection method which appears to overcome this drawback is a modification by Snee of the Kennard-Stone method, called the DUPLEX method [73]. In the first step, the two points which are furthest away from each other are selected for the calibration set. From the remaining points, the two objects which are furthest away from each other are included in the test set. In the third step, the remaining point which is furthest away from the two points previously selected for the calibration set is included in that set. The procedure is repeated by selecting for the test set the single point which is furthest from the existing points in that set. Following the same procedure, points are added alternately to each set. This approach selects representative calibration and test data sets of equal size. In figure 10 the result of applying the DUPLEX method is also presented.

Of all the proposed methodologies, the Kennard-Stone, DUPLEX and Puchwein methods need the least a priori knowledge. In addition, they provide a calibration set homogeneously distributed in space (flat distribution). However, Puchwein's method must be applied several times. The DUPLEX method seems to be the best way to select representative calibration and test data sets in a validation context.

Once the calibration set has been selected, several tests can be employed to determine the representativity of the selected objects with respect to the total set [74]. This appears to be unnecessary if one of the algorithms recommended for the selection of the calibration samples has been applied. In practice, however, little attention is often paid to proper selection. For instance, it may be that the analyst simply takes the first n samples for the calibration set. In this case a representativity test is necessary. One possibility is to obtain PC score plots and to compare visually the selected set of calibration samples to the whole set. This is difficult when there are many relevant PCs. In such cases a more formal approach can be useful. We proposed an approach that includes the determination of three different characteristics [75]. The first one checks whether both sets have the same direction in the space of the PCs. The directions are compared by computing the scalar product of two direction vectors obtained from the PCA decomposition of both data sets. To do this, the normed scalar product between the vectors d_1 and d_2 is obtained:

P = \frac{d_1' d_2}{\|d_1\| \, \|d_2\|}   (27)

where d_1 and d_2 are the average direction vectors for each data set:

d_1 = \sum_{i=1}^{r} \lambda_{1,i}^2 \, p_{1,i}   and   d_2 = \sum_{i=1}^{r} \lambda_{2,i}^2 \, p_{2,i}   (28)

where \lambda_{1,i}^2 and p_{1,i} are the corresponding eigenvalues and loading vectors for data set 1, and \lambda_{2,i}^2 and p_{2,i} those for data set 2. If the P value (the cosine of the angle between the directions of the two sets) is higher than 0.7, it can be concluded that the original variables have similar contributions to the latent variables and that the sets are comparable. The second test compares the variance-covariance matrices.
The intention is to determine whether the two data sets have a similar volume, both in magnitude and in direction. The comparison is made using an approximation of Bartlett's test. First the pooled variance-covariance matrix is computed:

C = \frac{(n_1 - 1) C_1 + (n_2 - 1) C_2}{n_1 + n_2 - 2}   (29)

The Box M-statistic is then obtained:

M = \nu \left[ (n_1 - 1) \ln|C_1^{-1} C| + (n_2 - 1) \ln|C_2^{-1} C| \right]   (30)

with

\nu = 1 - \frac{2p^2 + 3p - 1}{6(p + 1)} \left( \frac{1}{n_1 - 1} + \frac{1}{n_2 - 1} - \frac{1}{n_1 + n_2 - 2} \right)   (31)

and the parameter CV is defined as:

CV = e^{-M/(n_1 + n_2 - 2)}   (32)

If CV is close to 1, both the volume and the direction of the data sets are comparable. The third and last test compares the data set centroids. To do this, the squared Mahalanobis distance D^2 between the means of the two data sets is computed:

D^2 = (\bar{x}_1 - \bar{x}_2)' C^{-1} (\bar{x}_1 - \bar{x}_2)   (33)

with C the pooled variance-covariance matrix of eqn (29); from this value, a parameter F is defined as:

F = \frac{n_1 n_2 (n_1 + n_2 - p - 1)}{p (n_1 + n_2)(n_1 + n_2 - 2)} D^2   (34)

F follows a Fisher-Snedecor distribution with p and n_1 + n_2 - p - 1 degrees of freedom. As already stated, these tests are not needed when a selection algorithm is used. With some selection algorithms they would even be contra-indicated. For instance, the test that compares variances cannot be applied to calibration sets selected by the D-optimal design, because the most extreme samples are selected and the calibration set will necessarily have a larger variance than the original set.

9. Non-linearity

Sources of non-linearity in spectroscopic methods are described in [76], and can be summarised as due to:
1 - violations of the Beer-Lambert law
2 - detector non-linearities
3 - stray light
4 - non-linearities in diffuse reflectance/transmittance
5 - chemically-based non-linearities
6 - non-linearities in the property/concentration relationship.

Methods based on ANOVA, proposed by Brown [77] and by Xie et al. (the non-linearity tracking analysis algorithm) [78], detect non-linear variables, which one may decide to delete. There seems to be little expertise available in the practical use of these methods. Moreover, non-linear regions may contain interesting information. The methods should therefore be used only as a diagnostic, signalling that non-linearities occur in specific regions. If it is later found that the MLR model is not as good as was hoped, or is more complex than expected, it may be useful to see whether better results are obtained after elimination of the more non-linear regions.

Most methods for the detection of non-linearity depend on visual evaluation of plots. A classical method is to plot the residuals against y or against the fitted (predicted) response ŷ for the complete model [79,80,54]. The latter is to be preferred, since it removes some of the random error which could make the evaluation more difficult (Fig. 11-b). This is certainly the case when the imprecision of y is relatively large. Non-linearity typically leads to residuals of one sign for most of the samples with mid-range y-values, whereas most of the samples with low or high y-values have residuals of the opposite sign. The runs test [1] examines whether an unusual pattern occurs in a set of residuals. In this context a run is defined as a series of consecutive residuals with the same sign. Figure 11-d would lead to 3 runs and the following pattern: "+ + + + + + + - - - - - - + + +".
From a statistical point of view, long runs are improbable and are considered to indicate a trend in the data, in this case a non-linearity. The test therefore consists of comparing the number of runs with the number of samples. Similarly, the Durbin-Watson test examines the null hypothesis that there is no correlation between successive residuals, in which case no trend occurs. The runs or Durbin-Watson tests should be carried out as a complement to the visual evaluation and not as a replacement.

Fig. 11. Tools for visual detection of non-linearities: a) PRP plot, b) RP plot, c) e-RP plot, d) ApaRP plot.

A classical statistical way to check for non-linearities in one or more variables in multiple linear regression is based on testing whether the model improves significantly when a squared term is added. One compares

y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i   (35)

to

y_i = b_0^* + b_1^* x_i + e_i^*   (36)

x_i being the values of the x-variable investigated for object i. A one-sided F-test can be employed to check whether the improvement of fit is significant. One can also apply a two-sided t-test to check whether b_2 is significantly different from 0. The calculated t-value is compared to the critical t-value with (n-3) degrees of freedom, at the desired level of confidence. It can be noted that this approach can also be applied when the variables added to the linear model are PC scores [2]. All these methods are lack-of-fit methods, and it is probable that they will also indicate lack of fit when the reason is not non-linearity but the presence of outliers. Caution is therefore required. We prefer the runs or the Durbin-Watson tests, in conjunction with visual evaluation of the partial response plot or the Mallows plot.

It should be noted that many of the methods described here require that a model has already been built. In this sense, this chapter should come after chapters 10 and 11. However, we recommend that non-linearity be investigated at least partly before the model is built, by plotting very significant variables if available (e.g. peak maxima in Raman spectroscopy) or the scores of the first PCs as a function of y (e.g. for NIR data). If a clearly non-linear relationship with y is obtained for one of these variables/PCs, it is very probable that a non-linear approach is to be preferred. If no non-linearity is found in this step, then one should, after obtaining a linear model (chapters 10 and 11), check again, e.g. using the Mallows plot and the runs test, to confirm linearity.

10. Building the model

When the variables are not correlated and more samples than variables are available, the model can be built simply using all of the variables. This usually happens for non-spectroscopic data. This situation can, however, also arise in very specific spectroscopic applications, for instance when using a simultaneous ICP-AES instrument equipped with only a few photomultipliers fixed on specific emission wavelengths. In some other particular cases, expert knowledge can be used to select very few variables out of a spectrum. For instance, in Raman or atomic emission spectroscopy, compounds in a mixture can be represented by clean, narrow peaks.
Building the model can then simply consist in selecting the variables corresponding to the maxima of the peaks representative of the product whose concentration has to be predicted. The extreme case is the situation where only one variable is necessary to obtain satisfactory prediction, leading to a univariate model. However, modern spectroscopic instruments usually generate a very high number of variables, exceeding by far the number of available samples (objects). In current applications, and in particular in NIR spectroscopy, variable selection is therefore needed to overcome the problems of matrix underdetermination and correlated variables. Even when more objects than variables are available, it can be interesting to select only the most representative variables in order to obtain a simpler model. In the majority of cases, building the MLR model therefore consists in performing variable selection: finding the subset of variables that has to be used.

10.1. Stepwise approaches

The most classical variable selection approach, which is found in many statistical packages, is called stepwise regression [1,2]. This family of methods consists in optimising the subset of variables used for calibration by adding and/or removing variables one by one from the total set. The so-called forward selection procedure consists in first selecting the variable that is best correlated with y. Suppose this is found to be x_i. The model is at this stage restricted to y = f(x_i). The regression coefficient b obtained from the univariate regression model relating x_i to y is tested for significance using a t-test at the chosen critical level α = 1 or 5%. If it is not found to be significant, the process stops and no model is built. Otherwise, all other variables are tested for inclusion in the model. The variable x_j that will be retained for inclusion together with x_i is the one that, when added to the model, leads to the largest improvement compared to the original univariate model. It is then tested whether the observed improvement is significant. If not, the procedure stops and the model is restricted to y = f(x_i). If the improvement is significant, x_j is definitively incorporated in the model, which becomes bivariate: y = f(x_i, x_j). The procedure is repeated for a third variable to be included in the model, and so on, until finally no further improvement can be obtained.

Several variants of this procedure can be used. In backward elimination, the selection is started with all variables included in the model. The least significant ones are successively eliminated in a way comparable to forward selection. Forward and backward steps can be combined in order to obtain a more sophisticated stepwise selection procedure. As in forward selection, the first variable x_i entered in the model is the one most correlated with the property of interest y. The regression coefficient b obtained from the univariate regression model relating x_i to y is tested for significance. The next step is forward selection: the variable x_j that yields the highest partial correlation coefficient (PCC) is included in the model. The inclusion of a new variable in the model can decrease the contribution of a variable already included and make it non-significant. After each inclusion of a new variable, the significance of the regression terms (b_i x_i) already in the model is therefore tested, and the non-significant terms are eliminated from the equation. This is the backward elimination step. Forward selection and backward elimination are repeated until no improvement of the model can be achieved by including a new variable and all the variables already included are significant. Such stepwise approaches, using both forward and backward steps, are usually the most efficient.
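The sketch below illustrates the forward-selection part of such a procedure, assuming NumPy/SciPy and a partial F-test as entry criterion; it is a simplified illustration rather than a reproduction of any particular statistical package.

```python
# A simplified sketch of forward selection for MLR: a variable is added
# only while the partial F-test for the improvement in fit is significant
# at the chosen level alpha.
import numpy as np
from scipy import stats

def forward_selection(X, y, alpha=0.05):
    n, p = X.shape
    selected, remaining = [], list(range(p))
    def rss(cols):
        A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.sum((y - A @ beta) ** 2)
    rss_old = rss([])
    while remaining:
        # Try each remaining variable and keep the one improving the fit most.
        trial = [(rss(selected + [c]), c) for c in remaining]
        rss_new, best = min(trial)
        df_resid = n - len(selected) - 2            # residual df of the larger model
        if df_resid <= 0:
            break
        F = (rss_old - rss_new) / (rss_new / df_resid)
        if stats.f.sf(F, 1, df_resid) > alpha:      # improvement not significant
            break
        selected.append(best)
        remaining.remove(best)
        rss_old = rss_new
    return selected
```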
10.2. Genetic algorithms

Genetic algorithms can also be used for variable selection. They were first proposed by Holland [81] and were introduced in chemometrics by Lucasius et al. [82] and Leardi et al. [83]. They have been applied, for instance, in multivariate calibration for the determination of certain characteristics of polymers [84] or of octane numbers [85]. Reviews of applications in chemistry can be found in [86,87]. There are several competing algorithms, such as simulated annealing [88] or the immune algorithm [89].

Genetic algorithms are general optimisation tools aiming at selecting the fittest solution to a problem. Suppose that, to keep it simple, 9 variables are measured. Possible solutions are represented in figure 12. Selected variables are indicated by a 1, non-selected variables by a 0.

Fig. 12. A set of solutions (chromosomes) for feature selection from nine variables for MLR.

Such solutions are sometimes called chromosomes, in analogy with genetics. A set of such solutions is obtained by random selection (several hundred chromosomes are often generated in real applications). For each solution an MLR model is built using an equation such as (1), and the sum of squares of the residuals of the objects towards that model is determined. One says that the fitness of each solution is determined: the smaller the sum of squares, the better the model describes the data and the fitter the corresponding solution is. Then follows what is described as the selection of the fittest (leading to names such as genetic algorithms or evolutionary computation). For instance, out of, say, 100 original solutions, the 50 fittest are retained. They are called the parent generation. From these a child generation is obtained by reproduction and mutation. Reproduction is explained in figure 13. Two randomly chosen parent solutions produce two child solutions by cross-over. The cross-over point is also chosen randomly. The first part of solution 1 and the second part of solution 2 together yield child solution 1'. Solution 2' results from the first part of solution 2 and the second part of solution 1. The child solutions are added to the selected parent solutions to form a new generation. This is repeated for many generations, and the best solution from the final generation is retained.

Fig. 13. Genetic algorithms: the reproduction (mating) step. The cross-over point is indicated by the * symbol.

Each generation is additionally submitted to mutation steps. Randomly chosen bits of the solution string are changed here and there (0 to 1 or 1 to 0). This is illustrated in figure 14. The need for the mutation step can be understood from figure 12. Suppose that the best solution is close to one of the child solutions in that figure, but should not include variable 9. Because the value for variable 9 is 1 in both parents, it is also unavoidably 1 in the children. Mutation can change this and move the solutions in a better direction.

Fig. 14. Genetic algorithms: the mutation step. The mutation point is indicated by the * symbol.
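To make the above concrete, a toy genetic algorithm for MLR variable selection is sketched below, assuming NumPy; the population size, mutation rate, number of generations and the random data are illustrative assumptions, and real implementations are considerably more refined.

```python
# A toy genetic algorithm for MLR variable selection: fitness is the
# residual sum of squares of the MLR model on the selected variables.
import numpy as np

rng = np.random.default_rng(4)

def fitness(mask, X, y):
    """Residual sum of squares (lower = fitter); empty selections are penalised."""
    if not mask.any():
        return np.inf
    A = np.column_stack([np.ones(len(y)), X[:, mask]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

def ga_select(X, y, pop_size=100, generations=50, mutation_rate=0.01):
    p = X.shape[1]
    pop = rng.random((pop_size, p)) < 0.5             # random initial chromosomes
    for _ in range(generations):
        scores = np.array([fitness(c, X, y) for c in pop])
        parents = pop[np.argsort(scores)[: pop_size // 2]]   # selection of the fittest
        children = []
        while len(children) < pop_size - len(parents):
            i, j = rng.choice(len(parents), size=2, replace=False)
            cut = rng.integers(1, p)                  # random cross-over point
            children.append(np.concatenate([parents[i][:cut], parents[j][cut:]]))
        pop = np.vstack([parents] + children)
        flip = rng.random(pop.shape) < mutation_rate  # mutation: flip a few bits
        pop = np.logical_xor(pop, flip)
    scores = np.array([fitness(c, X, y) for c in pop])
    return pop[np.argmin(scores)]                     # best chromosome of final generation

best_mask = ga_select(rng.random((40, 9)), rng.random(40))
```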
To see why mutation is needed, suppose that the best solution is close to one of the child solutions in figure 12, but should not include variable 9. However, because the value for variable 9 is 1 in both parents, it is also unavoidably 1 in the children. Mutation can change this and move the solutions in a better direction.

Fig. 14. Genetic algorithms: the mutation step. The mutation point is indicated by the * symbol.

11. Model optimisation and validation

11.1. Training, optimisation and validation

The determination of the optimal complexity of the model (the number of variables that should be included in the model) requires the estimation of the prediction error that can be reached. Ideally, a distinction should be made between training, optimisation and validation. Training is the step in which the regression coefficients are determined for a given model. In MLR, this means that the b-coefficients are determined for a model that includes a given set of variables. Optimisation consists in comparing different models and deciding which one gives the best prediction. Validation is the step in which the prediction with the chosen model is tested independently. In practice, as we will describe later, because of practical constraints on the number of samples and/or time, fewer than three steps are often included. In particular, analysts rarely make a distinction between optimisation and validation, and the term validation is then sometimes used for what is essentially an optimisation. While this is acceptable to some extent, in no case should the three steps be reduced to one. In other words, it is not acceptable to draw conclusions about optimal models and/or quality of prediction using only a training step. The same data should never be used for training, optimising and validating the model. If this is done, it is possible and even probable that an overfit of the model will occur, and the prediction error obtained in this way may be over-optimistic. Overfitting is the result of using a too complex model. Consider a univariate situation in which three samples are measured. The y = f(x) model really is linear (first order), but the experimenter decides to use a quadratic model instead. The training step will yield a perfect result: all points are exactly on the line. If, however, new samples are predicted, then the performance of the quadratic model will be worse than the performance of the linear one.

11.2. Measures of predictive ability

Several statistics are used for measuring the predictive ability of a model. The prediction error sum of squares, PRESS, is computed as:

PRESS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2    (37)

where y_i is the actual value of y for object i, \hat{y}_i the y-value for object i predicted with the model under evaluation, e_i the residual for object i (the difference between the predicted and the actual y-value) and n the number of objects for which \hat{y} is obtained by prediction. The mean squared error of prediction (MSEP) is defined as the mean value of PRESS:

MSEP = PRESS / n = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} e_i^2    (38)

Its square root is called the root mean squared error of prediction, RMSEP:

RMSEP = \sqrt{MSEP} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (39)

All these quantities give the same information.
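As a small illustration of equations (37)-(39), the three statistics can be computed from a vector of reference values and a vector of predicted values (Python/NumPy sketch, used purely for illustration; the array names are hypothetical):

import numpy as np

def prediction_errors(y, y_hat):
    """PRESS, MSEP and RMSEP as defined in equations (37)-(39)."""
    e = np.asarray(y) - np.asarray(y_hat)   # residuals e_i
    press = float(e @ e)                    # prediction error sum of squares
    msep = press / e.size                   # mean squared error of prediction
    rmsep = np.sqrt(msep)                   # root mean squared error of prediction
    return press, msep, rmsep

# example with arbitrary numbers
press, msep, rmsep = prediction_errors([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])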
In the chemometrics literature it seems that RMSEP values are preferred, partly because they are given in the same units as the y-variable. 11.3. Optimisation The RMSEP is determined for different models. For instance, with stepwise selection, a models can be built using a t-test significance level of 1%, and another a t-test significance level of 5%. With genetic algorithms, various models can be obtained with different numbers of variable. The result can be presented as a plot showing RMSEP as a function of the number of variables and is called the RMSEP curve. This curve often shows an intermediate minimum and the number of variables for which this occurs is then considered to be the optimal co mplexity of the model. This can be a way of optimising the output of stepwise selection procedure (optimising the number of variables retained). A problem which is sometimes encountered is that the global minimum is reached for a model with a very high complexity. A more parsimonious model is often more robust (the parsimonity principle). Therefore, it has been proposed to use the first local minimum or a deflection point is used instead of the global minimum. If there is only a small difference between the RMSEP of the minimum and a model with less complexity, the latter is often chosen. The decision on whether the difference is considered to be small is often based on the experience of the analyst. We can also use statistical tests that have been developed to decide whether a more parsimonious model can be considered statistically equivalent. In that case the more parsimonious model should be preferred. An F-test [90,91] or a randomisation t-test [92] have been proposed for this purpose. The latter requires less statistical assumptions about data and model properties, and is probably to be preferred. However in practice it does not always seem to yield reliable results. 11.4. Validation The model selected in the optimisation step is applied to an independent set of samples and the yvalues (i.e. the results obtained with the reference method) and ŷ -values (the results obtained with multivariate calibration) are compared. An example is shown in figure 15. The interpretation is usually done visually : does the line with slope 1 and intercept 0 represent the points in the graph sufficiently well ? It is necessary to check whether this is true over the whole range of concentrations (non- 148 Chapter 2 – Comparison of Multivariate Calibration Methods linearity) and for all meaningful groups of samples, e.g. for different clusters. If a situation is obtained when most samples of a cluster are found at one side of the line, a more complex modelling method (e.g. locally weighted regression [31, 46]) or a model for each separate cluster of samples may yield better results. Fig. 15. The measured property (y) plotted against the predicted values of the property(yhat). Sometimes a least squares regression line between y and ŷ is obtained and a test is carried out to verify that the joint confidence interval contains slope = 1 and intercept = 0 [93]. Similarly a paired t-test between y and ŷ values can be carried out. This does not obviate, however, the need for checking nonlinearity or looking at individual clusters. An important question is what RMSEP to expect ? If the final model is correct, i.e. there is no bias, then the predictions will often be more precise than those obtained with the reference method [94,10,95], due to the averaging effect of the regression. 
However, this cannot be proved from measurements on validation samples, the reference values of which were obtained with the reference method. The RMSEP value is limited by the precision (and accuracy) of the reference method. For that reason, RMSEP can be applied at the optimisation stage as a kind of target value. An alternative way of deciding on model complexity therefore is to select the lowest complexity which leads to an RMSEP value comparable to the precision of the reference method. 149 New Trends in Multivariate Analysis and Calibration 11.5. External validation In principle, the same data should not be used for developing, optimising and validating the model. If we do this, it is possible and even probable that we will overfit the model and prediction errors obtained in this way may be over-optimistic. Terminology in this field is not standardised. We suggest that the samples used in the raining step should be called the training set, those that are used in optimisation the evaluation set and those for the validation the validation set. Some multivariate calibration methods require three data sets. This is the case when neural nets are applied (the evaluation set is then usually called the monitoring set). In PCR and related methods, often only two data sets are used (external validation) or, even only one (internal validation). In the latter case, the existence of a second data set is simulated (see further chapter 11.6). We suggest that the sum of all sets should be called the calibration set. Thus the calibration set can consist of the sum of training, evaluation and validation sets, or it can be split into a training and a test set, or it can serve as the single set applied in internal validation. Applied with care, external and internal validation methods will warn against overfitting. External validation uses a completely different group of samples for prediction (sometimes called the test set) from the one used for building the model (the training set). Care should be taken that both sample sets are obtained in such a way that they are representative for the data being investigated. This can be investigated using the measures described for representativity in chapter 8. One should be aware that with an external test set the prediction error obtained may depend to a large extent on how exactly the objects are situated in space in relationship to each other. It is important to repeat that, in the presence of measurement replicates, all of them must be kept together either in the test set or in the training set when data splitting is performed. Otherwise, there is no perturbation, nor independence, of the statistical sample. The preceding paragraphs apply when the model is developed from samples taken from a process or a natural population. If a model was created with artificial samples with y-values outside the expected range of y-values to be determined, for the reasons explained in chapter 8, then the test set should contain only samples with y-values in the expected range. 150 Chapter 2 – Comparison of Multivariate Calibration Methods 11.6. Internal validation One can also apply what is called internal validation. Internal validation uses the same data for developing the model and validating it, but in such a way that external validation is simulated. A comparison of internal validation procedures usually employed in spectrometry is given in [96]. Four different methodologies were employed: a. Random splitting of the calibration set into a training and a test set. 
The splitting can then have a large influence on the obtained RMSEP value.

b. Cross-validation (CV), where the data are randomly divided into d so-called cancellation groups. A large number of cancellation groups corresponds to validation with a small perturbation of the statistical sample, whereas a small number of cancellation groups corresponds to a heavy perturbation. The term perturbation is used to indicate that the data set used for developing the model at this stage is not the same as the one developed with all calibration objects, i.e. the one which will be applied in chapters 13 and 14. Too small a perturbation means that overfitting is still possible. The validation procedure is repeated as many times as there are cancellation groups. At the end of the validation procedure each object has been once in the test set and d-1 times in the training set. Suppose there are 15 objects and 3 cancellation groups, consisting of objects 1-5, 6-10 and 11-15. We mentioned earlier that the objects should be assigned randomly to the cancellation groups, but for ease of explanation we have used the numbering above. The b-coefficients of the model being evaluated are first determined for the training set consisting of objects 6-15, and objects 1-5 function as test set, i.e. they are predicted with this model. The PRESS is determined for these 5 objects. Then a model is made with objects 1-5 and 11-15 as training set and 6-10 as test set and, finally, a model is made with objects 1-10 in the training set and 11-15 in the test set. Each time the PRESS value is determined and eventually the three PRESS values are added, to give a value representative for the whole data set (PRESS values are more appropriate here than RMSEP values, because PRESS values are variances and therefore additive).

c. Leave-one-out cross-validation (LOO-CV), in which the test sets contain only one object (d = n). Because the perturbation of the model at each step is small (only one object is set aside), this procedure tends to overfit the model. For this reason the leave-more-out methods described above may be preferable. The main drawback of LOO-CV is that the computation is slow, because a model has to be developed for each object.

d. Repeated random splitting (repeated evaluation set method, RES) [96]. The procedure described in a. is repeated many times. In this way, at the end of the validation procedure, one hopes that each object has been in the test set several times with different companions. Stable results are obtained after repeating the procedure many times (even hundreds of times). To have a good picture of the prediction error, low and high percentages of objects in the evaluation set have to be used.

12. Random correlation

12.1. The Random Correlation issue

Fig. 16. The 16 wavelengths selected by the Stepwise selection method for a random (20 x 100) spectral matrix and a random (1 x 20) concentration vector.

Let us consider a simulated X spectral matrix made of 20 spectra with 100 wavelengths, filled with random values between 0 and 100, and a y vector of 20 random values between 0 and 10. Stepwise selection applied to such a data set will, surprisingly, sometimes retain a certain number of variables (Fig. 16). If cross-validation is performed to validate the obtained model, the RMSECV results can even suggest that the model is very efficient in predicting y (table 1).
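The effect can be reproduced with a few lines of code. The following sketch (Python with NumPy and SciPy, purely illustrative, with arbitrary names) generates a random 20 x 100 "spectral" matrix and a random y vector, and counts how many variables pass the significance test of the first forward-selection step purely by chance; with 100 candidate variables, several usually do.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(20, 100))   # 20 random "spectra", 100 "wavelengths"
y = rng.uniform(0, 10, size=20)           # 20 random "concentrations"

n = len(y)
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
# t-statistic of the correlation coefficient (equivalent to testing b = 0)
t = r * np.sqrt((n - 2) / (1 - r ** 2))
p = 2 * stats.t.sf(np.abs(t), df=n - 2)

print("variables significant at 5%:", int(np.sum(p < 0.05)))
print("variables significant at 1%:", int(np.sum(p < 0.01)))

Once such a variable has entered the model, the following forward steps tend to find further chance-correlated variables, and the cross-validated error keeps decreasing, as in table 1.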
This phenomenon is common for stepwise variable selection applied to noisy data. It has already been described [97,98], and is referred to as random correlation or chance correlation.

Table 1. Stepwise-MLR calibration results (RMSECV) obtained for a random (20 x 100) spectral matrix and 3 different random (1 x 20) concentration vectors. In most cases the method finds chance-correlated variables and the model is built on those variables only.

                    α = 1%                          α = 5%
                RMSECV      # variables         RMSECV      # variables
Y matrix # 1    2.0495      2                    0.1434      12
Y matrix # 2    no result   no variable          0.0702      14
                            correlated
Y matrix # 3    2.0652      2                    0.0041      16

12.2. Random Correlation on real data

This phenomenon is illustrated here in a spectacular manner on simulated data, but it must be noted that it can also happen on real spectroscopic data. For instance, a model is built relating Raman spectra of five-compound mixtures [99] to the concentration of one of these compounds (called MX). Figure 17 shows the variables retained to model the MX product. The selected variables are represented by stars on the spectrum of a typical mixture containing equivalent quantities of the 5 products. The RMSECV is found to be suspiciously low compared to the RMSECV of the univariate model built using only the first selected variable (maximum of the MX peak).

Fig. 17. Wavelengths selected by the Stepwise selection method for the MX model, and order of selection of those variables, displayed on the spectrum of a typical mixture containing all of the 5 components.

The variable selection does not seem correct. The first variable is, as expected, retained on the maximum of the MX peak, but all the other variables are selected in uninformative parts of the spectrum. The correlation coefficients of these variables with y are nevertheless quite high (table 2).

Table 2. Model built with Stepwise selection for meta-xylene (first 17 variables only). The correlation coefficient with y and the regression coefficient of each selected variable are also given.

Order of selection        1       2       3       4       5       6       7       8       9
Index of variable         398     46      477     493     63      45      14      80      463
Correlation coefficient   0.998   -0.488  0.221   0.134   -0.623  -0.122  0.565   -0.69   0.09
Regression coefficient    0.030   -4.47   1.50    0.97    -3.15   -1.36   3.26    -3.01   0.35

Order of selection        10      11      12      13      14      15      16      17      …
Index of variable         47      94      425     77      442     90      430     423     115
Correlation coefficient   -0.4    -0.599  0.953   -0.67   0.61    -0.54   0.94    0.95    -0.39
Regression coefficient    -3.41   -1.59   0.80    -3.32   1.79    -1.57   0.96    0.77    -0.27

These variables also happen to have high associated regression coefficients in the model. As a result, even though the Raman intensity at those wavelengths is quite low (points located in the baseline), they take on a significant importance in the model. Using the regression coefficient obtained for a particular variable and the average Raman intensity at the corresponding wavelength, it is possible to evaluate the weight this variable has in the MLR model (table 3); a minimal sketch of this calculation is given below. One can see that the relative importance of variable number 80 (selected in eighth position) is about one third of the importance of the first selected variable. This is the reason why the last selected variables are still considered important by the selection procedure and lead to a dramatic improvement of the RMSECV.
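The weight estimate used in table 3 is simply the regression coefficient of a variable multiplied by the average Raman intensity at that wavelength (e.g. 0.0298 x 1029.2 = 30.67 for variable 398). A minimal sketch (Python/NumPy, with hypothetical names), assuming the fitted coefficients and the calibration spectra are available:

import numpy as np

def variable_weights(b, X, selected):
    """Approximate contribution of each selected wavelength to the MLR prediction:
    regression coefficient times average intensity at that wavelength."""
    b = np.asarray(b)
    mean_intensity = X[:, selected].mean(axis=0)
    return b[selected] * mean_intensity

# e.g. weights = variable_weights(b_coefficients, raman_spectra, [398, 493, 80])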
In this particular case, this improvement in RMSECV is not the sign of a better model; rather, it shows the failure of stepwise selection combined with cross-validation.

Table 3. Evaluation of the relative importance of selected variables in the MLR model built with Stepwise variable selection for meta-xylene.

Order of selection   Index of variable   Correlation coefficient   Regression coefficient   Raman intensity   Weight in the model
1                    398                 0.9981                    0.0298                   1029.2            30.67
4                    493                 0.1335                    0.9663                   8.01              7.74
8                    80                  -0.69                     -3.01                    3.41              -10.26

12.3. Avoiding Random Correlation

Stepwise selection is known to be often subject to random correlation when applied to noisy data. It must be noted that this phenomenon can also happen with more sophisticated variable selection methods like Genetic Algorithms [100,99]. Occurrence of random correlation has even been reported with latent variable methods like PCR or PLS [98]. When using variable selection methods, one therefore has to be extremely careful in the interpretation of the cross-validation results. This shows the necessity of external validation, since a model built on chance-correlated variables would see its performance deteriorate considerably when tested on an external test set. The most efficient way to eliminate chance correlation on spectroscopic data is to denoise the spectra. Methods such as Fourier or wavelet filtering (see chapter 3) have proven efficient for this purpose. A modified version of the stepwise algorithm was also proposed to reduce the risk of random correlation [99]. The main idea is the same as in Stepwise: the forward selection and backward elimination steps are maintained. The difference lies in the fact that each time a variable xj is selected for entry in the model, an iterative process begins:

• A new variable is built. This variable xj1 is made of the average Raman scattering value of a 3-point window centred on xj (from xj-1 to xj+1). If xj1 yields a higher PCC than xj, it becomes the new candidate variable.

• A second new variable, xj2 (average Raman scattering value of points xj-2 to xj+2), is built, it is compared with xj1, and the process goes on.

• When the enlargement of the window does not lead to a variable xj(n+1) with a better PCC than xjn, the method stops and xjn enters the model.

Selecting a (2n+1)-point spectral window instead of a single wavelength implies a local averaging of the signal. This should reduce the effect of noise in the prediction step. Moreover, as the first variables entered into the model (the most important ones) yield a better PCC, fewer uninformative variables should be retained, since the next best variables will not be able to improve the model significantly.

13. Outlying objects in the model

In Chapter 7 we explained how to detect possible outliers before the modelling, i.e. in the y- and/or x-space. When the model has been built, we should check again for the possibility that outliers in the Xy-space are present, i.e. objects that do not fit the true model well (outliers towards the model). The difficulty with this is that such outlying objects influence (bias) the model obtained, often to such an extent that it is not possible to see that the objects are outliers to the true model. Diagnostics based on the distance from the model obtained may therefore not be effective. Consider the univariate case of figure 18.
The outlier (*) to the true model attracts the regression line (exerts leverage), but cannot be identified as an outlier because its distance to the obtained regression line is not significantly higher than for some of the other objects. Object (*) is then called influential and one should therefore concentrate on finding such influential objects. Fig. 18. Illustration of the effect of an outlier (*) to the true model (---) influencing the regression line (___ ). There is another difficulty, that the presence of outliers can leads to the inclusion in the MLR model of additional variables taking the specific spectral features of the outlying spectrum into account. The outlier will then be masked, i.e. it will no longer be visible as a departure from the model. If possible outliers were flagged in the x-space (chapter 7), but it was decided not to reject them yet, one should first concentrate on these candidate outliers. MLR models should be made removing one of the outliers in turn, starting with the most suspect object. If the model obtained after deletion of the candidate outlier has a clearly lower RMSEP, or a similar RMSEP but a lower comple xity, the outlier should be removed. If only a few candidate outliers remain after this step (not more than 3) one can also look at MLR models in which each of the possible combinations of 2 or 3 outliers was removed. In this way one can detect outliers that are jointly influential. It should be noted however that a conservative approach should be adopted to the rejection of outliers. If one outlier and, certainly, if more than a few outliers are rejected we should consider whether perhaps there is something 157 New Trends in Multivariate Analysis and Calibration fundamentally wrong and review the whole process including the chemistry, the measurement procedure and the initial selection of samples. The next step is the study of residuals. A first approach is visual. One can make a plot of ŷ against y. If this is done for the final model, it is likely that, for the reasons outlined above, an outlier will not be visible. One way of studying the presence of influential objects, is therefore not to study the residuals for the final model but the residuals for the model with 1, 2, ..., a variables, because in this way we may detect outliers on specifics variables. If an object has a large residual on a model using, say, two variables, but a small residual when three or more variables are added, it is possible these extra variables are included in the model only to allow for this particular object. This object is then influential. We can provisionally eliminate the object, carry out MLR again and, if a more parsimonious model with at least equal predictive ability is reached, may decide to eliminate the object completely. Studying residuals from a model can also be done in a more formal way. To do this one predicts all calibration objects with the partial or full model and computes the residuals as the difference between the observed and the fitted value : e i = y i − ŷ i (40) where e i is the residual, yi the y-value and ŷ i the fitted y-value for object i. 
The residuals are often standardised by dividing e_i by the square root of the residual variance s^2:

s^2 = \frac{1}{n - p} \sum_{i=1}^{n} e_i^2    (41)

where p is the number of parameters estimated in the model. Object i has an influence on its own prediction (described by the leverage h_i, see chapter 7), and therefore some authors recommend using the internally studentized residuals:

t_i = \frac{e_i}{s \sqrt{1 - h_i}}    (42)

The externally studentized residuals, also called the jack-knifed or cross-validatory residuals, can also be used. They are defined as

t_{(i)} = \frac{e_i}{s_{(i)} \sqrt{1 - h_i}}    (43)

where s_(i) is estimated by computing the regression without object i and h_i is the leverage. For high leverages (h_i close to 1) t_i and t_(i) will increase and can therefore reach significance more easily. The computation of t_(i) requires a leave-one-out procedure for the estimation of s_(i), which is time consuming, so that the internally studentized version is often preferred. An observation is considered to be a large residual observation if the absolute value of its studentized residual exceeds 2.5 (the critical value at the 1% level of confidence, which is preferred to the 5% level of confidence, as is always the case when contemplating outlier rejection). The masking and swamping effects for multiple outliers that we described in chapter 7 for the x-space can also occur in regression. Therefore the use of robust methods is of interest. Robust regression methods are based on strategies that fit the majority of the data (sometimes called clean subsets). The resulting robust models are therefore not influenced by the outliers. Least median of squares, LMS [57,101], and the repeated median [102] have been proposed as robust regression techniques. After robust fitting, outliers are detected by studying the residuals of the objects from the robust model. The performance of these methods has been compared in [103]. Genetic algorithms or simulated annealing can be applied to select clean subsets according to a given criterion from a larger population. This led Walczak et al. to develop their evolution program, EP [104,105]. It uses a simplified version of a genetic algorithm to select the clean subset of objects, using minimisation of RMSEP as a criterion for the clean subset. The percentage of possible outliers in the data set must be selected in advance. The method allows the presence of up to 49% of outlying points, but the selection of such a high number risks the elimination of certain sources of variation from the clean subset and the model. The clean subset should therefore contain at least 90%, if not 95%, of the objects. Other algorithms based on the use of clean subset selection have been proposed by Hadi and Simonoff [106], by Hawkins et al. [107] and by Atkinson and Mulira [108]. Unfortunately none of these methods has been studied to such an extent that it can be recommended in practice. If a candidate outlier is found to have high leverage and also a high residual, using one of the above methods, it should be eliminated. High leverage objects that do not have a high standardised residual stabilise the model and should remain in the model. High residual, low leverage outliers will have a deleterious effect only if the residual is very high. If such outliers are detected then one should do what we described in the beginning of this chapter, i.e. try out MLR models without the suspected outliers.
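The quantities of equations (41)-(43) can be computed directly from the design matrix. The following sketch (Python/NumPy, purely illustrative, hypothetical names) computes the leverages as the diagonal of the hat matrix and the internally studentized residuals, and flags observations whose absolute studentized residual exceeds 2.5:

import numpy as np

def studentized_residuals(X, y):
    """Leverages and internally studentized residuals for an MLR model.
    X is assumed to already contain a column of ones if an intercept is used."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b                               # residuals, eq. (40)
    H = X @ np.linalg.pinv(X.T @ X) @ X.T       # hat matrix
    h = np.diag(H)                              # leverages h_i
    s2 = e @ e / (n - p)                        # residual variance, eq. (41)
    t = e / np.sqrt(s2 * (1 - h))               # internally studentized residuals, eq. (42)
    return h, t

# h, t = studentized_residuals(X_cal, y_cal)
# suspects = np.flatnonzero(np.abs(t) > 2.5)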
They should be rejected only if the model build without them has a clearly lower RMSEP or a similar RMSEP and lower complexity. 14. Using the model Once the final model has been developed, it is ready for use : the calibration model can be applied to spectra of new samples. It sho uld be noted that the data pre-processing and/or pre-treatment selected for the calibration model must also be applied to the new spectra and this must be done with the same parameters (e.g. same ideal spectrum for MSC, same window and polynomial size for Savitzky-Golay smoothing or derivation, etc.). For mean-centering or autoscaling, the mean and standard deviation used in the calibration stage must be used for in the pre-treatment of the new spectra. Although it is not the subject of this article, which is restricted to the development of a model, it should be noted that to ensure quality of the predictions and validity of the model, the application of the model over time also requires several applications of chemometrics. The following subjects should be considered : • Quality control : it must be verified that no changes have occurred in the measurement system. This can be done for instance by applying system suitability checks and by measuring the spectra of standards. Multivariate quality control charts can be applied to plot the measurements and to detect changes [109,110]. • Detection of outliers and inliers in prediction : the spectra must belong to the same population as the objects used to develop the calibration model. Outliers in concentration (outliers in y) can occur. Samples can also be different from the ones used for calibration, because they present sources of variance not taken into account in the model. Such samples are then outliers in X. In both cases, this leads to extrapolation outside the calibration space so that the results obtained are less accurate. MLR can be robust to slight extrapolation, but this is less true when non-linearity occurs. More extreme 160 Chapter 2 – Comparison of Multivariate Calibration Methods extrapolation will lead to unacceptable results. It is therefore necessary to investigate whether a new spectrum falls into the spectral domain of the calibration samples. As stated in chapter 7, we can in fact distinguish outliers and inliers. Outliers in y and in X can be detected by adaptations of the methods we described in Chapter 7. Inliers are samples which, although different from the calibration samples, lie within the calibration space. They are located in zones of low (or null) density within the calibration space: for instance, if the calibration set consists of two clusters, then an inlier can be situated in the space between the two clusters. If the model is non- linear, their prediction can lead to interpolation error. Few methods have been developed to detect inliers. One of them is the potential function method of Jouan-Rimbaud et al. (chapter 7) [61]. If the data set is known to be relatively homogeneous (by application of the methods of chapter 6), then it is not necessary to look for inliers. • Updating the models : when outliers or inliers were detected and it has been verified that no change has occurred in the measurement conditions, then one may consider adding the new samples to the calibration set. This makes sense only when it has been verified that the samples are either of a new type or an extension of the concentration domain and that it is expected that similar new samples can be expected in the future. 
Good strategies to perform this updating with a minimum of work, i.e. without having to take the whole extended data set through all the previous steps, do not seem to exist. • Correcting the models (or the spectra): when a change has been noticed in the spectra of the standards, for instance in a multivariate QC chart, and the change cannot be corrected by changes to the instrumental, this means that spectra or model must be corrected. When the change in the spectra is relatively small and the reason for it can be established [110], e.g. a wavelength shift, numerical correction is possible by making the same change to the spectra in the reverse direction. If this is not the case, it is necessary to treat the data as if they were obtained on another instrument and to apply methods for transfer of calibration from one instrument to another. A review about such methods is given in [111]. 15. Conclusions It will be clear from the preceding chapters that developing good multivariate calibration models requires a lot of work. There is sometimes a tendency to overlook or minimise the need for such a careful approach. The deleterious effects of outliers are not so easily observed as for univariate 161 New Trends in Multivariate Analysis and Calibration calibration and are therefore sometimes disregarded. Problems such as heterogeneity or nonrepresentativity can occur also in univariate calibration models, but these are handled by analytical chemists who know how to avoid or cope with such problems. When applying multivariate calibration, the same analysts may have too much faith in the power of the mathematics to worry about such sources of errors or may have difficulties in understanding how to tackle them. Some chemometricians do not have analytical backgrounds and may be less aware of the possibility that some sources of error can be present. It is therefore necessary that strategies should be made available for systematic method development that include the diagnostics and remedies required and that analysts should have a better comprehension of the methodology involved. It is hoped that this article will help to some degree in reaching this goal. As stated in the introduction, we have chosen to consider MLR, because it is easier to explain. This is an important advantage, but it does not mean that other methods have no other advantages. By performing MLR on the scores of a PCA model, PCR avoid the variable selection procedure. Partial least squares (PLS) and PCR usually give results of equal quality but PLS can be numerically faster when optimised algorithms such as SIMPLS [112] are applied. Methods that have been specifically developed for non-linear data, such as neural networks (NN), are superior to the linear methods whe n non-linearities do occur, but may be bad at predictions for outliers (and perhaps even inliers). Locally weighted regression (LWR) methods seem to perform very well for inhomogeneous data and for nonlinear data, but may require somewhat more calibration standards. In all cases however it is necessary to have strategies available that detect the need to use a particular type of method and that ensure that the data are such that no avoidable sources of imprecision or inaccuracy are present. REFERENCES [1] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. de Jong, P.J. Lewi, J. SmeyersVerbeke, Handbook of Chemometrics, Elsevier, Amsterdam, 1997. [2] N.R. Draper, H. Smith, Applied Regression Analysis, Wiley, New York, 1981. [3] J. 
Mandel, The Statistical Analysis of Experimental Data, Dover reprint, 1984, Wiley &Sons, 1964, New York. [4] D.L. MacTaggart, S.O. Farwell, J. Assoc.Off. Anal. Chem., 75, 594, 1992. 162 Chapter 2 – Comparison of Multivariate Calibration Methods [5] J.C. Miller, J.N.Miller, Statistics for Analytical Chemistry, Ellis Horwood, Chichester, 3rd ed., 1993. [6] R. De Maesschalck, F. Estienne, J. Verdú-Andrés, A. Candolfi, V. Centner, F. Despagne, D. Jouan-Rimbaud, B. Walczak, S. de Jong, O.E. de Noord, C. Puel, B.M.G. Vandeginste, D.L. Massart, Internet Journal of Chemistry, 2 (1999) 19. [7] F. Despagne, D.L. Massart, The Analyst, 123 (1998) 157R-178R. [8] URL : http://minf.vub.ac.be/~fabi/calibration/multi/pcr/. [9] URL : http://minf.vub.ac.be/~fabi/calibration/multi/nn/. [10] V. Centner, D.L. Massart, S. de Jong, Fresenius J. Anal. Chem. 361 (1998) 2-9. [11] S.D. Hodges, P.G. Moore, Appl. Stat. 21 (1972) 185-195. [12] S. Van Huffel, J. Vandewalle, The Total Least Squares Problem, Computational Aspects and Analysis, SIAM, Phiadelphia, 1988. [13] Statistics - Vocabulary and Symbols Part 1, ISO stand ard 3534 (E/F), 1993. [14] Accuracy (trueness and precission) of measurement methods and results, ISO standard 5725 16, 1994. [15] V. Centner, D.L. Massart, O.E. de Noord, Anal. Chim. Acta 330 (1996) 1-17. [16] B.G. Osborne, Analyst 113 (1988) 263-267. [17] P. Kubelka, Journal of the optical Society of America 38(5) (1948) 448-457. [18] A. Savitzky, M.J.E. Golay, Anal. Chem. 36 (1964) 1627-1639. [19] P.A. Gorry, Anal. Chem. 62 (1990) 570-573. [20] S.E. Bialkowski, Anal. Chem. 61 (1989) 1308-1310. [21] J. Steinier, Y. Termonia, J. Deltour, Anal. Chem. 44 (1972) 1906-1909. [22] P. Barak, Anal. Chem. 67 (1995) 2758-2762. [23] E. Bouveresse, Maintenance and Transfer of Multivariate Calibration Models Based on NearInfrared Spectroscopy, doctoral thesis, Vrije Universiteit Brussel, 1997. [24] C.H. Spiegelman, Calibration: a look at the mix of theory, methods and experimental data, presented at Compana, Wuerzburg, Germany, 1995. [25] W. Wu, Q. Guo, D. Jouan-Rimbaud, D.L. Massart, Using contrasts as a data pretreatment method in pattern recognition of multivariate data, Chemom. and Intell. Lab. Sys. (in press). [26] L. Pasti, D. Jouan-Rimbaud, D.L. Massart, O.E. de Noord, Anal. Chim. Acta. 364 (1998) 253263. 163 New Trends in Multivariate Analysis and Calibration [27] D. Jouan-Rimbaud, B. Walczak, D.L. Massart, R.J. Poppi, O.E. de Noord, Anal. Chem. 69 (1997) 4317-4323. [28] B. Walczak, D.L. Massart, Chem. Intell. Lab. Sys. 36 (1997) 81-94. [29] P. Geladi, D. MacDougall, H. Martens, Appl. Spectrosc. 39 (1985) 491-500. [30] T. Isaksson, T. Næs, Appl. Spectrosc. 42 (1988) 1273-1284. [31] T. Næs, T. Isaksson, B.R. Kowalski, Anal. Chem. 62 (1990) 664-673. [32] R.J. Barnes, M.S. Dhanoa, S.J. Lister, Appl. Spectrosc. 43 (1989) 772-777. [33] R.J. Barnes, M.S. Dhanoa, S.J. Lister, J. Near Infrared Spectrosc. 1 (1993) 185-186. [34] M.S. Dhanoa, S.J. Lister, R. Sanderson, R.J. Barnes, J. Near Infrared Spectrosc. 2 (1994) 43-47. [35] I.S. Helland, T. Naes, T. Isaksson, Chemom. Intell. Lab. Sys. 29 (1995) 233-241. [36] O.E. de Noord, Chemom. Intell. Lab. Sys. 23 (1994) 65-70. [37] M.B. Seasholtz, B.R. Kowalski, J. Chemom. 6 (1992) 103-111. [38] A. Garrido Frenich, D. Jouan-Rimbaud, D.L. Massart, S. Kuttatharmmakul, M. Martínez Galera, J.L. Martínez Vidal, Analyst 120 (1995) 2787-2792. [39] J.E. Jackson, A user's guide to principal components, John Wiley, New York, 1991. [40] E.R. 
Malinowski, Factor analysis in chemistry, 2nd. Ed., John Wiley, New York, 1991. [41] S. Wold, K. Esbensen and P. Geladi, Chemom. Intell. Lab. Syst. 2 (1987) 37-52. [42] K. Pearson, Mathematical contributions to the theory of evolution XIII. On the theory of contingency and its relation to association and normal correlation, Drapers Co. Res. Mem. Biometric series I, Cambridge University Press, London. [43] H. Hotelling, J. Educ. Psychol., 24 (1933) 417-441, 498-520. [44] D. Jouan-Rimbaud, B. Walczak, D.L. Massart, I.R. Last, K.A. Prebble, Anal. Chim. Acta 304 (1995) 285-295. [45] M. Meloun, J. Militký, M. Forina, Chemometrics for analytical chemistry. Vol. 1: PC-aided statistical data analysis, Ellis Horwood, Chic hester (England), 1992. [46] T. Næs, T. Isaksson, Appl. Spectr. 1992, 46/1 (1992) 34. [47] K. Szczubialka, J. Verdú-Andrés, D.L. Massart, Chemom. and Intell. Lab. Syst. 41 (1998) 145160. [48] B. Hopkins, Ann. Bot., 18 (1954) 213. [49] R.G. Lawson, P.J. Jurs, J. Chem. Inf. Comput. Sci. 30 (1990) 36-41. [50] Forina, M., Drava, G., Boggia, R., Lanteri, S., Conti, P., Anal. Chim. Acta, 295 (1994) 109. 164 Chapter 2 – Comparison of Multivariate Calibration Methods [51] F.E. Grubbs, G. Beck, Technometrics, 14 (1972) 847-854. [52] P.C. Kelly, J. Assoc. Off. Anal. Chem. 73 (1990) 58-64. [53] T. Næs, Chemom. Intell. Lab. Sys. 5 (1989) 155-168. [54] S. Weisberg, Applied linear regression, 2nd. Edition, John Wiley & Sons, New York, 1985. [55] B. Mertens, M. Thompson, T. Fearn, Analyst 119 (1994) 2777-2784. [56] A. Singh, Chemom. Intell. Lab. Sys. 33 (1996) 75-100. [57] P.J. Rousseeuw, A. Leroy, Robust regression and outlier detection, John Wiley, New York, 1987. [58] P.J. Rousseeuw, B.C. van Zomeren, J. Am. Stat. Assoc. 85 (1990) 633-651. [59] A.S. Hadi, J.R. Statist. Soc. B 54 (1992) 761-771. [60] A.S. Hadi, J.R. Statist. Soc. B 56 (1994) ?1-4?. [61] D. Jouan-Rimbaud, E. Bouveresse, D.L. Massart, O.E. de Noord, Anal. Chim. Acta, 388, 283301 (1999). [62] A. Lorber, B.R. Kowalski, J. Chemom. 2 (1988) 67-79. [63] K.I. Hildrum, T. Isaksson, T. Naes, A. Tandberg, Near infra-red spectroscopy; Bridging the gap between data analysis and NIR applications, Ellis Horwood, Chichester, 1992. [64] D. Jouan-Rimbaud, M.S. Khots, D.L. Massart, I.R. Last, K.A. Prebble, Anal. Chim. Acta 315 (1995) 257-266. [65] J. Ferré, F.X. Rius, Anal. Chem. 68 (1996) 1565-1571. [66] J. Ferré, F.X. Rius, Trends Anal. Chem. 16 (1997) 70-73. [67] R.W. Kennard, L.A. Stone, Technometrics 11 (1969) 137-148. [68] T. Næs, J. Chemom. 1 (1987) 121-134. [69] G. Puchwein, Anal. Chem. 60 (1988) 569-573. [70] D.E. Honigs, G.H. Hieftje, H.L. Mark, T.B. Hirschfeld, Anal. Chem. 57 (1985) 2299-2303. [71] ASTM, Standard practices for infrared, multivariate, quantitative analysys". Doc. E1655-94, in ASTM Annual book of standards, vol. 03.06, West Conshohochen, PA, USA, 1995. [72] T. Fearn, NIR news 8 (1997) 7-8. [73] R.D. Snee, Technometrics 19 (1977) 415-428. [74] D. Jouan-Rimbaud, D.L. Massart, C.A. Saby, C. Puel, Anal. Chim. Acta 350 (1997) 149-161. [75] D. Jouan-Rimbaud, D.L. Massart, C.A. Saby, C. Puel, Intell. Sys. 40 (1998) 129-144. [76] C.E. Miller, NIR News 4 (1993) 3-5. 165 New Trends in Multivariate Analysis and Calibration [77] P.J. Brown, J. Chemom. 7 (1993) 255-265. [78] Y.L. Xie, Y.Z. Liang, Z.G. Chen, Z.H. Huang, R.Q. Yu, Chemom. Intell. Lab. Sys. 27 (1995) 21-32. [79] H. Martens, T. Næs, Multivariate calibration, Wiley, Chichester, England, 1989. [80] R.D. Cook, S. 
Weisberg, Residuals and influence in Regression, Chapman and Hall, New York, 1982. [81] J.H. Holland, Adaption in Natural and Artificial Systems, University of Mic higan Press, Ann Arbor, MI, 1975, revised reprint, MIT Press, Cambridge, 1992. [82] C.B. Lucasius, M.L.M. Beckers, G. Kateman, Anal. Chim. Acta, 286 (1994) 135. [83] R. Leardi, R. Boggia, M. Terrile, J. Chemom., 6 (1992) 267. [84] D. Jouan-Rimbaud, D.L.Massart, R. Leardi, O.E. de Noord, Anal. Chem., 67 (1995] 4295. [85] Meusinger, R. Moros, Chemom. Intell. Lab. Systems, 46 (1999) 67. [86] P. Willet, Trends. Biochem, 13 (1995) 516. [87] D.H. Hibbert, Chemom. Intell. Lab. Syst., 19 (1993) 277. [88] J.H. Kalivas, J. Chemom., 5 (1991) 37. [89] X.G. Shao, Z.H. Chen, X.Q. Lin, Fresenius J. Anal. Chem., 366 (2000) 10. [90] D.M. Haaland, E.V. Thomas, Anal. Chem. 60 (1988) 1193-1202. [91] D.W. Osten, J. Chemom. 2 (1988) 39-48. [92] H. van der Voet, Chemom. Intell. Lab. Sys. 25 (1994) 313-323 & 28 (1995) 315. [93] J. Riu, F.X. Rius, Anal. Chem. 9 (1995) 343-391. [94] R. DiFoggio, Appl. Spectrosc. 49 (1995) 67-75. [95] N.M. Faber, M.J. Meinders, P. Geladi, M. Sjöström, L.M.C. Buydens, G. Kateman, Anal. Chim. Acta 304 (1995) 273-283. [96] M. Forina, G.Drava, R. Boggia, S. Lanteri, P. Conti, Anal. Chim. Acta 295 (1994) 109-118. [97] J. G. Topliss, R. J. Costello, Journal of Medicinal Chemistry 15 (1971) 1066. [98] J. G. Topliss, R. P. Edwards, Journal of Medicinal Chemistry 22 (1979) 1238. [99] F. Estienne, N. Zanier, P. Marteau, D.L. Massart, Analytica Chimica Acta, 424 (2000) 185-201. [100] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem. 67 (1995) 4295. [101] D.L. Massart, L. Kaufman, P.J. Rousseeuw, A.M. Leroy, Anal. Chim. Acta 187 (1986) 171179. [102] A.F. Siegel, Biometrika 69 (1982) 242-244. 166 Chapter 2 – Comparison of Multivariate Calibration Methods [103] Y. Hu, J. Smeyers-Verbeke, D.L. Massart, Chemom. Intell. Lab. Sys. 9 (1990) 31-44. [104] B. Walczak, Chemom. Intell. Lab. Sys. 28 (1995) 259-272. [105] B. Walczak, Chemom. Intell. Lab. Sys. 29 (1995) 63-73. [106] A.S. Hadi, J.S. Simonoff, J. Am. Stat. Assoc. 88 (1993) 1264-1272. [107] D.M. Hawkins, D. Bradu, G.V. Kass, Technometrics 26 (1984) 197-208. [108] A.C. Atkinson, H.M. Mulira, Statistics and computing 3 (1993) 27-35. [109] N.D. Tracy, J.C. Young, R.L. Mason, Journal of Quality Technology 24 (1992) 88-95. [110] E. Bouveresse, C. Casolino, Massart DL, Applied Spectroscopy 52 (1998) 604-612. [111] E. Bouveresse, D.L. Massart, Vibrational Spectroscop y 11 (1996) 3. [112] S. de Jong, Chem. Intell. Lab. Syst. 18 (1993) 251-263. 167 New Trends in Multivariate Analysis and Calibration CHAPTER III N EW TYPES OF D ATA : N ATURE OF THE D ATA SET Like chapter 2, this chapter focuses on multivariate calibration. The work presented here can be seen as a direct application of the guidelines and methodology developed in the previous chapter. It shows how an industrial process can be improved by proper use of chemometrical tools. A very interesting aspect of this work is that is was performed on Raman spectroscopic data, which is a new field of application for chemometrical methods. In the first paper in this chapter : “Multivariate calibration with Raman spectroscopic data : a case study”, it is shown how Multiple Linear Regression was found to be the most efficient method for this industrial application. The relatively poor quality of the data implied a huge effort on variable selection in order in particular to tackle the random correlation issue. 
Various approaches, including an innovative variable selection strategy suggested in chapter 2, were successfully tried. The second paper in this chapter : “Inverse Multivariate calibration Applied to Eluxyl  Raman data“ is an internal report written about the same industrial process. New measurements were performed after a new and more efficient Raman spectrometer was installed. Its quality completely changed the approach to be used on this data. Due to the improved signal/noise ratio, random correlation was no longer a problem. However, a slight non- linearity that could not be detected before became visible in the data, which implied the use of a non- linear method. Treating this high quality data with Neural Networks enabled to reach a quality in calibration never reached before on Eluxyl data. Apart from giving illustrations of the principles developed in chapter 2, this chapter shows the applicability and superiority of chemometrical methods applied to Raman data. This conclusion is striking since Raman data were typically considered as sufficiently straightforward not to necessitate any sophisticated approach. It is now proven that Raman data can benefit not only from data pre- 168 Chapter 3 – New Types of Data : Nature of the Data Set treatment, which was the only mathematical treatment considered necessary, but also from inverse multivariate calibration and such sophisticated methods as neural networks. 169 New Trends in Multivariate Analysis and Calibration M ULTIVARIATE CALIBRAT ION WITH RAMAN SPECTROSCOPIC DATA : A CASE STUDY Analytica Chimica Acta, 424 (2000) 185-201. F. Estienne and D.L. Massart * N. Zanier-Szydlowski Ph. Marteau ChemoAC, Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium. E-mail: fabi@fabi.vub.ac.be Institut Français du Pétrole (I.F.P.), 1-4 Avenue du Bois Préau, 92506 Rueil-Malmaison France Université Paris Nord, L.I.M.P.H., Av. J.B. Clément, 93430 Villetaneuse France ABSTRACT An industrial process separating p-xylene from mainly other C 8 aromatic compounds is monitored with an online remote Raman analyser. The concentrations of six constituents are currently evaluated with a classical calibration method. The aim of the study being to improve the precision of the monitoring of the process, inverse calibration linear methods were applied on a synthetic data set, in order to evaluate the improvement in prediction such methods could yield. Several methods were tested including Principal Component Regression with Variable Selection, Partial Least Square Regression or Multiple Linear Regression with variable selection (Stepwise or based on Genetic Algorithm). Methods based on selected wavelengths are of great interest because the obtained models can be expected to be very robust toward experimental conditions. However, because of the important noise in the spectra due to short accumulation time, variable selection methods selected a lot of irrelevant variables by chance correlation. Strategies were investigated to solve this problem and build reliable robust models. These strategies include the use of signal pre-processing (smoothing and filtering in the Fourier or Wavelets domain), and the use of an improved variable selection algorithm based on the selection of spectral windows instead of single wavelengths when this leads to a better model. The best results were achieved with Multiple Linear Regression and Stepwise variable selection applied to spectra denoised in the Fourier domain. 
* Corresponding author K EYWORDS : Chemometrics, Raman Spectroscopy, Multivariate Calibration, random correlation. 170 Chapter 3 – New Types of Data : Nature of the Data Set 1 - Introduction The Eluxyl process separates para- xylene from other C 8 aromatic compounds (ortho and meta-xylene, and either para-di-ethylbenzene or toluene used as solvent) by simulated moving bed chromatography [1]. The evolution of the process is monitored online using a Raman analyser equipped with optical fibres. The Raman scattering studied is in the visible range and is collected on a 2-dimensional Charge Coupled Device (CCD) detector that allows true simultaneous recordings. The Raman technique gives access to the fundamental vibrations of molecules by using either a visible or a near-IR excitation. This allows an easy attribution of the vibrational bands and the possibility to use classical calibration methods for quantitative analysis in non-complex mixtures. Nevertheless, taking into account small quantities (< 5 %) of impurities (i.e. C9 + compounds), the classical calibration method is naturally limited in precision if all the impurities are not clearly identified in the spectrum. The scope of this paper is to evaluate the improvement that could be achieved in terms of precision of the quantification by us ing inverse calibration methods. The work presented here is at the stage of a feasibility study aiming at showing that inverse calibration should be applied later on the industrial installations. Synthetic samples were therefore studied using a laboratory instrument. In order not to overestimate the possible improvements obtained, the study has been performed in the wavelength domain currently used and optimised for the classical calibration method. Moreover, the synthetic samples contained no impurities, leading to a situation optimal for the direct calibration method. It can therefore be expected that any improvement achieved in these conditions would be even more appreciable on the real industrial process. It is also important to evaluate which inverse calibration method is the most efficient, so that the implementation of the new system on the industrial process can be performed as quickly as possible. 171 New Trends in Multivariate Analysis and Calibration 2 – Calibration Methods Bold upper-case letters (X) stand for matrices, bold lower-case letters (y) stand for vectors, and italic lower-case letters (h) stand for scalars. 2.1 - Comparison of classical and inverse calibration The main assumption when building a classical calibration model to determine concentrations from spectra is that the error lies in the spectra. The model can be seen as : Spectra = f (Concentrations). Or, in a matrix form : R=C.K+E (1) where R is the spectral response matrix, C the concentration matrix, K the matrix of molar absorptivities of the pure components, and E the error matrix. This implies that it is necessary to know all the concentrations in order to build the model, if a high precision is required. In inverse calibration, one assumes that the error lies in the measurement of the concentrations. The model can be seen as : Concentrations = f (Spectra). Or, in a matrix form: C=P.R+E (2) where R is the spectral matrix, C the concentration matrix, P the regression coefficients matrix, and E the error matrix. A perfect knowledge about the composition of the system is then not necessary. 
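The difference between the two formulations can be made concrete with a small least-squares sketch (Python/NumPy, purely illustrative; the matrix names follow equations (1) and (2), but the orientation is samples-in-rows, so the products appear transposed with respect to those equations, and everything else is an assumption). In the classical model the pure-component responses K are estimated from known concentrations, whereas in the inverse model the regression coefficients P are estimated directly from the spectra:

import numpy as np

# R: (samples x wavelengths) spectra, C: (samples x components) concentrations
def classical_calibration(R, C):
    """Classical model R = C K + E: estimate K, then predict C for new spectra."""
    K, *_ = np.linalg.lstsq(C, R, rcond=None)            # estimated pure responses
    predict = lambda R_new: np.linalg.lstsq(K.T, R_new.T, rcond=None)[0].T
    return K, predict

def inverse_calibration(R, C):
    """Inverse model C = R P + E: estimate the regression coefficients P directly."""
    P, *_ = np.linalg.lstsq(R, C, rcond=None)
    predict = lambda R_new: R_new @ P
    return P, predict

With more wavelengths than samples the inverse least-squares problem is underdetermined, which is exactly why the variable selection and latent-variable methods of the following sections are needed.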
2.2 - Method currently used for the monitoring The concentrations are currently evaluated using a software [2] implementing a classical multivariate calibration method based on the measurement of the areas of the Raman peaks. It is assumed that there is a linear relationship between Raman intensity and the molar density of a substance. The Raman 172 Chapter 3 – New Types of Data : Nature of the Data Set intensity collected also depends on other factors (excitation frequency, laser intensity, etc…), but those factors are the same for all of the bands in a spectrum. It is therefore necessary to work with relative concentrations for the substances. The relative concentration of a molecule j in a mixture including n types of molecules is obtained by calculating : cj = p j/ s j (3) n ∑ i=1 p i /s i where pj is the theoretical integrated intensity of the Raman line due specifically to the molecule j, and σ j the relative cross section of this molecule. The cross section of a molecule represents the fact that different molecules, even when studied at the same concentration, can induce Raman scattering with different intensity. The measured intensity mj of a peak is also due to the contribution of peaks from other molecules. For the method to take overlapping between peaks into account, the theoretical pj values must therefore be deduced from the experimentally measured integrated intensities m j (Fig. 1). The following system has to be solved : a11 p1 +a21 p2 + a31 p3 + … + a i1 pi = m1 a12 p1 + a22 p2 + a32 p3 + … + a i2 pi = m2 … (4) a1j p1 + a2j p2 + a3j p3 + … + a ij pi = mn where the aij coefficients represent the contribution of the th i molecule on the integrated frequency domain corresponding to the jth molecule (Fig. 1). The aij coefficients are deduced from the Raman spectra of pure components as being the ratio between the integrated intensity in the frequency domains of the jth and ith molecules respectively. The aii coefficients are equal to 1. The system (4) can be written in a matrix form as : 173 New Trends in Multivariate Analysis and Calibration K . P = M → P = K −1 . M (5) The integrated intensities m of the matrix M were measured over frequency domains of 7 cm-1 centered on the maximum of the peaks (Fig. 1). This is of the order of their width at half height. The maxima have therefore to be determined before the calculation can be performed. The spectra of the five pure products are used for this purpose. The relative scattering cross-sections σ j are obtained from the spectra of binary equimolar mixtures of each of the molecules with one taken as a reference. Here, toluene is taken as a reference, this leads to : σ toluene = 1 σ j = σ (j / toluene) = pj / ptoluene (6) Once the p and σ values are known, the concentrations are obtained using equation (5). A more detailed description of the method is available in [2]. Fig. 1. Measured intensity mOX of the meta-xylene peak on the spectrum of a single component sample. The contribution of the meta-xylene peak under the ortho-xylene peak aMX/OX is also represented. The 7 cm-1 integration domains are filled in grey. 174 Chapter 3 – New Types of Data : Nature of the Data Set 2.3 - Stepwise Multiple Linear Regression (Stepwise MLR) Stepwise Multiple Linear Regression [3] is an MLR with variable selection. Stepwise selection is used to select a small subset of variables from the original spectral matrix X. The first variable xj entered in the model is the most correlated to the property of interest y . 
The regression coefficient b obtained from the univariate regression model relating xj to y is tested for significance using a t- test at the considered critical level α = 1 or 5 %. The next step is forward selection. This consists in including in the model the variable xi that yields the highest Partial Correlation Coefficient (PCC). The inclusion of a new variable in the model can decrease the contribution of a variable already included and make it non-significant. After each inclusion of a new variable, the significance of the regression terms (bi Xi) already in the model is therefore tested, and the non-significant terms are eliminated from the equation. This is the backward elimination step. Forward selection and backward elimination are repeated until no improvement of the model can be achieved by including a new variable, and all the variables already included are significant. Stepwise variable selection method is known for sometimes selecting uninformative variables because of chance correlation to the property of interest. This can occur when the method is applied to noisy signals. In order to reduce this risk, a modified version of this algorithm was proposed. The main idea is the same as in Stepwise, the forward selection and backward elimination steps are maintained. The difference lies in the fact that each time a variable xj is selected for entry in the model, an iterative process begins : • A new variable is built. This variable xj1 is made of the average Raman scattering value of a 3- point window centred on xj (from xj-1 to xj+1 ). If xj1 yields a higher PCC than xj, it becomes the new candidate variable. • A second new variable, xj2 (average Raman scattering value of points xj-2 to xj+2) is built, it is compared with xj1 , and the process goes on. • When the enlargement of the window does not lead to a variable xj(n+1) with a better PCC than xjn , the method stops and xjn enters the model. Selecting an (2n+1)-points spectral window instead of a single wavelength implies a local averaging of the signal. This should reduce the effect of noise in the prediction step. Moreover, as the first variables 175 New Trends in Multivariate Analysis and Calibration entered into the model (most important ones) yield a better PCC, less uninformative variables should be retained. 2.4 - MLR with selection by Genetic Algorithm (GA MLR) Genetic Algorithms (GA) are used here to select a small subset of original variables in order to build an MLR model [4]. A population of k strings (or chromosomes) is randomly chosen from the original predictor matrix X. The chromosomes are made of genes (or bitfields) representing the parameters to optimise. In the case of variable selection, each gene is made of a single bit corresponding to an origina l variable. The fitness of each string is evaluated in terms of Root Mean Squared Error of Prediction, defined as : RMSEP = nt ∑ ( ŷ i − yi) / n 2 (7) t i =1 where nt is the number of objects in the test set, yi the known value of the property of interest for object i, and yˆ i the value of the property of interest predicted by the model for object i. With a probability depending on their fitness, pairs of strings are selected to undergo cross-over. Crossover is a GA operator consisting in mixing the information contained in two existing (parent) strings to obtain new (children) strings. In order to enable the method to escape a possible local minimum, a second GA operator, mutation, is introduced with a much lower probability. 
This means that each bit in the children strings may be randomly changed. In the algorithm used here [5], the children strings may replace members of the population of parent strings yielding a worse fit. This whole procedure is called a generation. It is iterated unt il convergence to a good solution is reached. In order to improve the variable selection, a backward elimination was added to ensure that all the selected variables are relevant for the model. The principle is the same as the backward elimination step in the Stepwise variable selection method. 176 Chapter 3 – New Types of Data : Nature of the Data Set This method requires as input parameters the number of strings in each generation (size of the population), the number of variables in each string (number of genes per chromosome), the frequency of cross-over, mutatio ns and backward elimination, and the number of generations. 2.5 - Principal Component Regression with variable selection (PCR VS) This method includes two steps. The original data matrix X(n,p) is approximated by a small set of orthogonal Principal Components (PCs) T(n,a). A Multiple Linear Regression model is then built relating the scores of the PCs (independent variables) to the property of interest y(n,1) . The main difficulty of this method is to choose the number of PCs that have to be retained. This was done here by means of Leave One Out (LOO) Cross Validation (CV). The predictive ability of the model is estimated at several complexities (models including 1,2, … etc PCs) in terms of Root Mean Square Error of Cross Validation (RMSECV). RMSECV is defined as RMSEP (equ. 7) when yˆ i is obtained by cross validation. The complexity leading to the smallest RMSECV is considered as optimal in a first approach. In a second step, in order to avoid overfitting, more parsimonious models (smaller complexities, one or more of the last selected variables are removed) are tested to determine if they can be considered as equivalent in performance. The slightly worse RMSECV can in that case be compensated by a better robustness of the resulting model. This is done using a randomisation test [6]. This test is applied to check the equality of performance of two prediction methods or the same prediction method at two different complexities. In this study, the probability was estimated as the average of three calculations with 249 iterations each, and the alpha value used was 5%. In the usual PCR [7], the variables are introduced into the model according to the percentage of variance they explain. This is called PCR top-down. But the PCs explaining the largest part of the global variance in X are not always the most related to y. In PCR with variable selection (PCR VS), the PCs are included in the model according to their correlation [8] with y, or their predictive ability [9]. 2.6 - Partial Least Squares Regression (PLS) Similarly to PCR, PLS [10] reduces the data to a small number of latent variables. The basic idea is to focus only on the systematic variation in X that is related to y. PLS maximises the covariance between 177 New Trends in Multivariate Analysis and Calibration the spectral data and the property to be modelled. The original NIPALS [11-12] algorithm was used in this study. In the same way as for PCR, the optimal complexity is determined by comparing the RMSECV obtained from models with various complexities. 
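As an illustration, this complexity selection by leave-one-out cross validation could be sketched as follows, using a generic PLS implementation; this is a simplified sketch and not the algorithm actually used in this study.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

def rmsecv_per_complexity(X, y, max_lv=10):
    """Leave-one-out RMSECV of PLS models of increasing complexity.

    Returns one RMSECV value per number of latent variables (1 .. max_lv).
    """
    loo = LeaveOneOut()
    rmsecv = []
    for n_lv in range(1, max_lv + 1):
        press = 0.0                       # prediction error sum of squares
        for train, test in loo.split(X):
            model = PLSRegression(n_components=n_lv).fit(X[train], y[train])
            y_hat = model.predict(X[test]).ravel()
            press += float(((y_hat - y[test]) ** 2).sum())
        rmsecv.append(np.sqrt(press / len(y)))
    return np.array(rmsecv)

# the complexity with the smallest RMSECV is a first estimate of the optimum:
# rmsecv = rmsecv_per_complexity(X, y); optimum = int(np.argmin(rmsecv)) + 1
```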
To avoid overfitting, this complexity is then confirmed or corrected by comparing the model leading to the smaller RMSECV with the more parsimonious ones using a randomisation test. 3– Signal Processing Methods 3.1 - Smoothing by moving average Smoothing by moving average (first order Savitzky-Golay algorithm [13]) is the simplest way to reduce noise in a signal. It has however important drawbacks. For instance, it modifies the shape of peaks, tending to reduce their height and enlarge their base. The size of the window chosen for the smoothing must be optimised in order not to reduce the predictive abilities of the models obtained. 3.2 - Filtering in Fourier domain Filtering was carried out in the Fourier domain [14]. The filtering method consists in applying a low pass filter [15] on the frequency domain : a frequency value, above which the Fourier coefficients should be kept, is selected. The cutoff frequency value was here automatically calculated on the bases of the power spectra (PS). The power spectrum of a function is the measurement of the signal energy at a given frequency. The narrowest peaks of interest in the signal are related to the minimum frequency to be kept in the Fourier domain. The energy corresponding to the non- informative peaks is calculated, and the power spectra are used to determine which frequencies should be kept depending on this value. 3.3 - Filtering in Wavelet Domain The main steps of signal denoising in Wavelet domain are the decomposition of the signal, the thresholding, and the reconstruction of the denoised signal [16]. The wavelet transform of a discrete signal f is obtained by : 178 Chapter 3 – New Types of Data : Nature of the Data Set w = Wf (8) where w is a vector containing wavelet transform coefficients and W is the matrix of the wavelet filter coefficients. The coefficients in W are derived from the mother wavelet function. The Daubechies family wavelet was used here. To choose the relevant wavelet coefficients (those related to the signal) a threshold value is calculated. Many methods are available. This was done here using the method kno wn as universal thresholding [17] (ThU) in which the threshold level is calculated from the standard deviation of the noise. Once the threshold is known, two different approaches are generally used, namely hard and soft thresholding. Soft thresholding [18] was used here, in this case the wavelet coefficients are reduced by a quantity equal to the threshold value. When the relevant wavelet coefficients wt are determined, the denoised signal ft can be rebuilt as : ft = W’ wt (9) 4 - Experimental 4.1 - Data set The data set was made of synthetic mixtures prepared from products previously analysed by gas chromatography in order to assess their purity. Those mixtures were designed to cover a wide range of concentrations representative for all the possible situations on the process. Only the spectra of the “pure” products and the binary mixtures are required to build the model in case of the classical calibration method. For all the inverse calibration methods, all the samples (except the replicates) are used in the model building phase. 
The data set consists of 52 spectra : 179 New Trends in Multivariate Analysis and Calibration - 1 spectrum for each of the 5 pure products (toluene, meta, para, and ortho-Xylene, and ethylbenzene) - 9 spectra of binary p-xylene / m-xylene mixtures (concentrations from 10/90 to 90/10 with a 10% step) - 10 equimolar binary mixtures consisting of all binary mixtures which can be prepared from the five pure products. - 10 equimolar ternary mixtures - 5 quaternary mixtures - 1 mixture including the five constituents - 10 replicates of randomly chosen mixtures Raman spectra were recorded using a spectroscopic device quite similar to the one industrially used in the ELUXYL separation process. The main differences are that a laser diode (SDL 8530) emitting at 785 nm was used instead of an argon ion laser (514.53 nm), and a 1 meter long optical fibre replaced the 200 meters one used on the process. Back scattered Raman signal was recovered through a Super DILOR Head equipped with interferential and Notch filters to prevent the Raman signal of silica to be superimposed to the Raman signal of the sample. The grating spectrometer was equipped with a CCD camera used in multiple spectra configuration. The emission line of a neon lamp could therefore also be recorded to allow wavelength calibration. The spectra were acquired from 930 to 650 cm-1 , no rotation of the grating was needed to cover this spectral range. The maximum available power at the output of the fibre connected to the laser is 250 mW. However, in order to prevent any damage to the filters, this power was reduced to a sample excitation power of 30 mW. Each spectrum was acquired during 10 seconds. This corresponds to the conditions on the industrial process, considering that concentration values have to be provided by the system every 15 seconds. The five remaining seconds should be enough for data treatment (possible pre-treatment, and concentration predictions). The wavelength domain retained in the spectra was specifically designed to fit the requirements of the classical calibration method. Thanks to the relatively simple structure of Raman spectra, it is sometimes possible to find a spectral region in which each of the peaks is readily assignable to one product of the mixture, and where there is not too much overlap. The spectral region has therefore been chosen so that each product is represented mainly by one peak (Fig. 2). There are at least two frequency regions with 180 Chapter 3 – New Types of Data : Nature of the Data Set no Raman back-scattering in this domain. This allows an easy recovery of the baseline. The spectral domain studied was anyway very restricted because of the focal of the instrument and the dispersion of the grating. a) b) Fig. 2. Spectra of the five pure products in the selected spectral domain. (2a) toluene,(2b) mxylene, (2c) p-xylene, (2d) o-xylene, (2e) ethyl-benzene. c) d) e) 181 New Trends in Multivariate Analysis and Calibration 4.2 - Normalisation of the Raman spectra It is known that the principal source of instability of the intensity of the Raman scattering is the possible variations of the intensity of the laser source. This imposes to normalise the spectra or to perform semi-quantitative measurements. In this study, repeatability has been evaluated using replicate measurements performed over a period of time of several days. This indicated some instability leading to a variation of about 2% on the Raman scattering intensity. It is therefore probable that a normalisation would have been desirable. 
However, given the spectral domain accessible with the instrument used, and the difference in the cross section of the substances present in the mixtures, a normalisation performed using for instance the total surface of the peaks was not considered. It was therefore necessary to study the improvement of the inverse calibration methods compared to the classical method without normalising the Raman spectra. 4.3 - Spectral shift correction Variation in ambient temperature has an effect on the monochromator present in the Raman spectrometer, and produces a spectral shift. The first part of the spectra is then used to perform a correction. The first 680 points (out of 1360) of each spectrum are not related to the studied mixture, but to the radiation from a neon lamp (Fig. 3). Fig. 3. Raman spectrum of a typical mixture. 182 Chapter 3 – New Types of Data : Nature of the Data Set The spectrum of this lamp shows very narrow peaks which wavelengths are perfectly known. The maximum of the most intense peak can be determined very precisely, and the spectrum is then shifted in such a way that this maximum is set to a given value. This is called the neon correction. At the end of the pre-treatment procedure, some small spectral regions on the extremities of the spectra were removed (from 930 to 895 cm-1 and from 685 to 650 cm-1 ). It was possible to remove these data points as they are known to be uninformative (containing no significant Raman emission from any of the compounds). The resulting spectra consisted of 500 points (Fig. 4). Fig. 4. Raman spectra of a synthetic mixture after “neon correction” (PX = p-xylene, 21.38 %; T = toluene, 20.13 %; EB = ethyl-benzene, 18.07 %; OX = o-xylene, 19.93 %; MX = m-xylene; 20.33 %). 5 – Results and discussion In all cases, separate models were built for each of the products. The results are given in terms of percentage of the result obtained with the classical calibration method. Results lower than 100% mean a lower RMSECV. The first and second derivative did not yield any improvement in the predictive ability of the models. More methods, like PLS II [10] or Uniformative Variable Elimination PLS [19] (UVE PLS), were used but did not lead to better models. 5.1 - Classical method This method applies classical multivariate calibration. The intensities of the peaks are represented as the result of the presence of a given number of chemical components with a certain concentration and a 183 New Trends in Multivariate Analysis and Calibration given cross section. As can be seen in system (4) and equation (5), according to the model built using this method, the mixture can contain only those components. Impurities that might be present are not taken into account, as the sum of the concentrations of the modelled components is always 100%. This method takes into account the variation of the laser intensity and always uses relative concentrations. These results were computed from the values given by the software after the spectra acquisition with a calibration performed using spectra from this data set. The results of this method are taken as reference. The RMSEP values for all the products are therefor set to 100 %. 5.2 - Univariate Linear Regression Linear regression models were built to relate the concentration of each of the products to the maximum of the corresponding peak, and to the average Raman scattering value of 3 to 7 points spectral windows centred on this maximum (Table 1). 
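Such a univariate model could be sketched as follows; the peak index and the window size are assumptions to be supplied by the user, and the code is only meant to illustrate the principle behind Table 1.

```python
import numpy as np

def peak_predictor(X, peak_index, window=1):
    """Average Raman intensity over a 1, 3, 5 or 7 point window centred on
    the peak maximum (window = 1 uses the maximum only)."""
    half = window // 2
    return X[:, peak_index - half:peak_index + half + 1].mean(axis=1)

def univariate_fit(x, y):
    """Ordinary least-squares fit y = b0 + b1 * x."""
    b1, b0 = np.polyfit(x, y, 1)
    return b0, b1

# e.g. for a peak whose maximum falls at column index j of the spectra X:
# x = peak_predictor(X, peak_index=j, window=5); b0, b1 = univariate_fit(x, y)
```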
Compared to those obtained with the classical multivariate method, the results obtained with linear regression are comparable for some compounds (toluene, oxylene), worse for some other (m- xylene, p- xylene) and better in one case (ethyl-benzene). These differences are due to the fact that models built here are univariate models, therefore not taking into account overlapping between peaks. Table 1. Relative RMSECV calibration results obtained using Linear Regression applied to the wavelength corresponding to the maximum of each peak and to the sum of the integrated intensities of 3 to 7 points spectral windows centred on this maximum. The wavenumber corresponding to the maximum of the peak is also given. toluene m-xylene p-xylene o-xylene eth-benzene Maximum (cm-1 ) 790 728 833 737 774 RMSECV 1 point 98.1 211.1 213.9 133.0 66.7 101.3 204.6 211.5 131.9 63.6 102.5 193.7 211.8 131.6 68.3 103.9 162.8 210.8 129.3 68.5 RMSECV 3 points RMSECV 5 points RMSECV 7 points 184 Chapter 3 – New Types of Data : Nature of the Data Set 5.3 - Stepwise MLR Stepwise-MLR appeared to give the best results (table 2). The models built with a critic al level of 1 % are parsimonious (between 1 and 4 variables retained) and all give better results than the ones obtained with the previous methods except in case of p-xylene. This model is built retaining only one variable. A slightly less parsimonious model could be expected to give best result without a significant loss of robustness. Table 2. Relative RMSECV calibration results obtained for each of the five products using Stepwise Multiple Linear Regression. α =1% toluene m-xylene p-xylene o-xylene eth-benzene RMSECV 96.1 72.2 155.4 69.2 73.6 # variables 3 4 1 3 2 RMSECV 23.5 6.6 0.2 9.1 0.0 # variables 20 22 34 15 35 α =5% As expected, the models built with α = 5 % retain more variables. But here, the number of retained variables is by far too high, the models are dramatically overfitted. Moreover, the RMSECV are so low that they can not be considered as relevant. Those results are only possible because RMSECV is not used in the variable selection step of the method. It is only used after the model is built to evaluate its predictive ability. The possibility of variables selected by chance correlation was then investigated. Variable selection methods can retain irrelevant variables because of chance correlation. It has been shown that a Stepwise selection applied to a simulated X spectral matrix filled with random values and a random Y concentration matrix will lead to retain a certain number of variables [20-21]. The cross validation performed on the obtained model will even lead to a ve ry good RMSECV result. This can also happen 185 New Trends in Multivariate Analysis and Calibration with more sophisticated variable selection methods like Genetic Algorithms [22]. It was shown that this behaviour is by far less frequent for methods working on the whole spectrum, like PCR or PLS [23]. This is actually what happens in this study. For instance, on the m-xylene model (22 variables retained), some variables that should not be considered as informative (not located on one of the peaks, low Raman intensity) have a quite high correlation coefficient with the considered concentration (table 3). Those variables also have high regression coefficients, so that although the Raman intensity for those wavelengths is quite low since many of them are located in the baseline, they take a significant importance in the model. Table 3. 
Model built with Stepwise selection for the m-xylene (18 first variables only). The correlation coefficient and the regression coefficient for the selected variables are also given. Order of selection Index of variable 1 2 3 4 5 6 7 8 9 398 46 477 493 63 45 14 80 463 Correlation coefficient 0.998 -0.488 0.221 0.134 -0.623 -0.122 0.565 -0.69 0.09 Regression coefficient 0.030 -4.47 1.50 0.97 -3.15 -1.36 3.26 -3.01 0.35 Order of selection 10 11 12 13 14 15 16 17 18 Index of variable 47 94 425 77 442 90 430 423 115 Correlation coefficient -0.4 -0.599 0.953 -0.67 0.61 -0.54 0.94 0.95 -0.39 Regression coefficient -3.41 -1.59 0.80 -3.32 1.79 -1.57 0.96 0.77 -0.27 Using the regression coefficient obtained for a variable, and the average Raman intensity for the corresponding wavelength, it is possible to evaluate the weight this variable has in the MLR model (table 4). One can see that the relative importance of variable 80, selected in fourth position, is about one third of the importance of the first selected variable. This relative importance explains why the last selected variables are still considered relevant and lead to a dramatic improvement of the RMSECV. In 186 Chapter 3 – New Types of Data : Nature of the Data Set this particular case, this is not the sign of a better model, but this shows the failure of cross validation combined with backward elimination. Table 4. Evaluation of the relative importance of selected variables in the MLR model built with Stepwise variable selection for m-xylene. Order of selection Index of variable Correlation coefficient Regression coefficient Raman intensity Weight in the model 1 398 0.9981 0.0298 1029.2 30.67 4 493 0.1335 0.9663 8.01 7.74 8 80 -0.69 -3.01 3.41 -10.26 5.4 -PCR VS and PLS Calibration models were built with PCR VS and PLS (table 5). These two models gave comparable results (except for p-xylene) and usually required 4 latent variables, except for Ethyl-Benzene that required 7 latent variables. These complexities do not appear to be especially high for models predicting the concentration of a product in a five compound mixture. Using more latent variables for Ethyl- Benzene is logical because this peak is the most broad and overlapped by other peaks. It is also the peak with the smallest Raman scattering intensity and it therefore has the worst signal/noise ratio. Compared to Stepwise MLR with α = 1 %, those latent variable methods gave systematically worse results, except in the case of p- xylene Table 5. Relative RMSECV calibration results obtained using Principal Component Regression with Variable Selection (the PCs are given in the order in which they are selected) and Partial Least Square. PCR VS toluene m-xylene p-xylene o-xylene ethbenzene RMSECV 128.1 92.3 208.9 125.6 84.4 Selected PCs 3421 2143 1324 1234 4321 187 New Trends in Multivariate Analysis and Calibration RMSECV 112.43 75.84 149.2 108.8 102.6 # factors 4 4 5 4 7 PLS 5.5 - Improved variable selection The modified Stepwise selection method enabled to improve the MLR models built for a critical level of 5 %. The models are more parsimonious and the RMSECV seem much more physically meaningfull (table 6). Table 6. Relative RMSECV calibration results obtained for each of the five products using Stepwise Multiple Linear Regression with improved variable selection method toluene m-xylene p-xylene o-xylene ethbenzene RMSECV 80.5 41.9 82.6 69.2 59.6 # variables 7 11 9 3 6 α=5% Some new variables are built with spectral windows from 3 to 13 points (table 7). 
This enlargement of variable happens in each model for the maximum of the peak corresponding to the modelled compound, but also for variables in the baseline or on the location of other peaks. However, this approach does not seem to solve the problem completely. For some models, variables are still retained because of chance correlation, leading to excessively high complexities in some cases (11 varia bles for m- xylene). 188 Chapter 3 – New Types of Data : Nature of the Data Set Table 7. Complexity of the MLR calibration models built using variables selected with the modified stepwise selection method. Size is the size of the spectral window centred on this variable and used as new variable. Retained variable 1 2 3 4 5 6 7 8 9 10 11 Index 250 474 271 460 265 272 467 Size 3 1 1 1 1 1 1 Index 398 443 46 72 99 22 78 475 464 44 415 Size 5 1 1 1 1 5 1 5 1 1 1 Index 145 159 125 164 136 480 449 85 158 Size 3 1 1 5 1 1 1 3 3 Index 374 438 29 Size 3 1 1 Index 291 50 17 42 28 486 Size 13 1 1 1 1 1 toluene m-xylene p-xylene o-xylene ethbenzene Genetic Algorithms were used with the following input parameters. Number of strings in each generation : 20 ; number of variables in each string : 10 ; frequency of cross-over : 50 %; mutations : 2 %, and backward elimination : once every 20 generations ; the number of generations :200. The models obtained are much better than the α = 5 % Stepwise-MLR models in terms of complexity. However, the complexities are still high (table 8), which seems to indicate that the G.A. selection is also affected by random correlation. Moreover, the RMSECV values are comparable with those obtained with the α = 1 % Stepwise MLR model, but they are worse than those obtained with the modified Stepwise approach. Globally, the G.A. approach is therefore not more efficient than the modified Stepwise procedure. Table 8. Relative RMSECV calibration results obtained for each of the five products using Genetic Algorithm Multiple Linear Regression. 189 New Trends in Multivariate Analysis and Calibration α =5% eth- toluene m-xylene p-xylene o-xylene RMSECV 109.1 82.9 179.8 98.8 78.9 # variables 5 9 8 9 5 benzene 5.6 - Improved signal pre-processing. Another possibility to avoid the inclusion of noise variables in MLR is to decrease the noise by signal pre-processing. By plotting the difference between a spectrum and the average of the three replicates of the same sample, one can have an estimation of the noise structure (Fig. 5). It appears that the noise variance is not constant along the spectrum but heteroscedastic, it increases as the signal of interest increases. Unfortunately, it is not possible to use the average of spectra in practice to achieve a better signal/noise ratio because this would lead to acquisition times non-compatible with the kinetic of the process. a) b) 190 Fig. 5. Para-xylene spectrum (5a) and estimation of the noise for this spectrum (5b). Chapter 3 – New Types of Data : Nature of the Data Set Smoothing by moving average was used to reduce the noise in the signal. The optimisation, of the window size was done for each compound individually using PCR VS and PLS models. The optimal size for the smoothing average window is 5 points. For this window size, the RMSECV values of the PCR VS and PLS models are slightly improved (table 9). Table 9 Relative RMSECV calibration results for PCR VS (the PCs are given in the order in which they are selected) and PLS models. Spectra smoothed using a 5 points window moving average. 
PCR VS PLS toluene m-xylene p-xylene o-xylene ethbenzene RMSECV 95.5 83.9 152.9 70.6 68.7 PCs 3421 2143 1324 1234 4321 RMSECV 95.7 94.2 154.4 69.9 52.3 # factors 4 4 5 4 7 The complexities are unchanged, showing that no extra component was added because of noise. In the case of Stepwise MLR, the model complexities are reduced, but the Stepwise variable selection method is still subject to chance correlation with those smoothed data (table 10). Table 10 Relative RMSECV results for Stepwise MLR models. Spectra smoothed using a 5 points window moving average. toluene m-xylene p-xylene o-xylene ethbenzene RMSECV 96.1 72.2 155.4 69.2 73.6 # variables 3 4 1 3 2 RMSECV 73.5 38.9 82.6 52.9 45.9 # variables 8 18 9 8 10 α=1% α=5% Some of those models seem to be quite reasonable. For instance, the model built for toluene uses 8 variables and gives a relative RMSECV of 73.5 %, but more important, the wavelengths retained seem 191 New Trends in Multivariate Analysis and Calibration to have a physical meaning (Fig. 6). The first wavelength selected is located on the peak maximum, the second takes into account the overlapping due to the p-xylene peak, the third is on the baseline, the fourth takes into account the overlapping due to the ethyl-benzene peak, and three extra variables are selected around the peak maximum. Fig. 6. Wavelengths selected by the Stepwise selection method for the toluene model, and order of selection of those variables displayed on the spectrum of a typical mixture containing all 5 components. On the other hand, for some models, the method has retained variables in a much more surprising way. In the case of the model built for m-xylene for instance, 18 variables are retained. Most of those variables are located in non- informative parts of the spectra (Fig. 7) and are selected because of chance correlation. In that case, the denoising has not been efficient and chance correlation still occurs. Fig. 7. Wavelengths selected by the Stepwise selection method for the mxylene model, and order of selection of those variables displayed on the spectrum of a typical mixture containing all 5 components. 192 Chapter 3 – New Types of Data : Nature of the Data Set In order to check if the optimal smoothing window size is the same for PCR/PLS and Stepwise MLR, the fitness of the Stepwise-MLR models was evaluated depending on this parameter (table 11). The results show again that because smoothing by moving average modifies the shape and height of the peaks, this kind of smoothing can lead to degradation of the models. The optimal window size is anyway not the same for all of the models and it is difficult to find a typical behaviour in the calibration results. Table 11 Complexity and performance (relative RMSECV) of Stepwise MLR models (α = 5 %) depending of the window size used for smoothing by moving average. The best model in terms of complexity and RMSECV value for each constituent is written in bold. 
toluene m-xylene p-xylene o-xylene ethbenzene # variable s 20 22 34 15 35 RMSECV 23.5 6.6 0.2 9.1 0.0 # variables 18 19 12 10 34 RMSECV 34.3 13.0 106.0 43.9 0.1 # variables 8 18 9 8 10 RMSECV 73.5 38.9 82.6 52.9 45.9 # variable s 9 8 4 6 9 RMSECV 66.9 54.1 142.5 116.9 53.1 # variables 11 10 10 2 10 RMSECV 41.8 45.2 126.0 98.3 57.0 Smoothing on 11 points # variables 6 7 7 2 7 RMSECV 78.5 62.8 116.3 99.5 60.6 Smoothing on 21 points # variables 6 6 4 3 9 RMSECV 55.9 57.4 162.9 93.6 53.1 No smoothing (table 2) Smoothing on 3 points Smoothing on 5 points (table 10) Smoothing on 7 points Smoothing on 9 points 193 New Trends in Multivariate Analysis and Calibration To apply filtering in the Fourier domain, a slightly wider spectral region had to be retained (removing less points at the extremities of the original data after neon-correction) in order to set the number of points in the spectra to 512 (29 points). The Stepwise-MLR models obtained using the denoised spectra (Fig. 8) are by far better especially in terms of complexity. The models are much more parsimonious with only 3 to 5 wavelengths retained and the RMSECVs are the best obtained for all the substances (table 12). a) b) 194 Fig. 8. Example of a typical spectrum of a five compounds mixture before (8a) and after (8-b) denoising in the Fourier domain. Chapter 3 – New Types of Data : Nature of the Data Set Table 12. Relative RMSECV calibration results obtained with Stepwise MLR applied to data denoised in the Fourier domain. α =1% α =5% toluene m-xylene p-xylene o-xylene ethbenzene RMSECV 81.6 70.2 165.0 87.3 69.3 # variables 5 4 3 4 3 RMSECV 81.6 70.2 145.8 65.1 69.3 # variables 5 4 5 4 3 Some models built with a critical level α = 1 % are exactly identical to those built with α = 5 %. The fact that increasing the critical level does not lead to selecting more variables could mean that the models are optimal. Some are slightly worse for equal or smaller complexity. PCR VS and PLS models were also built using the filtered spectra in order to check if those method would benefit from this pretreatment (table 13). It appears that the PCR VS and PLS models built on denoised data are equivalent or worse than the ones built on raw data. This probably means that this denoising was too extensive in the case of a full spectrum method. The benefit of removing noise was lost because of the fact that peak shapes were damaged. In this case the pre-treatment has a deleterious effect on the resulting model. Table 13. Relative RMSECV calibration results obtained with PCR VS (the PCs are given in the order in which they are selected) and PLS (the number of factors retained is given) on the spectra denoised in the Fourier domain toluene m-xylene p-xylene o-xylene ethbenzene RMSECV 159.6 87.4 205.1 111.2 132.3 PCs 3421 2143 1324 1234 4321 RMSECV 146.3 87.3 154.0 89.1 101.8 # factors 5 4 5 5 5 PCR CV PLS 195 New Trends in Multivariate Analysis and Calibration The same spectra (512 points) were used to perform filtering in the wavelet domain. The Daubechies family wavelet was used on the first level of decomposition only (Fig. 9). Higher decomposition levels were investigated, but this did not lead to better models. The results obtained are generally good (table 14). However, both complexities and RMSECV values are worse than in the case of filtering in the Fourier domain, except for p-xylene. 
In the case of o-xylene, only three variables are retained, this is the same complexity as in the Stepwise-MLR model built with a critical level of 1% on data before denoising, but the RMSECV is worse for the denoised data. This could be expected when looking at the denoised spectra. Spectra denoised in the wavelets domain (Fig. 9-b) have a more angular shape than those denoised in the Fourier domain (Fig. 8-b). This indicates that the shape of the peaks is probably more affected by the wavelets pre-treatment. The filtering in wavelet domain can therefore be considered here as less efficient than denoising in Fourier domain. Fig. 9. Example of a typical spectrum of a mixture of five compounds before (8-a) and after (8-b) denoising in the wavelet domain. a) 196 Chapter 3 – New Types of Data : Nature of the Data Set b) Table 14. Relative RMSECV calibration results obtained with Stepwise MLR (α = 5 %) applied to data denoised in the wavelet domain. toluene m-xylene p-xylene o-xylene eth-benzene RMSECV 92.5 52.9 101.4 107.9 49.6 # variables 4 7 5 3 7 6 - Conclusion Inverse calibration methods were used on Raman spectroscopic data in order to model the concentrations of individual compounds in a C8 compounds mixture. These methods outperformed the classical calibration method currently used. In this classical calibration method, the sum of the relative concentrations of the modelled components is always 100 %, impurities are not taken into account. In inverse calibration, the concentrations are assumed to be a function of the spectral values (Raman scattering). Therefore, a perfect knowledge of the composition of the system is not necessary and the presence of possible impurities should not be a problem anymore. This is the main limitation of classical multivariate calibration and the main reason why an even more significant improvement can be expected when using inverse calibration methods on real data containing impurities. Moreover, the acquisition conditions and the spectral region studied were chosen based on constraints related to the instrument, the industrial process and the calibration method used. These conditions were therefore not 197 New Trends in Multivariate Analysis and Calibration optimal for this study. In fact, inverse calibration methods would probably have benefited from using more information on a wider spectral region. It can be expected that, for a given substance, calibration performed on several informative peaks would outperform the current models. Another interesting point is that the total integrated surface of a complex Raman spectrum is directly related to the intensity of the excitation source. Working in a wider spectral region would allow performing a standardisatio n of the spectra to take into account the effect of variations of the laser intensity. This would probably have improved significantly the calibration results. This will be investigated in a second part of this study, using an instrument with better performances particularly in terms of spectral region covered. The very specific and simple structure of Raman spectra implied that the most sophisticated methods are not the most efficient. It was shown that Stepwise Multiple Linear Regression leads to the best models. One problem is that the Stepwise variable selection method is disturbed by noise in the spectra, which induces the selection of chance correlated variables. This problem was efficiently resolved by denoising. 
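As an illustration of the kind of Fourier-domain low-pass filtering referred to here, a simplified sketch is given below; the cut-off in the sketch is an arbitrary fraction of the coefficients, whereas in the study it was derived automatically from the power spectrum.

```python
import numpy as np

def fourier_lowpass(spectrum, keep_fraction=0.1):
    """Denoise a single spectrum by zeroing its high-frequency Fourier
    coefficients (simple low-pass filter).

    keep_fraction: fraction of the real-FFT coefficients retained.
    """
    coeffs = np.fft.rfft(spectrum)
    keep = max(1, int(len(coeffs) * keep_fraction))
    coeffs[keep:] = 0.0
    return np.fft.irfft(coeffs, n=len(spectrum))

# applied to a matrix X with one spectrum per row:
# X_denoised = np.apply_along_axis(fourier_lowpass, 1, X)
```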
Whatever denoising method is used, the procedure should always be seen as finding a compromise between actual noise removal (improves the performance of the model) and changing the peaks shape and height (deleterious effect on the resulting model). The best method for this purpose appeared to be filtering in the Fourier domain. The problems related to noise could also disappear when the instrument with better performances is used, as the signal/noise ratio will be much higher. R EFERENCES [1] Ph. Marteau, N. Zanier, A. Aoufi, G. Hotier, F. Cansell, Vibrational Spectroscopy 9 (1995) 101. [2] Ph. Marteau, N. Zanier, Spectroscopy 10 (1995) 26. [3] N. R. Draper, H. Smith, Applied Regression Analysis, second edition (Wiley, New York, 1981). [4] R. Leardi, R. Boggia, M. Terrile, J. Chemom. 6 (1992) 267. [5] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem. 67 (1995) 4295. [6] H. van der Voet, Chemom. Intell. Lab. Syst. 25 (1994) 313. [7] T. Naes, H. Martens, J. Chemom. 2 (1998) 155. [8]. J. Sun, J. Chemom. 9 (1995) 21. [9] J. M. Sutter, J. H. Kalivas, P.M. Lang, J. Chemom. 6 (1992) 217. [10] H. Martens, T. Naes, Multivariate Calibration (Wiley, Chichester,1989). 198 Chapter 3 – New Types of Data : Nature of the Data Set [11] D. M. Haaland, E. V. Thomas, Anal. Chem. 60 (1988) 1193. [12] P. Geladi, B. K. Kovalski, Anal. Chim. Acta 185 (1986) 1. [13] A. Savitzky and M. J. E. Golay, Anal. Chem. 36 (1964) 1627. [14] G. W. Small, M. A. Arnold, L. A. Marquardt, Anal. Chem. 65 (1993) 3279. [15] H. C. Smit, Chemom. Intell. Lab. Syst. 8 (1990) 15. [16] C. R. Mittermayer, S. G. Nikolov, H. Hutter, M. Grasserbauer, Chemom. Intell. Lab. Syst. 34 (1996) 187. [17] D. L. Donoho in: Y. Mayer and S. Roques, Progress in Wavelet Analysis and Application, (Edition Frontiers, 1993). [18] D.L.Donoho, IEEE Transaction on Information Theory 41 (1995) 6143. [19] V. Centner, D. L. Massart, O. E. de Noord, S. de Jong, B. M. V. Vandeginste, C. Sterna, Anal. Chem. 68 (1996) 3851. [20] J. G. Topliss, R. J. Costello, Journal of Medicinal Chemistry 15 (1971) 1066. [21] J. G. Topliss, R. P. Edwards, Journal of Medicinal Chemistry 22 (1979) 1238. [22] D. Jouan-Rimbaud, D. L. Massart, O. E. de Noord, Chem. Intell. Lab. Syst. 35 (1996) 213. [23] M. Clark, R. D. Cramer III, Quantitative. Structure Activity Relationship 12 (1993) 137. 199 New Trends in Multivariate Analysis and Calibration INVERSE MULTIVARIATE CALIBRATION APPLIED TO ELUXYL RAMAN DATA ChemoAC internal Report, 03/2000 F. Estienne and D.L. Massart * ChemoAC, Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium. E-mail: fabi@fabi.vub.ac.be ABSTRACT An industrial process separating p-xylene from mainly other C 8 aromatic compounds is monitored with an online remote Raman analyser. The concentrations of six constituents are currently evaluated with a classical calibration method. The aim of the study being to improve the precision of the monitoring of the process, inverse calibration linear methods were applied on a synthetic data set, in order to evaluate the improvement in prediction such methods could yield. Several methods were tested including Principal Component Regression with Variable Selection, Partial Least Square Regression or Multiple Linear Regression with variable selection (Stepwise or based on Genetic Algorithm). Methods based on selected wavelengths are of great interest because the obtained models can be expected to be very robust toward experimental conditions. 
However, because of the important noise in the spectra due to short accumulation time, variable selection methods selected a lot of irrelevant variables by chance correlation. Strategies were investigated to solve this problem and build reliable robust models. These strategies include the use of signal pre-processing (smoothing and filtering in the Fourier or Wavelets domain), and the use of an improved variable selection algorithm based on the selection of spectral windows instead of single wavelengths when this leads to a better model. The best results were achieved with Multiple Linear Regression and Stepwise variable selection applied to spectra denoised in the Fourier domain. * Corresponding author K EYWORDS : Chemometrics, Raman Spectroscopy, Multivariate Calibration, random correlation. 200 Chapter 3 – New Types of Data : Nature of the Data Set 1 - Introduction The task of our group in this study was to evaluate whether the use of Inverse Calibration methods could lead to an improvement in the quality of the online monitoring of the Eluxyl process. The process is currently monitored using the experimental setup and software developed by Philippe Marteau. This software implements a classical multivariate calibration method based on the measurement of the areas of the Raman peaks. The main assumption when building a classical calibration model to determine concentrations from spectra is that the error lies in the spectra. The model can be seen as : Spectra = f (Concentrations). Or, in a matrix form: R = C . K , where R is the spectral response matrix, C the concentration matrix, and K the matrix of molar absorptivities of the pure components. This implies that it is necessary to know the concentrations of all the products present in the mixture in order to build the model, at least if a high precision is required. Taking into account that a small quantities (<5%) of impurities (i.e. C9 + compounds) is present in the mixture when working on real data, the classical calibration method is naturally limited in precision if all the impurities are not clearly identified in the spectrum. In inverse calibration, one assumes that the error lies in the measurement of the concentrations. The model can be seen as : Concentrations = f (Spectra). Or, in a matrix form : C = P . R , where R is the spectral matrix, C the concentration matrix, and P the regression coefficients matrix. A perfect knowledge about the composition of the system is then not necessary. Better results are therefore expected as the presence of impur ities does not affect the prediction of the concentration of the compounds of interest (at least if these impurities were present in the calibration data set used to build the model). 2 – Data set used in this study The data set was made of synthetic mixtures prepared from products previously analysed by gas chromatography in order to assess their purity. Those mixtures were designed to cover a wide range of concentrations representative for all the possible situations on the process. The data set consists of 71 spectra : 201 New Trends in Multivariate Analysis and Calibration - 1 spectrum for each of the 5 pure products (toluene, meta, para, and ortho-Xylene, and ethylbenzene) - 10 equimolar binary mixtures consisting of all binary mixtures which can be prepared from the five pure products. 
- 10 equimolar ternary mixtures - 5 equimolar quaternary mixtures - 1 equimolar mixture including the five constituents - 9 spectra of binary para- xylene / meta-xylene mixtures (concentrations from 10/90 to 90/10 with a 10% step) - 5 spectra of binary toluene / meta- xylene mixtures (concentrations from 10/90 to 90/10 %) - 10 replicates of randomly chosen mixtures - 16 mixtures including the five constituents with various concentrations Spectra were acquired from 0 to 3400 cm-1 with a 1.7 cm-1 step. After interpolation by the instrument software, the spectra had a 0.3 cm-1 step, leading to 11579 data points per spectrum. 3 – Pre-processing The main problem in this data set was due to the instability of the excitation source used during the acquisition of the spectra. The laser used for excitation was in fact ageing, leading to the fact that it could deliver only one half of its nominal power at the beginning of the acquisition period, and only one fourth at the end of the acquisitions. This is not a problem when relative concentrations have to be evaluated, like this is the case with the software developed by Philippe Marteau. But this problem has to be solved when one wants to evaluate absolute concentrations. The best way would be to have a reference sample, independent from the sample studied, but measured at the same time with the same excitation source. The spectra could therefore be corrected to take into account the intensity variations of the source. This was not available here. The only way left to normalise the spectra was to work on their surface. It would have been easier if the mixtures studied had contained many products, leading to very complex spectra which total surfaces could have been considered constant. In this case, it would have been enough scale all the spectra to a given value. In the present case, the small number of substances with very different cross-sections forbids the use of such a methodology. It was therefore 202 Chapter 3 – New Types of Data : Nature of the Data Set necessary to try and find a part of the spectra with constant enough surface so that the scaling can be performed according to this part only. The choice was made empirically, testing the results of a benchmark method on data normalised according to the surface of a given part (or given parts) of the spectra. The benchmark method chosen was Principal Component Regression with Variable Selection (PCR-VS) [1,2,3,4]. Suitability of the pre-processing was evaluated according to the results of models built for para-xylene. Five zones were defined in the spectra (Fig. 1) : Zone 1 : 0-160 cm-1 à nothing Zone 2 : 160-1700 cm-1 à Actual spectra Zone 3 : 1700-2500 cm-1 à Baseline Zone 4 : 2500-3200 cm-1 à CH range Zone 5 : 3200-3400 cm-1 à noise Fig. 1. Spectra of the products with the 5 spectral domains defined. The models were built using 4 to 7 principal components. The results are given in terms of Root Mean Squared Error of Cross Validation (RMSECV). 203 New Trends in Multivariate Analysis and Calibration Table 1. RMSECV for a PCR-VS model built for para-xylene depending of standardisation. Reference zone(s) No standardis ation 1+2+3+4+5 2 3 4 2+4 RMSECV value 4.23 0.69 0.79 1.54 0.64 0.56 Number of PCs retained 7 4 5 4 4 5 It appears that the best way to normalise the spectra is to scale them according to the total surface corresponding to zones 2 and 4 (actual spectra and CH range). 
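Such a surface-based scaling could be sketched as follows; the index ranges in the usage example are only approximate translations of the zone boundaries and should be adapted to the actual wavenumber axis.

```python
import numpy as np

def scale_by_zone_area(X, zone_slices):
    """Scale each spectrum (row of X) so that its summed intensity over the
    reference zones equals 1.

    zone_slices: list of (start, stop) column-index pairs delimiting the
    reference zones (here, zones 2 and 4).
    """
    area = np.zeros(X.shape[0])
    for start, stop in zone_slices:
        area += X[:, start:stop].sum(axis=1)
    return X / area[:, None]

# approximate index ranges for zone 2 (160-1700 cm-1) and zone 4
# (2500-3200 cm-1), assuming the 0.3 cm-1 sampling step mentioned above:
# X_scaled = scale_by_zone_area(X, [(533, 5667), (8333, 10667)])
```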
It can be seen also how tremendously important such a correction is, as the results can be improved by a factor 10. However, the solution is more than probably not optimal for, as said before, considering that the total surface of these two spectral zones should theoretically be constant is not a valid hypothesis. After this standardisation procedure, the baseline shift visible in spectral domain #3 was almost perfectly removed. The use of a specific baseline removal procedure did not further improve the calibration results. The spectra were corrected for wavelength shift using the corresponding Neon spectra. However, with spectra from this new experimental setup, this correction happened to be by far less crucial than on previous Eluxyl Raman data we investigated. 3 – Choice of the calibration method to be used In a previous study performed a synthetic data set simulating ELUXYL data, it had been shown that the most effective calibration method was Stepwise Multiple Linear Regression (Stepwise-MLR) [5] applied to spectra denoised in the Fourier domain. At that time, no non- linearity had been detected in the data set. Considering the much better signal/noise ration and repeatability in this data set, it was necessary to investigate for non- linearity again. In fac t, it now appears very clearly that the mixture effects are not linear. It is the case for instance for meta-xylene/para-xylene mixture. The results of a PCR-VS model of meta-xylene show that there is a clear deviation from linearity (Fig. 2-a) on the first PC. This is especially visible for samples number 2-3 and 32 to 40 (corresponding to pure meta and 204 Chapter 3 – New Types of Data : Nature of the Data Set para-xylene, and binary meta/para mixtures with various concentrations). Adding more components to this model (Fig 2-b,c) tends to accommodate for the non- linearity, but even for the optimal 4 components model, the non- linearity was not completely corrected (Fig. 2-d). Fig. 2-a. Y vs Yhat, PCR-VS model for meta-xylene. 1 component. a) Fig. 2-b. Y vs Yhat, PCR-VS model for meta-xylene. 2 components. b) 205 New Trends in Multivariate Analysis and Calibration Fig. 2-c. Y vs Yhat, PCR-VS model for meta-xylene. 3 components. c) Fig. 2-d. Y vs Yhat, PCR-VS model for meta-xylene. 4 components. d) Because of these non- linearities, linear methods such as PCR-VS, Partial Least Squares Regression (PLS) [6-8] and Stepwise MLR did not lead to good results (RMSECV values always around 0.5). It was therefore decided to work with non- linear methods. The most representative of these non- linear methods is artificial Neural Networks (NN) [9,10]. Individual models were built for each of the compounds, using the scores of a PCA as input variable. PCA was applied on the spectra limited to their informative parts (spectral ranges 2 and 4), after column centering. The calibration results are much better than those obtained with linear methods (Table 2). They are given in terms of Root Mean Squared Error of Monitoring. 206 Chapter 3 – New Types of Data : Nature of the Data Set Table 2. Results for the NN models. Compound Toluene Meta-xylene Para-xylene Ortho-xylene Ethyl-benzene Topology (input-hidden-output) 5-4 -1 6-4 -1 6-3 -1 6-4 -1 6-3 -1 Input variables (in order of sensitivity) 12436 234156 123457 324156 431269 RMSEM 0.14 0.18 0.10 0.13 0.15 As can be seen from the monitoring prediction results (Fig. 3), the NN perfectly corrected for the nonlinearity in the data set. Fig. 3. Y vs Yhat, optimum NN model for para- xylene. 
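A minimal sketch of such a model (PCA scores feeding a small feed-forward network) is given below; it relies on a generic machine-learning library rather than on the software actually employed, and the topology shown is only indicative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

def pca_nn_model(X, y, n_scores=6, n_hidden=3):
    """PCA scores of the (column-centred) spectra used as inputs to a small
    n_scores - n_hidden - 1 feed-forward network."""
    model = make_pipeline(
        PCA(n_components=n_scores),          # PCA centres the columns itself
        MLPRegressor(hidden_layer_sizes=(n_hidden,),
                     activation="tanh",
                     max_iter=5000,
                     random_state=0),
    )
    return model.fit(X, y)

# monitoring error on a separate set, in the spirit of the RMSEM reported above:
# y_hat = pca_nn_model(X_cal, y_cal).predict(X_mon)
# rmsem = np.sqrt(np.mean((y_hat - y_mon) ** 2))
```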
The deviation from linearity can be seen by plotting the projection of the input variables on the transfer function of the hidden nodes (Fig. 4-a to 4-c). This shows again that some slight but real non-linearity was present in the data. Fig. 4-a. Input variables projected on the transfer function of the 1st hidden node in a 6-3-1 NN model for para-xylene. Fig. 4-b. Input variables projected on the transfer function of the 2nd hidden node in a 6-3-1 NN model for para-xylene. Fig. 4-c. Input variables projected on the transfer function of the 3rd hidden node in a 6-3-1 NN model for para-xylene. 4 – Industrial data A smaller data set made of 16 samples taken on the industrial process became available very recently. It would make no sense to apply the Neural Networks trained on the synthetic data set to this data set. This is because the new samples contain impurities, so that they would appear as outliers to the previous model. The predicted concentrations would therefore be erroneous. Moreover, because of the small number of industrial samples available, it is not possible to build a reliable and robust NN calibration model. However, it is possible to get an idea of the overall performance of inverse calibration methods on the industrial data by building, for instance, a PCR-VS model on the new data set (containing impurities), and comparing it to an equivalent model built using the synthetic data set. It appears that the results obtained on the small industrial data set are generally better (except for toluene), and sometimes even much better (for ortho-xylene) than those obtained for the synthetic data set (Table 3).

Table 3. Results of PCR-VS models for synthetic and industrial data, given in terms of RMSECV.

Product         | Synthetic data set (standardised data)      | Industrial data set (non-standardised data)
                | PCs selected (in order)      RMSECV         | PCs selected (in order)      RMSECV
Toluene         | 1 2 4 3 5                    0.49           | 1 2 5 4 7                    0.53
Meta-xylene     | 2 3 6 4 1                    1.04           | 1 2                          0.44
Para-xylene     | 1 2 3 4                      0.53           | 2 1                          0.45
Ortho-xylene    | 3 2 4 1                      0.77           | 1 2                          0.09
Ethyl-benzene   | 4 3 1 2                      0.79           | 1 5 6 2 3                    0.51

It has to be taken into account that the number of samples in the industrial data set is very small, and that the distribution of concentrations in these samples is very limited. This can explain the sometimes dramatic improvement of the results, if for instance the modelled points were distributed in such a way that the non-linearities were no longer prejudicial to the model. Another explanation can be that the laser source seemed to be much more stable during the acquisition of the industrial data set (new laser or shorter acquisition period). This can be seen from the fact that the results are not improved by applying the standardisation procedure. However, the good results already obtained on these real industrial data with linear methods indicate that it would probably be possible to reach an excellent precision in prediction if a Neural Network model (or another non-linear method) were built with a sufficient number of calibration samples. 5 – Conclusion The instrument used to produce this data set made it possible to achieve a much better signal/noise ratio and repeatability. With this improvement in the quality of the data, it was seen that the data contained some non-linearities.
This problem could be solved efficiently by using a non-linear inverse calibration method: Artificial Neural Networks. However, the poor stability of the excitation source led to considerable difficulties in the calibration, which could only partially be overcome by spectral scaling. It should also be taken into consideration that only a very short time was available for this analysis; it is probable that the results can be further improved (for instance by means of better pre-processing). The few industrial samples provided were not sufficient to build reliable Neural Network models. However, the behaviour of the linear methods towards these samples indicates that very good results can be expected when applying NN to industrial data with a sufficiently large calibration set. REFERENCES [1] Principal Component Regression Tutorial http://minf.vub.ac.be/~fabi/calibration/multi/pcr/ [2] T. Naes, H. Martens, J. Chemom. 2 (1998) 155. [3] J. Sun, J. Chemom. 9 (1995) 21. [4] J. M. Sutter, J. H. Kalivas, P. M. Lang, J. Chemom. 6 (1992) 217. [5] N. R. Draper, H. Smith, Applied Regression Analysis, second edition (Wiley, New York, 1981). [6] H. Martens, T. Naes, Multivariate Calibration (Wiley, Chichester, 1989). [7] D. M. Haaland, E. V. Thomas, Anal. Chem. 60 (1988) 1193. [8] P. Geladi, B. R. Kowalski, Anal. Chim. Acta 185 (1986) 1. [9] Neural Networks Tutorial http://minf.vub.ac.be/~fabi/calibration/multi/nn/ [10] J. R. M. Smits, W. J. Melssen, L. M. C. Buydens, G. Kateman, Chemom. Intell. Lab. Syst. 22 (1994) 165. CHAPTER IV NEW TYPES OF DATA : STRUCTURE AND SIZE This last chapter deals with new approaches in both multivariate calibration and data exploration. These approaches are made necessary by data showing new types of structure or very large size. The first paper in this chapter, "Multivariate calibration with Raman data using fast PCR and PLS methods", returns to the high-quality Raman data set treated in Chapter 3. The focus is this time on the large size of this data set. This work shows how classical methods such as PCR or PLS can be made significantly faster without compromising their prediction quality. The second paper in this chapter, "Multi-Way Modelling of High-Dimensionality Electro-Encephalographic Data", presents a data set that combines several novelties and challenges for chemometrical methodology. First of all, this data set is not chemistry but pharmacy related, since it is made of electro-encephalographic measurements performed during the clinical study of a new anti-depressant drug. It also has a very complex structure, with more than 35000 measurements and up to 6 dimensions. The methods used proved particularly efficient, enabling a deep understanding of the data and the underlying phenomena. The last paper in this chapter, "Robust Version of Tucker 3 Model", shows how multi-way methods can be modified, in the same way as classical chemometrical methods, in order to make them robust to difficult data sets. The author's contribution to this work was to participate in the method development, to perform the calculations on the real data set, and to write the corresponding parts of the article (chapters 3.2 and 4.2). Apart from giving another example of the application of chemometrics to a new type of data, this chapter demonstrates the usefulness of multi-way methods for data with very high dimensionality. The Tucker 3 model
The Tucker 3 model 212 Chapter 4 – New Types of Data : Structure and Size was in particular applied to a 6-way data set. It is probably the first time in the chemometrical field that a model with such high dimensionality with a real data set proves interpretable. 213 New trends in Multivariate Analysis and Calibration MULTIVARIATE CALIBRATION WITH RAMAN DATA USING FAST PCR AND PLS METHODS Analytica Chimica Acta, 450 1 (2001) 123-129. F. Estienne and D.L. Massart * ChemoAC, Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium. E-mail: fabi@fabi.vub.ac.be ABSTRACT Linear and non- linear calibration methods (Principal Component Regression, Partial Least Squares Regression and Neural Networks) were applied to a slightly non- linear Raman data set. Because of the large size of this data set, recently introduced linear calibration methods specifically optimised for speed were also used. These fast methods achieve speed improvement by using the Lanczos decomposition for the singular value decomposition steps of the calibration procedures, and for some of their variants, by optimising the models without cross- validation. Linear methods could deal with the slight non-linearity present in the data by including extra components, therefore performing comparably to Neural Networks. The Fast methods performed as well as their classical equivalents in terms of precision in prediction, but the results were obtained considerably faster. It however appeared that cross-validation remains the most appropriate method for model complexity estimation. * Corresponding author K EYWORDS : Multivariate Calibration, Raman spectroscopy, Lanczos decomposition, Fast Calibration methods. 214 Chapter 4 – New Types of Data : Structure and Size 1 - Introduction Data treated by chemometricians tend to get larger and larger. The data set considered in our study contains 71 spectra that were acquired from 0 to 3400 cm-1 with a 1.7 cm-1 step. After interpolation by the instrument software, the spectra had a 0.3 cm-1 step, leading to 11579 data points per spectrum (Fig. 1). This number of variables was rounded to 10000 by removing points without physical significance at both extremities of the spectra. The data set consists of spectra of mixtures obtained from five pure BTEX products (benzene, toluene, ortho, meta and para xylene, all C8 molecules) previously analysed by gas chromatography in order to assess their purity. These mixtures were designed to cover a wide range of concentrations representative for all the possible mixtures that can be obtained with these five compounds, and specifically cover binary mixtures in order to investigate non-linear effects. The data set was split in calibration and test sets. Fig. 1. Spectra of the five pure products. The calibration set consists of 51 spectra : - 1 spectrum for each of the 5 pure products. - 10 equimolar binary mixtures consisting of all binary mixtures which can be prepared from the five pure products. 
- 10 equimolar ternary mixtures
- 5 equimolar quaternary mixtures
- 1 equimolar mixture including the five constituents
- 9 spectra of binary product 2 / product 3 mixtures (concentrations from 10/90 to 90/10 with a 10 % step)
- 5 spectra of binary product 1 / product 2 mixtures (concentrations from 10/90 to 90/10 with a 20 % step)
- 10 mixtures including the five constituents with various random concentrations

The test set used to assess the models' predictive ability is made of 20 spectra :
- 20 mixtures including the five constituents with various random concentrations

The shape of the obtained design is shown in Fig. 2-a,b.

Fig. 2-a. Score plot of PC1 vs PC2 vs PC3. The test points are in a circle.

Fig. 2-b. Score plot of PC1 vs PC2 vs PC4. The test points are in a circle.

Calibration methods such as Principal Component Regression (PCR), Partial Least Squares Regression (PLS) or Neural Networks were used on this data set. Apart from these usual methods, because of the large size of this data set, it was also interesting to apply calibration methods specifically optimised for speed. Such fast methods deriving from PCR and PLS were recently proposed by Wu and Manne [1], who compared them to their classical equivalents on five near infrared (NIR) data sets. The new methods reportedly achieved equivalent prediction results, using models with identical complexities, but the speed of the new algorithms was much higher. These fast methods were therefore applied in this study.

2 - Methods

2.1 - Principal Component Regression with variable selection (PCRS)

This method includes two steps. The original data matrix X(n,p) is approximated by a small set of orthogonal Principal Components (PC) T(n,a). A Multiple Linear Regression model is then built relating the scores of the PCs (independent variables) to the property of interest y(n). The main difficulty of this method is to choose the number of PCs that have to be retained. This was done here by means of Leave One Out (LOO) Cross Validation (CV). The predictive ability of the model is estimated at several complexities (models including 1, 2, … PCs) in terms of Root Mean Square Error of Cross Validation (RMSECV). RMSECV is defined as :

RMSECV = \sqrt{ \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 / n }     (1)

where n is the number of calibration objects, y_i the known value of the property of interest for object i, and \hat{y}_i the value of the property of interest predicted by the model for object i. The complexity leading to the smallest RMSECV is considered as optimal in a first approach. In a second step, in order to avoid overfitting, more parsimonious models (smaller complexities, one or more of the last selected variables are removed) are tested to determine whether they can be considered as equivalent in performance. The slightly worse RMSECV can in that case be compensated by a better robustness of the resulting parsimonious model. This is done using a randomisation test [2,3]. This test is applied to compare a prediction method at two different complexities. In the usual PCR [4], the variables are introduced into the model according to the percentage of spectral variance (variance in X) they explain. This is called top-down PCR. But the PCs explaining the largest part of the global variance in X are not always the most related to y. PCR with variable selection (PCRS) was used in our study.
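As an illustration of the complexity-selection step described above, a minimal leave-one-out RMSECV scan for top-down PCR is sketched below in Python/NumPy. This is only a schematic reconstruction, assuming mean-centering as the sole pre-treatment; the function name pcr_rmsecv is hypothetical and this is not the Matlab code actually used in the study.

```python
import numpy as np

def pcr_rmsecv(X, y, max_pcs):
    """Leave-one-out RMSECV of top-down PCR for complexities 1..max_pcs.
    Illustrative sketch: mean-centering is assumed as the only pre-treatment."""
    n = X.shape[0]
    press = np.zeros(max_pcs)                    # prediction error sum of squares
    for i in range(n):                           # leave object i out
        keep = np.arange(n) != i
        Xc, yc = X[keep], y[keep]
        x_mean, y_mean = Xc.mean(axis=0), yc.mean()
        U, s, Vt = np.linalg.svd(Xc - x_mean, full_matrices=False)
        T = U * s                                # scores of the calibration objects
        t_new = (X[i] - x_mean) @ Vt.T           # scores of the left-out spectrum
        for a in range(1, max_pcs + 1):          # top-down: the first a PCs
            b, *_ = np.linalg.lstsq(T[:, :a], yc - y_mean, rcond=None)
            press[a - 1] += (y_mean + t_new[:a] @ b - y[i]) ** 2
    return np.sqrt(press / n)                    # eq. (1), one value per complexity

# rmsecv = pcr_rmsecv(X, y, max_pcs=10)
# The complexity with the smallest RMSECV is the first candidate; the randomisation
# test is then used to check whether a more parsimonious model is equivalent.
```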
In PCRS, the PCs are included in the model according to their correlation [5] with y, or their predictive ability [6]. 2.2 - Partial Least Squares Regression Similarly to PCR, PLS [7] reduces the data to a small number of latent variables. The basic idea is to focus only on the systematic variation in X that is related to y. PLS maximises the covariance between the spectral data and the property to be modelled. De Jong’s modified version [8] of the original NIPALS [9,10] algorithm was used in this study. In the same way as for PCR, the optimal complexity is determined by comparing the RMSECV obtained from models with various complexities. To avoid overfitting, this complexity is then confirmed or corrected by comparing the model leading to the smaller RMSECV with the more parsimonious ones using a randomisation test. 218 Chapter 4 – New Types of Data : Structure and Size 2.3 - Fast PCR and PLS algorithms The fast algorithms are based on the Lanczos decomposition scheme [11,12,13]. The Lanczos method is a way to efficiently solve eigenvalue problems. It has its fast convergence properties when applied to a large, sparse and symmetric matrix A. The method generates a sequence of tridiagonal matrices T. These matrices have the property that their extreme eigenvalues are progressively better estimates of the extreme eigenvalues of A. The method is therefore useful when only a small number of the largest and/or smallest eigenvalues of A are required. This is the case in calibration methods where information present in a large X matrix has to be compressed to a small number of PCs. In the present case, the decomposition scheme is applied on A = X’ X. The speed improvement is achieved only if T is much smaller than A. In this case, the Singular Value Decomposition (SVD) of T is much faster than the one of A, nevertheless leading to very similar eigenvalues. Two parameters have to be optimised when performing a Lanczos-based SVD. The size of the small tridiagonal matrix T has to be set. It corresponds to the number of Lanczos base vectors that have to be estimated (nl). The number of factors (PCs) that have to be extracted (nf) also has to be set, considering that nf ≤ nl. These parameters were estimated in two different ways. The first is based on LOO-CV, that was used to optimise first the size of the Lanczos basis (nl ) and then the number of eigenvectors extracted from the resulting matrix (nf). A less time consuming approach was also used. The iterations of the Lanczos algorithms were stopped before the loss of orthogonality between successive base vectors becomes important enough to require special corrections. This behaviour of the Lanczos algorithm is well known and leads to rounding errors that greatly affect the outcome of the method. The size of the Lanczos basis (nl) being set this way, the number of factors to be extracted from the resulting matrix (nf) was estimated based on model fit, by estimating how much each individual eigenvector contributes to the model of the property of interest. The model optimised through the CV procedure is called PCRL (L stands for Lanczos), and the models obtained through the other approach is called PCRF (F stands for Fast). The PLS versio n of the fast algorithms is presented by the authors of the original article as a special case in which the full space of eigenvectors generated in the Lanczos basis is used, leading to nl = nf. The obtained models are denoted by PLSF. 
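To make the role of the Lanczos step more tangible, the sketch below (Python/NumPy) builds the small tridiagonal matrix T from A = X'X without ever forming A explicitly, and then extracts approximate PC loadings from T. It is a schematic illustration of the idea only, using full reorthogonalisation for simplicity; the function names and the stopping criterion are assumptions, not the published algorithm of Wu and Manne.

```python
import numpy as np

def lanczos_tridiag(X, nl, seed=0):
    """Return an nl x nl tridiagonal matrix T (and the Lanczos basis Q) whose extreme
    eigenvalues approximate those of A = X'X. X is assumed to be column-centred."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    Q = np.zeros((p, nl + 1))
    alpha, beta = np.zeros(nl), np.zeros(nl)
    q0 = rng.standard_normal(p)
    Q[:, 0] = q0 / np.linalg.norm(q0)
    for j in range(nl):
        w = X.T @ (X @ Q[:, j])                   # A q_j, applied implicitly through X
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]
        alpha[j] = Q[:, j] @ w
        w -= alpha[j] * Q[:, j]
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)  # full reorthogonalisation, for simplicity
        beta[j] = np.linalg.norm(w)
        if beta[j] < 1e-12:                       # the basis cannot be extended further
            break
        Q[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta[:nl - 1], 1) + np.diag(beta[:nl - 1], -1)
    return T, Q[:, :nl]

def fast_pc_loadings(X, nl, nf):
    """Approximate the first nf PC loading vectors of X from the small matrix T."""
    T, Q = lanczos_tridiag(X, nl)
    evals, evecs = np.linalg.eigh(T)              # cheap: T is only nl x nl
    order = np.argsort(evals)[::-1][:nf]          # largest eigenvalues first
    return Q @ evecs[:, order]                    # Ritz vectors used in place of the loadings
```

The scores of the calibration objects are then obtained by projecting X on these approximate loadings, and the regression step proceeds in the same way as in the classical methods.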
2.4 - Neural Network (NN)

In our study Neural Network calibration [14,15] was performed on the X data matrix after it was compressed by means of a PC transformation. The most relevant PCs, selected on the basis of explained variance, are used as input to the NN. The number of hidden layers was set to 1 in our study. The transfer function used in the hidden layer was non-linear (a hyperbolic tangent). The weights were optimised by means of the Levenberg-Marquardt algorithm [16]. A method was applied to find the best number of nodes to be used in the input and hidden layers, based on the contribution of each node [17]. The optimisation procedure of NN also requires the calibration set to be split into a training and a monitoring set in order to avoid overfitting. The last 10 spectra of the calibration set (10 mixtures including the five constituents with various random concentrations) were used as monitoring set since they can be expected to be most representative of future mixtures to be predicted.

3 - Results and discussion

All calculations were performed on a personal computer equipped with an AMD Athlon 600 MHz processor and 256 MB of RAM, in the Matlab environment. The software used was developed in-house, except for the new methods, for which the code provided as annex in the original paper [1] was used. The results used to assess the predictive ability of the methods are given in terms of Root Mean Squared Error of Prediction (RMSEP), which is defined as :

RMSEP = \sqrt{ \sum_{i=1}^{n_t} (\hat{y}_i - y_i)^2 / n_t }     (2)

where n_t is the number of objects in the test set, y_i the known value of the property of interest for object i, and \hat{y}_i the value of the property of interest predicted by the model for object i. The speed of the methods is measured by estimating the number of operations necessary to perform the complete calibration and prediction procedure. This number of operations is estimated using the Matlab 'FLOPS' function, which counts the number of floating-point operations performed, and is expressed in Mflops (millions of operations). Prediction results are given in Tables 1 to 5.

Table 1. Results obtained for product 1.
         RMSEP    Complexity (nl / nf)    Time (Mflops)
PCRS     0.338    - / 6                   430.6
PLS      0.291    6 / -                   73.9
PCRL     0.294    6 / 6                   90.2
PCRF     0.654    6 / 4                   16.4
PLSF     0.294    6 / 6                   15.6
NN       0.144    Topology : 5 - 4 - 1

Table 2. Results obtained for product 2.
         RMSEP    Complexity (nl / nf)    Time (Mflops)
PCRS     0.255    - / 5                   431.0
PLS      0.120    7 / -                   76.1
PCRL     0.213    7 / 6                   93.8
PCRF     0.747    7 / 3                   18.2
PLSF     0.172    7 / 7                   17.6
NN       0.181    Topology : 6 - 4 - 1

Table 3. Results obtained for product 3.
         RMSEP    Complexity (nl / nf)    Time (Mflops)
PCRS     0.118    - / 7                   430.6
PLS      0.123    5 / -                   73.9
PCRL     0.120    6 / 5                   90.2
PCRF     0.319    6 / 4                   16.4
PLSF     0.096    6 / 6                   15.6
NN       0.106    Topology : 6 - 3 - 1

Table 4. Results obtained for product 4.
         RMSEP    Complexity (nl / nf)    Time (Mflops)
PCRS     0.293    - / 6                   430.6
PLS      0.134    7 / -                   73.9
PCRL     0.131    7 / 7                   94.1
PCRF     0.366    7 / 4                   18.9
PLSF     0.131    7 / 7                   17.6
NN       0.131    Topology : 6 - 4 - 1

Table 5. Results obtained for product 5.
         RMSEP    Complexity (nl / nf)    Time (Mflops)
PCRS     0.244    - / 6                   430.6
PLS      0.142    7 / -                   73.9
PCRL     0.186    7 / 6                   93.9
PCRF     0.539    7 / 4                   18.5
PLSF     0.147    7 / 7                   17.6
NN       0.149    Topology : 5 - 4 - 1

The model complexities can seem surprisingly high.
When studying mixtures of five compounds, considering that the mixtures contain no other substances, one expects models with a complexity equal to 4 (one component per compound, reduced by one due to the closure effect). This was indeed the case in a previous study in which the same kind of mixtures was studied [18]. In this new data set, a wider spectral region is used, the instrument has a much higher resolution, and most important of all, the signal-to-noise ratio is far better. This much higher instrumental quality gave access to more of the information present in the data. The data set used here was previously studied [19] and it was found that mixture effects lead to non-linear behaviour. Therefore, the best overall calibration results were obtained with a non-linear method, namely Neural Networks using non-linear transfer functions. NN results are therefore presented in this paper as a benchmark. PCR and PLS models give an illustration of the fact that slight non-linearities can be compensated by the inclusion of extra components [7]. When models with only 4 components are used, all linear methods achieve RMSEPs close to 0.5. The inclusion of extra components greatly improves these results. Since the results are given in terms of RMSEP (therefore calculated on an independent test set), this improvement cannot be attributed to overfitting. In case of overfitting, the RMSECV results would be improved after inclusion of the extra components, but the results obtained for prediction on the test set would not. The quality of the results obtained for the various methods greatly depends on the complexity used. The PCRF method retained from 3 to 4 PCs out of a Lanczos base of 6 to 7 vectors. This method achieves, as expected, RMSEPs around 0.5. Since no extra PCs are included in the model, the non-linear effects are not taken into account, leading to high RMSEP values. The cross-validation followed by randomisation test procedure used for PCRS and PLS led to models retaining 5 to 7 components. The results are better for PLS, which generally uses a slightly higher complexity in this study. The fast PCR optimised by CV (PCRL) led to complexities comparable to PLS, therefore leading to equivalent prediction performances. The best results for PCR/PLS based methods are obtained with PLSF. By using all the components extractable from the Lanczos base, i.e. 6 to 7 components, it yields results comparable to those obtained with Neural Networks. However, NN remains the overall best performing method in terms of prediction quality. The speed of calculations confirms the conclusions of Wu and Manne [1]. The most time consuming method is PCR optimised by CV (PCRS). PLS, although optimised by CV as well, performs about 6 times faster. The fast Lanczos PCR optimised by CV (PCRL) performs almost as fast as PLS. The fastest methods are PCRF and PLSF. They perform about 5 times faster than PLS, which means almost 30 times faster than PCRS.

4 - Conclusions

Neural Networks remain the overall best performing method in terms of prediction on this non-linear data set. The fast PCR and PLS methods based on the Lanczos decomposition were able to achieve at least as good results as their classical equivalents, and these results were obtained considerably faster. However, the PCRF method tended to retain too few components. The PLSF method achieved good results mainly because it retained the full range of components that could be extracted from the Lanczos space.
The Lanczos based PCR optimised by CV gave good results with more parsimonious complexities. The Lanczos approach can therefore be used to speed up calculations, however, crossvalidation seems to remain the method of choice to estimate adequate model complexity. R EFERENCES [1] W. Wu, R. Manne, Chemom. Intell. Lab. Sys. 51, no. 2 (2000) 145-161. [2] H. van der Voet, Chemom. Intell. Lab. Sys. 25 (1994) 313-323. [3] H. van der Voet, Chemom. Intell. Lab. Sys. 28 (1995) 315. [4] T. Naes, H. Martens, J. Chemom. 2 (1998) 155-167. [5]. J. Sun, J. Chemom. 9 (1995) 21-29. [6] J. M. Sutter, J. H. Kalivas, P.M. Lang, J. Chemom. 6 (1992) 217-225. [7] H. Martens, T. Naes, Multivariate Calibration (Wiley, Chichester,1989). [8] S. de Jong, Chemom. Intell. Lab. Sys. 18 (1993) 251-263. [9] D. M. Haaland, E. V. Thomas, Anal. Chem. 60 (1988) 1193-1202. [10] P. Geladi, B. K. Kovalski, Anal. Chim. Acta 185 (1986) 1-17. [11] C. Lanczos, J. Res. Nat. Bur. Stand, 45 (1950) 255-282. [12] G. H. Golub, C. F. Van Loan, Matrix Computations, N orth Oxford Academic, Oxford, 1983. [13] L. N. Trefethen and D. Bau, III, Numerical linear algebra, SIAM, Philadelphia, 1997. [14] F. Despagne, D.L. Massart, The Analyst, 123 (1998) 157R-178R 223 New trends in Multivariate Analysis and Calibration [15] J.R.M. Smits, W.J. Melssen, L.M.C. Buydens, G. Kateman, Chemom. Intell. Lab. Syst., 22 (1994) 165-173. [16] R. Fletcher, Practical Methods of optimization, Wiley, N.Y., 1987. [17] F. Despagne, D.L. Massart, Chemom. Intel. lab. syst., 40 (1998) 145-163. [18] F. Estienne, N. Zanier, P. Marteau, D.L. Massart, Analytica Chimica Acta, 424 (2000) 185-201. [19] N. Zanier, P. Marteau, F. Estienne. In preparation. 224 Chapter 4 – New Types of Data : Structure and Size MULTI -WAY MODELLING OF HIGH-DIMENSIONALITY ELECTRO-ENCEPHALOGRAPHIC DATA Chemometrics and Intelligent Laboratory Systems, 58 (2001) 59-72. F. Estienne , N. Matthijs, D. L. Massart* P. Ricoux D. Leibovici ChemoAC Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium. E-mail: fabi@fabi.vub.ac.be ELF 69000 Lyon, France Image Analysis Group FMRIB Centre Oxford University John Radcliffe Hospital Oxford OX3 9DU, U.K. ABSTRACT The aim of this study is to investigate whether useful information can be extracted from an electroencephalographic (EEG) data set with a very high number of modes, and to determine which model is the most appropriate for this purpose. The data was acquired during the testing phase of a new drug expected to have effect on the brain activity. The implemented test program (several patients followed in time, different doses, conditions, etc …) led to a 6-way data set. After it was confirmed that the exploratory analysis of this data set could not be handled with classical PCA, and it was verified that multi- dimensional structure was present, multi-way methods were used to model the data. It appeared that Tucker 3 was the most suited model. It was possible to extract useful information from this high-dimensionality data. Non-relevant sources of variance (outlying patients for instance) were identified so that they can be removed before the in-depth physiological study is performed. 
* Corresponding author K EYWORDS : Multi-way methods, Tucker Electroencephalography, EEG 225 3, PARAFAC, Exploratory analysis, New trends in Multivariate Analysis and Calibration 1 - Introduction The general aim of this study was to investigate the effect of a new antidepressant drug on the brain activity using electroencephalographic (EEG) data. The scope of the present paper is not to present advances in the field of neuro-sciences, but to show how multidimensional models can efficiently be applied to extract useful information from multi-way data even with high dimensionality (up to 6 modes in this study). The principle of electroencephalography is to give a representation of the electrical activity of the brain [1]. Electroencephalography is mainly used for the detection and management of epilepsy. It is a noninvasive way of detecting structural abnormalities such as brain tumours. It is also used for the investigation of patients with other neurological disorders that sometimes lead to characteristic EEG abnormalities, or like in the present study, to determine the effect of a drug on the brain activity. This activity is measured using metal electrodes placed on the scalp. Even if no general agreement was reached concerning the placement of the elect rodes, most of the laboratories use the so-called International 10-20 system [2]. These measurements lead to electroencephalograms that can be used directly, as in case of abnormality they can present characteristic patterns, or can be treated with Fourier Transform to keep only the numerical values corresponding to the average energy of specific frequency bands. 2 - Experimental The data were acquired during the testing phase of a new antidepressant drug. This test program was a phase II (a small group of healthy volunteers is studied), mono-centric (all the experiments are performed in the same place), placebo-controlled, double blind (neither the patient, nor the doctor know whether the drug or the placebo is being administered) trial. The study was performed on 12 healthy male subjects, and the effect of 4 doses (placebo, 10, 30 and 90 mg) was investigated. This effect was followed in time over a 2-day period (8:00, 8:30, 9:30, 10:00, 10:30, 11:00, 11:30, 12:00 AM, 1:00 and 3:00 PM on the first day, 9:0 0 AM and 9:00 PM on the second day : 12 measurements). The EEGs were measured on 28 leads (augmented 10-20 system) located on the patient scalp (Fig. 1), and were repeated twice. The first measurement was performed in the so-called “resting” condition, where the patient is lying with eyes closed in a silent room. The second measurement was performed in 226 Chapter 4 – New Types of Data : Structure and Size the “vigilance controlled” condition, where the subject is asked to perform simple tasks while the EEGs are acquired. Fig. 1. Augmented 10-20 system, location of the 28 leads on the scalp. Overall, 32256 EEG measurements were performed. Each of the EEG (at a given time, for one of the leads, on one patient, who was administrated a certain dose of the substance, in one measurement condition) was decomposed using the Fast Fourier Transform into 7 energy bands (α1 , α2, β 1 , β 2 , β 3 , δ, θ) commonly used in neuro-sciences [1]. Therefore, only the numerical value corresponding to the average energy of specific frequency bands is taken into account. The data was provided in the form of a table with dimensions (32256 x 7) with no possibility of coming back to the original electroencephalograms. 
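As a purely illustrative sketch of this pre-processing step, the average energy of one EEG epoch in each of the seven bands can be obtained from its Fourier spectrum as follows (Python/NumPy). The band limits and the sampling frequency fs are assumptions made for the example; the exact values used in the clinical study are not given here.

```python
import numpy as np

# Assumed, purely illustrative frequency limits (Hz) for the 7 bands
BANDS = {"delta": (1.5, 6.0), "theta": (6.0, 8.5), "alpha1": (8.5, 10.5),
         "alpha2": (10.5, 12.5), "beta1": (12.5, 18.5), "beta2": (18.5, 21.0),
         "beta3": (21.0, 30.0)}

def band_energies(epoch, fs):
    """Average spectral energy of one EEG epoch in each frequency band."""
    spectrum = np.abs(np.fft.rfft(epoch - epoch.mean())) ** 2   # periodogram of the epoch
    freqs = np.fft.rfftfreq(epoch.size, d=1.0 / fs)
    return {name: spectrum[(freqs >= lo) & (freqs < hi)].mean()
            for name, (lo, hi) in BANDS.items()}

# Applying band_energies to each of the 32256 recorded epochs yields the (32256 x 7) table.
```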
The (32256 x7) table was reorganised into a multi-dimensional array. The resulting matrix is a 6-ways array with dimension (7x12x28x4x12x2). The dimensions (or modes) are described as follows : EEG dimension : 7 EEG bands (α1 , α2, β 1 , β2 , β 3, δ, θ) Subject dimension : 12 patients Spatial dimension : 28 leads Dose dimension : 4 doses (placebo, 10, 30 and 90 mg) Time dimension : 12 EEG measurements over 2 days Condition dimension : 2 measurement conditions (resting and vigilance controlled) 227 New trends in Multivariate Analysis and Calibration The calculations were performed on a Personal Computer with an AMD Athlon 600 MHz CPU and 256 Mega Bytes of RAM. The software used was house made or parts of The N-way Toolbox from Bro and Andersson [3]. The whole study was performed in the Matlab® environment. 3 - Models 3.1 - Unfolding PCA – Tucker 1 Unfolding Principal Component Analysis (PCA) consists in applying classical two-way PCA on the data matrix after it has been unfolded. The principle of unfolding is to consider the multidimensional matrix as a collection of regular 2-ways matrices and to put them next to another, leading to a new 2way matrix containing all the data. It is possible to unfold a 3-way matrix along the 3 dimensions (Fig. 2). Fig. 2. Three possible ways of unfolding a 3-way array X. X(1), X(2) and X(3) are the 2-way matrices obtained after unfolding with preserving the 1st , 2nd and 3rd mode respectively. This results in 3 different matrices X(1), X(2) and X(3) in which modes 1, 2 and 3 are respectively preserved. The score matrices obtained building a PCA model on each of those 3 matrices, respectively 228 Chapter 4 – New Types of Data : Structure and Size called A, B and C, are the output of a Tucker 1 model. Tucker 1 is considered a weak multidimensional model, as it does not take into account the multi-way structure of the data. The A, B and C matrices are independently built. The Tucker 1 model is a collection of independent bilinear models, and not a multi- linear model. 3.2 - Tucker 3 The Tucker 3 [4,5] model is a generalisation of bilinear PCA to data with more modes. The Tucker 3 model (limited here to a 3-way case for sake of simplicity) can be formulated as in eq. 1. w1 w 2 w3 xijk = ∑∑ ∑ ail bjm ckn glmn (1) l =1 m = 1 n = 1 where x ijk is an (lxmxn) multidimensional array, w1 , w 2 and w 3 are the number of components extracted on the 1st, 2nd and 3rd mode respectively, a, b, and c are the elements of the A, B and C loadings matrices for the 1 st , 2 nd and 3rd mode respectively, and g are the elements of the core matrix G. The information carried by these matrices is therefore of the same nature as the information contained in the equivalent matrices of the Tucker 1 model. The difference comes from the fact that these matrices are built simultaneously during the Alternating Least Squares (ALS) fitting process of the model in order to account for the multidimensional structure. Tucker 3 is a multi-linear model. Moreover, the G matrix defines how individual loading vectors in the different modes interact. This information is not available in the Tucker 1 model. The Tucker 3 model can also be seen in a more graphical way as shown in figure 3, it appears as a weighted sum of outer products between the factors stored as columns in the A, B and C matrices. 229 New trends in Multivariate Analysis and Calibration Fig. 3. Representation of the Tucker 3 model applied to a 3-way array X. 
A, B and C are the loadings corresponding respectively to the 1st , 2nd and 3rd dimension. G is the core matrix. E is the matrix of residuals. One of the interesting properties of the Tucker model is that the number of components for the different modes does not have to be the same (as is the case in the PARAFAC model). In Tucker 3, the components in each mode are usually constrained to orthogonality, which implies a fast convergence. A limitation of this model is that the solution obtained is not unique, an infinity of other equivalent solutions can be obtained by rotating the result without changing the fit of the model. 3.3 - Parafac The Parafac model [6,7] is another generalisation of bilinear PCA to higher order data. It can be mathematically described as in eq. 2 : w xijk = ∑ ail bjl ckl (2) l =1 Like Tucker 3, Parafac is a real multi- linear model. It can be considered as a special case of the Tucker 3 model, in which the number of components extracted along each mode would be the same, and the core matrix would contain only non- zero elements on its diagonal. This specific structure of the core makes Parafac models much easier to interpret than Tucker 3 models. The Parafac model can also be seen in a more graphical way as shown in figure 4. 230 Chapter 4 – New Types of Data : Structure and Size Fig. 4. Representation of the Parafac model applied to a 3-way array X. A, B and C are the loadings corresponding to the 1st , 2nd and 3rd dimension. G is the super-diagonal core matrix. E is the matrix of residuals. The most interesting feature of the Parafac model is uniqueness. The model provides unique factor estimates, the solution obtained cannot be rotated without modification of its fit. As components on each mode are not constrained to orthogonality, the convergence is usually quite slower than observed with the Tucker 3 model. 4 - Results and discussion 4.1 - Linear and bi-linear models Because of the nature of the data set, it was very difficult to explore it visually the way it is usually done for instance with spectral data. In order to get a better insight of the data, some averages were computed directly from the original variables. This corresponds to building simple linear models. The global average (on patients, doses and conditions) for the energy bands can then be displayed on a map of the brain for each of the measurement times. It is then possible to see in a rough way the evolution of the activity of the brain as a function of time and of the location in the brain (Fig. 5). 231 New trends in Multivariate Analysis and Calibration Fig. 5. Original data (averaged on patients, doses, conditions, and the 7 energy bands) displayed, for each of the measurement times, on a grid representing the electrodes locations. Dark zones indicate low activity. It can be seen that the activity of the brain seems to globally increase to reach a maximum at time 6 (11AM, first day). The activity seems to increase mainly in the back part of the brain. The plot corresponding to time 11 (9AM, second day), shows that the state of the brain seems to be similar on the first and second day at equivalent times. Studying such plots for individual energy bands shows that the different bands are not all present and varying in the same parts of the brain (i.e. some are more present and active in the front or back part of the brain). Classical two-way PCA can also be used to explore this data set. bi- linear models are then constructed. 
The intensities of the 7 energy bands are considered as variables, and the 32256 measurement conditions as objects. The PCA results (Fig. 6) show that there is some structure in the data. Points of the score plot corresponding to an individual patient are located in relatively well-defined areas. The same thing can be observed for points corresponding to a certain electrode or dose. However, the results are too complex to be readily interpretable, and justify the use of multi-way methods to explore this data set. 232 Chapter 4 – New Types of Data : Structure and Size Fig. 6. Results of PCA on the (7 x 32256) matrix : scores on PC1 versus scores on PC2 . Points corresponding to patient #9 are highlighted. 4.2 - Assessing multi-linear structure Many data sets can be arranged in a multi- way form. This does not mean that multi-way methods should be applied on such data sets as using such methods makes sense only if multi- linear structure is present in the data. For instance, if slices of a three-way array are completely independent, no structure (or correlation) is present along this mode, and multi-way methods should not be used. Two-way PCA can be used to ensure that some multi-dimensional structure is actually present in the data. The data can be reduced to a smaller dimensionality (smaller number of modes) array by extracting parts of the array corresponding to one element of a given mode. For instance, considering only patient #11, the 30 mg dose, and the Resting condition, the resulting matrix is a 3-way array with dimension (28x12x7). Only the spatial, time, and variable dimensions are then taken into account. This matrix has to be unfolded before ordinary PCA can be performed. If the data is unfolded preserving the first dimension, the resulting matrix will have dimension (28x(12x7)). The scores of a PCA model performed on this data give information about the 28 electrodes, and the loadings give simultaneously information about the time and the variables, 12 repetitions (one per time) of the information about the 7 variables are expected. It is verified that there is a structure remaining in the loadings of the PCA model (Fig. 7). 233 New trends in Multivariate Analysis and Calibration Fig. 7. Loadings on PC1 for a (28 x 12 x 7) model (patient #11, 30 mg dose, resting condition). The loadings for each variable globally vary following a common time profile. This is an indication of a dimensional structure between the time and variables dimensions in the data used. A (7x(28x12)) array can also be obtained by rearranging the previous matrix. This time, the loadings show combined information about the electrodes and time dimension. The plot shows 12 repetitio ns (one per time) of the 28 electrodes. It can be observed that the loading values of the electrodes once again globally follow a time profile, indicating that there is some multi- way structure relating these two modes. Considering only the part of the data set corresponding to patient #11 and the resting condition leads to a 4-way array with dimension (28 x7x12x4). The loadings of the PCA model built on this array unfolded preserving its first mode should give information about variables, time, and doses simultaneously. A structure due to the dose dimension is visible (Fig. 8). Dose 3 (30mg) seems to be standing out. 234 Chapter 4 – New Types of Data : Structure and Size Fig. 8. Loadings on PC1 for a (28 x 7 x 12 x 4) model (patient #11, resting condition). 
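The unfolding-plus-PCA check described in this section can be sketched as follows (Python/NumPy). The array is a random stand-in with the dimensions of the patient #11 sub-array, and the helper name unfold is an assumption; the point is only to show which loadings are inspected.

```python
import numpy as np

def unfold(X, mode):
    """Matricise a multi-way array, keeping `mode` as the row (object) dimension."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

# Stand-in for the (28 leads x 12 times x 7 bands) sub-array of one patient/dose/condition
X_sub = np.random.default_rng(0).random((28, 12, 7))

Xu = unfold(X_sub, 0)                      # (28 x 84): the electrodes are the objects
Xu = Xu - Xu.mean(axis=0)                  # column centering
U, s, Vt = np.linalg.svd(Xu, full_matrices=False)
loadings_pc1 = Vt[0].reshape(12, 7)        # 12 repetitions (one per time) of the 7 band loadings
# If the 12 rows of loadings_pc1 follow a common profile, the time and band modes are
# related, i.e. some multi-linear structure is present between these modes.
```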
4.3 - Multi-linear models optimization The Parafa c model should preferably be used as its simplicity makes the interpretation of the results easier and also because of its uniqueness property. However, it has first to be investigated whether the data can be modelled with Parafac. This verification can be performed using the Core Consistency Diagnostic [8]. This approach is used to estimate the optimal complexity of a Parafac model (or any other model that can be considered as a restricted Tucker 3 model). It can be seen as building a Tucker 3 model with the same complexity as the Parafac model and with unconstrained components and analysing its core. In practice, the core consistency diagnostic is performed by calculating a Tucker 3 core matrix from the loading matrices of the Parafac model. If the Parafac model is valid and optimal in terms of complexity, the core matrix of this Tucker 3 model, after rotation to optimal diagonality, should contain no significant non-diagonal element. The data was first restricted to simpler 3-way cases, and 3-way Parafac models were built. For instance, in the case of models built for data restricted to one patient, one condition, and one dose, the dimensions modelled are the spatial dimension (position of the electrodes), the time dimension, and the variables dimension. In all cases studied here, a 2 components Parafac model was always optimal. However, the performances of the Parafac models depended greatly on the patient studied. For patient #6, for instance (Fig. 9), the model is much better than for patient #11 (Fig. 9). 235 New trends in Multivariate Analysis and Calibration Fig. 9-a. Core Consistency Diagnostic for Parafac models built on 3-way data. Patient 6, Resting condition, 30 mg dose. Fig. 9-b. Core Consistency Diagnostic for Parafac models built on 3-way data. Patient 11, Resting condition, 30 mg dose. This indicates that the data do not seem to follow a Parafac model, or at least the modelling is not easy, the data can therefore not be fit adequately by this model. By increasing the number of dimensions modelled, it was verified that a Parafac model is probably not appropriate for this data set. In order to assess the validity of the Parafac model on a data set, it is also useful to estimate the fit of both Tucker 3 and Parafac models in order to evaluate if the larger flexibility of the Tucker model leads to a significant improvement in the fit. The fit of the 2 components Parafac model and the (222222) 6-way Tucker 3 model (2 components extracted on each of the 6 modes) are actually almost identical (around 93.5% of explained variance). However, this complexity does not seem to be optimal at all in the case of the 6-way Tucker 3 model. In order to keep computation time reasonable, the optimal complexity of 236 Chapter 4 – New Types of Data : Structure and Size the 6-way Tucker 3 model was evaluated (Fig. 10) taking into account only a number of components quite close to 2. Fig. 10. Variance explained by the Tucker 3 models as a function of the model complexity. The complexity was therefore investigated only from (111111) to (333333). 
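A schematic version of such a complexity scan is given below. The Tucker model is fitted here with a simple higher-order orthogonal iteration written in Python/NumPy as a stand-in for the N-way toolbox routines that were actually used; the array is random, and the candidate complexities and iteration counts are illustrative assumptions.

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_dot(X, U, mode):
    """Multiply the array X by the matrix U along the given mode."""
    Xm = np.moveaxis(X, mode, 0)
    Y = (U @ Xm.reshape(Xm.shape[0], -1)).reshape((U.shape[0],) + Xm.shape[1:])
    return np.moveaxis(Y, 0, mode)

def tucker(X, ranks, n_iter=30):
    """Tucker model fitted by higher-order orthogonal iteration (an ALS scheme with
    orthogonal loadings). Returns the core array and the loading matrices."""
    factors = [np.linalg.svd(unfold(X, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]                     # HOSVD initialisation
    for _ in range(n_iter):
        for m in range(X.ndim):
            Y = X
            for k in range(X.ndim):
                if k != m:
                    Y = mode_dot(Y, factors[k].T, k)             # project on the other modes
            factors[m] = np.linalg.svd(unfold(Y, m), full_matrices=False)[0][:, :ranks[m]]
    core = X
    for k in range(X.ndim):
        core = mode_dot(core, factors[k].T, k)
    return core, factors

def explained_variance(X, core, factors):
    rec = core
    for k, A in enumerate(factors):
        rec = mode_dot(rec, A, k)                                # reconstruct X from the model
    return 1.0 - np.sum((X - rec) ** 2) / np.sum(X ** 2)

# Hypothetical 6-way array ordered as (band, patient, lead, dose, time, condition)
X6 = np.random.default_rng(0).random((7, 12, 28, 4, 12, 2))
for ranks in [(2, 2, 2, 2, 2, 1), (2, 2, 2, 2, 2, 2), (3, 3, 3, 2, 2, 1), (3, 3, 3, 3, 3, 2)]:
    core, factors = tucker(X6, list(ranks))
    print(ranks, round(explained_variance(X6, core, factors), 3))
# The full grid from (1,1,1,1,1,1) to (3,3,3,3,3,3) can be scanned in the same way.
```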
It appeared that the optimal complexity is (333221), which can be detailed as follows : EEG dimension 3 components Subject dimension 3 components Spatial dimension 3 components Dose dimension 2 components Time dimension 2 components Condition dimension 1 components This complexity corresponds to the beginning of the last plateau on the curve (more exactly in this case on a part of the curve just after a significant reduction of the slope). The model is on purpose not chosen to be parsimonious as it would for instance have been possible to select the complexity corresponding to the beginning of the plateau containing the (222222) model. It is however always possible to discard some components from the model if it appears from the interpretation of the core that they are not useful in the reconstruction of the original matrix X. 237 New trends in Multivariate Analysis and Calibration 4.4 - 6-way Tucker 3 model The 6-way Tucker 3 model leads to a core array G with dimensions (3x3x3x2x2x1) and six component matrices A,B,C,D,E,and F related each to one of the modes. 4.4.1 - Loadings on the variable dimension The first matrix A holds the loadings for the EEG dimension (7 EEG bands). By calculating from the original data the average energy (over the five other modes) of each frequency band, it can be seen (Fig. 11-a) that the first component is used to describe the average energy of the bands. The second component, as well as the third one (Fig. 11-b), will at this stage be interpreted as showing the effect of some other parameters (time or effect of the substance) on the distribution of the bands. Fig. 11-a. Loadings on the variable dimension, 6-way model with complexity (3 3 3 2 2 1). A(1) versus A (2). The mean energies of the bands are also given. A(2) A(1) 238 Chapter 4 – New Types of Data : Structure and Size Fig. 11-b. Loadings on the variable dimension, 6-way model with complexity (3 3 3 2 2 1). A (2) versus A (3). A(3) A(2) 4.4.2 - Loadings on the patient dimension The second matrix B holds the loadings for the patients dimension (12 patients). The main information in the loading plots is that some extreme values are present. Patient #6 appears as an extreme value on component 1 (Fig. 12-a). Patient #11 appears as an outlier on component 3 (Fig. 12-b). At this stage, without looking at the core array G in order to remove the rotational indeterminacy of the Tucker 3 model, it is not possible to go further in the discussion about this matrix. 239 New trends in Multivariate Analysis and Calibration Fig. 12-a. Loadings on the patient dimension, 6-way model with complexity (3 3 3 2 2 1). B(1) versus B(2). B(2) B(1) Fig. 12-b. Loadings on the patient dimension, 6-way model with complexity (3 3 3 2 2 1). B(2) versus B(3). B(3) B(2) 4.4.3 - Loadings on the spatial dimension The third matrix C holds the loadings for the spatial dimension (28 electrodes). The first remarkable thing in the plot of C (1) versus C(2) is the symmetry of the loadings (Fig. 13-a). All electrodes that are symmetrical on the brain (Fig. 1), for instance electrodes #17 and 20 appear very close to each other on the loading plot. Moreover, considering all these pairs of symmetrical electrodes, the one located on the right part of the brain appears to have systematically higher loading values. For instance, electrode #20 has higher loadings values than electrode #17. This rule holds for all of the pairs of electrodes, except for electrodes #12 and 16. 
It will be established when interpreting the core matrix that this is due to a 240 Chapter 4 – New Types of Data : Structure and Size specific problem with one of these leads for one of the patients. If the loading values on component 1 are reported on the map of the electrodes on the brain, a representation of the activity of the brain is obtained (Fig. 13-b), it looks very similar to what was obtained with linear models in the data exploration part (Fig. 5). C(2) Fig. 13-a. Loadings on the spatial dimension, 6-way model with complexity (3 3 3 2 2 1). C(1) versus C (2). C(1) Fig. 13-b. Loadings on the spatia l dimension, 6-way model with complexity (3 3 3 2 2 1). Ranking of the electrodes on C(1) reported on the map of the brain. If the second component of the C matrix is now considered (Fig. 13-c), and the loading values are reported on the map of the electrodes on the brain, a clear separation between the front and back part of the brain can be observed (Fig 13-d). Considering directions in the plots, a central part of the brain can be identified. These patterns are interpreted as showing the activity of the substance on different parts of the brain. 241 New trends in Multivariate Analysis and Calibration Fig. 13-c. Loadings on the spatial dimension, 6-way model with complexity (3 3 3 2 2 1). C(1) versus C(2). C(2) C(1) Fig. 13-d. Loadings on the spatial dimension, 6-way model with complexity (3 3 3 2 2 1). Patterns on the loading plots are reported on the map of the brain. It is important to note that, at this stage, only with the information present in the loading matrices, it is not possible to know whether the high loadings on C(1) for the central part of the brain mean high or low activity. A basic knowledge of brain physiology indicates that this indeed corresponds to high activity. It is however necessary to get rid of the rotational indeterminacy of the Tucker3 model by interpreting the core matrix to extract this information from the model. 242 Chapter 4 – New Types of Data : Structure and Size 4.4.4 - Loadings on the dose dimension The first component on the dose dimension D(1) can be interpreted quite easily (Fig. 14). It shows that 10mg is quite close to Placebo, indicating that this dose is not efficient. 90mg is more different compared to Placebo indicating a better effect of this dose, and the most different is 30mg. This can appear surprising, but the medical doctors in charge of the study expected this result. The higher dose does not systematically lead to the higher effect with this kind of substances. The second dimension, making a difference between 30mg and the other doses is much more difficult to interpret at this stage, but the phenomenon will be explained when interpreting the core matrix G. Fig. 14. Loadings on the dose dimension, 6-way model with complexity (3 3 3 2 2 1). D(1) versus D (2). D(2) D(1) 4.4.5 - Loadings on the time dimension The first component on the time dimension E(1) shows the normal time profile of the evolution of the state of the brain during day time (Fig. 15). The activity globally increases from 8AM (time 1) until 11AM (time 6). This would of course still have to be confirmed by removing the rotational indeterminacy using G, but it already fits what was seen in the linear data exploration part (Fig. 5). Afterwards, the activity reduces. The loading value for the second day at 9AM (point 11) is located between the ones corresponding to 8:30AM and 9:30AM in the first day, confirming this interpretation. 
The second dimension is interpreted as showing the time profile of the effect of the drug activity. It has to be specified that the drug was administered immediately after 8:30 AM (time 2). No effect of the substance can therefore be expected before 9:30 AM (time 3). The loadings on component 2 are indeed negative before 8:30 AM and become positive from 9:30 AM on, regularly increasing until 11:30-12:00 AM. After 12:00, the activity drops and becomes zero (no activity, same negative loading values as before the administration of the drug), and stays at this level during the second day.

Fig. 15. Loadings on the time dimension, 6-way model with complexity (3 3 3 2 2 1). E(1) versus E(2).

4.4.6 - Loadings on the condition dimension

The last component matrix F gives information about the two different measurement conditions. It is in fact a vector, as only one component was extracted along this mode. The loading values are 0.701 for the resting condition and 0.713 for the vigilance controlled condition. The loadings are positive for both conditions, which indicates that, when interpreting the model, this mode can only have a scale effect. This means that the effect of the drug can only be larger or smaller depending on the condition, but one cannot expect to see opposed effects due to this parameter. The loading values for each condition are also very similar. This indicates that the two conditions do not imply any effect on the brain activity that is significant for the model. This dimension was further investigated. 5-way models were built on data taking into account one of the conditions, the other condition, and the average of the data in the two conditions. All these models gave almost perfectly identical results, showing that the two conditions can in fact be considered as replicates of the same 5-way data set. This mode is therefore not relevant in the data set.

4.4.7 - The core matrix G

The important elements of the core are shown in Table 1, together with their squared value (which represents the relative importance of the core element) and the variance explained by these elements.

Table 1. Important core elements of the 6-way model with complexity (3 3 3 2 2 1).

     Core element          Explained variance (%)    Core value    Squared core value
1    (1, 1, 1, 1, 1, 1)    95.95                     4702.23       22111057.12
2    (2, 2, 1, 1, 1, 1)    1.63                      613.57        376475.69
3    (2, 1, 2, 1, 1, 1)    0.60                      374.71        140414.91
4    (1, 3, 1, 2, 1, 1)    0.39                      -301.13       90682.65
5    (1, 3, 1, 2, 2, 1)    0.23                      -234.06       54788.31
6    (3, 3, 1, 1, 1, 1)    0.20                      -214.79       46137.64
7    (3, 1, 2, 1, 1, 1)    0.18                      -208.41       43438.22
8    (1, 2, 2, 1, 1, 1)    0.09                      150.35        22605.75
9    (1, 2, 1, 2, 2, 1)    0.09                      -146.34       21417.87
10   (1, 3, 1, 1, 2, 1)    0.08                      -137.65       18947.96

By building symbolic products as described by Henrion [9], it is possible to overcome the rotational indeterminacy of the model and interpret the first elements of the core. The first element of the core explains most of the variance and reflects the normal evolution of the activity of a human brain during daytime, showing which bands are the most present and how their intensity evolves in time. Even if the corresponding core values are very low (which is not surprising, as phenomena with very small magnitude are investigated, compared to, for instance, the difference between two patients), the next elements also bring very relevant information. One of the most interesting elements in this core matrix is element #4.
It shows that B(3) , third component on the patient mode and D(2) , second component on the dose mode interact. It can be reminded that B(3) differentiates between patient #11 and the other patients, spotting him as an outlier (Fig. 12-b). It was also seen that D(2) differentiates between the 30 mg dose and the other doses (Fig. 14). This core element shows that patient #11 is an outlier due to an over-reaction to the most efficient dose. This interpretation was confirmed studying a 5-way model restricted to patient #11. In this model, the 30 mg dose appeared to be even more extreme than on the 6-way model. In the same way, mainly starting from the loading plots of the patient dimension, and looking for extreme points, it was possible to find core elements explaining very small amounts of the total variance of the system, but representative for special behaviours of specific patients. Core element 245 New trends in Multivariate Analysis and Calibration #7, for instance, relates B(1), the first component on the patients mode (showing patient #6 as an outlier), to A(3) , third component on the EEG mode (differentiating α 2 from the other energy bands). This core element accounts for a specific repartition of the energy bands for patient #6. This was confirmed by investigating a 5-way model restricted to this patient. On this model, the distribution of energy bands showed in particular extremely high values of the α bands. The special behaviour of electrode #12 compared to its symmetrical on figure 13-a can be explained by focusing on patient #9. All measurements on this patient have an extreme value for electrode #12. This was confirmed by studying a 5-way model restricted to this patient, clearly differentiating electrode #12 from the others. It can be seen that the energy values for this electrode are wrong, the high-energy bands (especially β 2 and β 3 ) are strongly over-estimated. This happens for all the measurements performed on this patient for the 90mg dose (which also corresponds to a certain period in time, as the doses are tested successively with a ‘wash-out’ period between each dose). This systematic and very localised problem seems to indicate that the corresponding electrode was either damaged or badly installed on the scalp during this part of the data acquisition. 4.5 - Analyzing subject variability Since many of the core elements seemed to be used only to account for specific behaviours of individual patients, it was decided to study more thoroughly the patients mode. The idea was to simplify the problem by removing the non-typical patients. This way, the number of relevant core elements should be reduced, as well as the optimal complexity of the model. For this purpose, it was decided to center the patients mode in order to highlight the differences between patients, and hopefully identify easily the suspected outliers. Moreover, as it was shown to be not relevant, the 6th dimension (related to the two measurement conditions) was collapsed. The average of the two conditions was used leading to a simpler 5-way array. The plots of the loading matrix B obtained for the patient dimension show that outliers already spotted with the 6-way model appear now much more clearly (Fig. 16-a,b). 246 Chapter 4 – New Types of Data : Structure and Size Fig. 16-a. Loadings on the patient dimension, 5-way model with complexity (3 3 3 2 2). B(1) versus B(2). B(2) B(1) Fig. 16-b. Loadings on the patient dimension, 5-way model with comple xity (3 3 3 2 2). B(2) versus B (3). 
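A minimal sketch of the two pre-treatments described above (averaging over the two conditions and centering across the patient mode), assuming the mode ordering (band, patient, lead, dose, time, condition) and a random stand-in array, is the following (Python/NumPy):

```python
import numpy as np

X6 = np.random.default_rng(0).random((7, 12, 28, 4, 12, 2))   # stand-in for the real 6-way array

X5 = X6.mean(axis=5)                          # collapse the condition mode (the two conditions are replicates)
X5c = X5 - X5.mean(axis=1, keepdims=True)     # center across the patient mode to highlight between-patient differences
```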
Patient #11 appears as an outlier already on component 2, while patient #6 (and also perhaps #2) is extreme on component 1, and patient #9 seems to be atypical on component 3. This shows that the centering of this mode succeeded in enhancing the differences between patients. The core matrix also gave interesting information (Table 2).

Table 2. Important core elements of the 5-way model with complexity (3 3 3 2 2).

     Core element       Explained variance (%)    Core value    Squared core value
1    (1, 1, 1, 1, 1)    74.66                     -879.87       774171.86
2    (2, 3, 1, 1, 1)    8.69                      300.26        90160.43
3    (1, 2, 1, 2, 1)    3.34                      186.20        34671.54
4    (2, 2, 1, 2, 1)    1.99                      -143.95       20723.73
5    (1, 2, 1, 2, 2)    1.75                      134.90        18200.44
6    (3, 2, 1, 1, 1)    1.70                      132.82        17641.84
7    (2, 2, 1, 2, 2)    1.37                      -119.46       14272.36
8    (1, 2, 2, 2, 1)    0.82                      -92.30        8520.95
9    (1, 2, 1, 1, 2)    0.78                      90.08         8114.57
10   (2, 2, 1, 1, 2)    0.74                      -87.65        7683.06

First, as an important source of variance was previously reduced by centering the data, the total variance explained by the 5-way Tucker 3 model was, as could be expected, much smaller (from 94.9 % for the 6-way model to 68.8 % for the 5-way model centered on the patients dimension). The explained variance is also much more evenly distributed between the core elements, which is logical as the variance of the system is less dominated by the differences between patients. It is also obvious that the complexity of the model could be much reduced. This is especially true for the spatial (3rd) and time (5th) modes, where 2 components might suffice.

5 - Conclusion

Multi-way models, in particular Tucker 3, were used on data with a high number of modes. It was shown that this multi-way model was able to extract meaningful information from this very complex data set, whereas classical PCA brought no usable information. Each mode could be interpreted, and the core matrix made it possible to understand the relations between modes. Since it was established that some atypical patients made the modelling and the interpretation of the results much more complicated, the second part of this study, aiming at interpreting the anatomical results of the models in detail, will be performed with these patients removed from the data set. Since some major sources of variance will be removed from the data, the optimal complexity of the models will have to be investigated in detail again. Another interesting point is that the performance of the Parafac model seemed to depend very much on the behaviour of the patients; it will therefore be interesting to evaluate the modelling abilities of this model on the simplified data set. The results of this second part of the study will be presented in a forthcoming publication. It is nevertheless already possible to say that the optimal complexities of the models established on the simplified data set are indeed much lower. The simplified data set also happens to conform much better to a Parafac model. This model can therefore be used, which will hopefully enable an easier interpretation of the results.

REFERENCES
[1] M. J. Aminoff, Electrodiagnosis in Clinical Neurology, third edition, Churchill Livingstone, Edinburgh (1987).
[2] H. H. Jasper, Report of the committee on methods of clinical examination in electroencephalography, Electroencephalogr. Clin. Neurophysiol., 10 (1958) 370.
[3] C. A. Andersson, R. Bro, Chemom. Intell. Lab. Sys., 52 (2000) 1-4.
[4] L. R. Tucker, Psychometrika, 31 (1966) 279-311.
[5] P. M.
Kroonenberg, Three- mode Principal Component Analysis. Theory and Applications, DSWO Press, Leid en (1983). [6] R. Harshman, UCLA working papers in phonetics, 16 (1970) 1-84. [7] J. D. Carrol, J. Chang, Psychometrika, 45 (1970) 283-319. [8] R. Bro, H.A.L. Kiers, In press, J. Chemom. (2001). [9] H. Henrion, Chemom. Intell. Lab. Sys., 25 (1994) 1-23. 249 New trends in Multivariate Analysis and Calibration ROBUST VERSION OF TUCKER 3 M ODEL Chemometrics and Intelligent Laboratory Systems, Vol. 59, 2001, 75-88. V. Pravdova, F. Estienne, B. Walczak*+, D. L. Massart + on leave from : Silesian University Katowice Poland ChemoAC Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium. E-mail: fabi@fabi.vub.ac.be ABSTRACT A new procedure for identification of outliers in Tucker3 model is proposed. It is based on robust initialization of the Tucker3 algorithm using Multivariate trimming or Minimum covariance determinant. The performance of the algorithm is tested by a Monte Carlo study on simulated data sets and also on a real data set known to contain outliers. * Corresponding author K EYWORDS : Multivariate Calibration, Raman spectroscopy, Lanczos decomposition, Fast Calibration methods. 250 Chapter 4 – New Types of Data : Structure and Size 1 – Introduction N-way methods based on the Alternating Least Squares (ALS) algorithm are least squares methods that are highly influenced by outlying data points. One outlying sample can strongly influence the resulting model. As for 2-way PCA and related methods, there are two possibilities to deal with outliers: statistical diagnostics can be used or a robust algorithm can be constructed. Statistical diagnostics tools can be applied to the already constructed models and are usually based on the detection of the 'leverage points', defined as points that are far away from the remaining data points in the model space. This approach does not always work for multiple outliers because of the so-called masking effect. Robust versions of modelling procedures aim at building models describing the majority of data without being influenced by the outlying objects. By data majority we mean the data subset containing at least 51% of objects. Robust procedures are characterized by the so-called breakdown point, defined as a percentage of data objects that may be corrupted while the model still yields the proper estimates. A subset of data, containing no outliers is called 'clean subset'. In the arsenal of chemometrical methods there are already many robust approaches, such as robust PCA, PCR, PLS [1,2,3]. The aim of our study was to construct a robust version of the Tucker3 approach, one of the most popular N -way methods. 2 – Theory 2.1 - N-way methods of data exploration Several methods were proposed for N-way exploratory analysis, for instance CANDECOMP/PARAFAC [4,5] and the family of Tucker models [6,7]. In the present study, only the Tucker3 model is considered. Most of the N-way methods are based on ALS. The principle of ALS is to divide the parameters into several sets and for each set the least squares solution is found conditionally on the remaining parameters. The estimation of parameters is repeated until a convergence criterion is satisfied. Figure 1 shows the decomposition according to the Tucker3 model. The 3-way data matrix X is decomposed into 3 orthogonal loading matrices A (I x L), B (J x M), C (K x N) and the core matrix Z (L x M x N) which describes the relationship among them. 
The largest squared elements of the core matrix Z indicate the most important factors in the model of X. Mathematically, the Tucker3 model can be expressed as

x_{ijk} = \sum_{l=1}^{L} \sum_{m=1}^{M} \sum_{n=1}^{N} a_{il} b_{jm} c_{kn} z_{lmn} + e_{ijk}     (1)

Fig. 1. The Tucker3 model.

2.2 - Data unfolding

For computational convenience, the Tucker3 algorithm used does not perform calculations directly on N-way arrays. The X matrix is unfolded to standard 2-way matrices. This can be done in three different ways (see Fig. 2). The unfolded matrices are denoted as X(I x JK), X(J x IK) and X(K x IJ). To calculate the loading matrices, several procedures can be used. Andersson and Bro [8] tested most of them with respect to speed and found NIPALS to be the fastest for large data arrays. In our algorithm, SVD is used for the estimation of the A, B and C matrices.

Fig. 2. Three different ways of unfolding of a 3-way data matrix.

2.3 - Algorithm of the Tucker3 model

0) - Initialize B and C (as random orthogonal matrices)
1) - [A,v,d] = svd(X(I x JK) (C ⊗ B), L)
2) - [B,v,d] = svd(X(J x IK) (C ⊗ A), M)
3) - [C,v,d] = svd(X(K x IJ) (B ⊗ A), N)
4) - Go to step 1 until the relative change in fit is small
5) - Z = A^T X (C ⊗ B)

where the symbols L, M, N denote the numbers of factors in matrices A, B and C respectively, and the symbol ⊗ denotes the Kronecker product: A ⊗ B replaces each element a_{ij} of A by the block a_{ij}B, i.e.

A ⊗ B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots \\ a_{21}B & a_{22}B & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}

2.4 - Robust PCA

One could think about robust initialization of the ALS algorithm, i.e. finding a clean subset for the matrix X(I x JK), but in reality, as the loading matrices B and C are only just initialized, the resulting matrix (X(I x JK) (C ⊗ B)) of dimensionality I x MN should be taken into account. The clean subset can be determined using such methods as, for instance, Multivariate Trimming (MVT) [11] or the Minimum Covariance Determinant (MCD) [12]. Robust initialization of the Tucker3 algorithm seems to be the most important step in determining the final model, and because this step is placed outside the main loop, the algorithm does not lead to oscillations. In the consecutive steps of the ALS algorithm the clean subset is constructed to decrease an objective function (see eq. 4), so that oscillations are avoided and convergence of the algorithm is achieved.

2.4.1 - Multivariate Trimming (MVT) [11]

The MVT procedure can be used for 'clean' subset selection when the input data matrix contains at least two times more objects than variables. The squared Mahalanobis distance (MD^2) is calculated according to the following equation:

MD_i^2 = (t_i - \bar{t}) S^{-1} (t_i - \bar{t})^T     (2)

where t_i denotes the i-th object, \bar{t} denotes the vector containing the means of the data matrix columns and S is the covariance matrix. A fixed percentage of objects (here 49%) with the highest MD^2 is removed and the remaining ones are used to calculate a new mean and covariance matrix. MD^2 is calculated again for all objects using the new estimates of the mean and covariance matrix. Again the 49% of objects with the highest MD^2 are removed and the process is repeated until convergence of successive estimates of the covariance matrix and mean. The subset of objects for which covariance and mean are stable is considered to be a clean subset of the data.
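A schematic NumPy version of this trimming loop is shown below. It is an illustrative sketch only (the function name, the use of a pseudo-inverse and the iteration cap are assumptions), not the implementation used by the authors.

```python
import numpy as np

def mvt_clean_subset(T, trim=0.49, max_iter=100):
    """Multivariate trimming: iteratively discard the `trim` fraction of objects with the
    largest squared Mahalanobis distance until mean and covariance stabilise.
    T should contain at least twice as many objects (rows) as variables (columns).
    Returns the indices of the retained ('clean') objects."""
    n = T.shape[0]
    keep = np.arange(n)                                   # start from all objects
    for _ in range(max_iter):
        mean = T[keep].mean(axis=0)
        cov = np.cov(T[keep], rowvar=False)
        diff = T - mean
        md2 = np.einsum('ij,jk,ik->i', diff, np.linalg.pinv(cov), diff)   # squared MD, eq. (2)
        new_keep = np.argsort(md2)[: int(np.ceil((1 - trim) * n))]        # keep the best 51%
        if set(new_keep) == set(keep):                    # estimates have stabilised
            break
        keep = new_keep
    return np.sort(keep)
```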
2.4.2 - Minimum Covariance Determinant (MCD) [12]

MCD aims at selecting the subset of h (out of m) objects whose covariance matrix has the smallest determinant, i.e. which occupies the smallest volume in the p-dimensional space, with

h = (m + p + 1)/2   (3)

The MCD algorithm can be summarized as follows:

1) - Randomly select 500 subsets of data containing p+1 objects
2) - For each subset:
 a) calculate its mean and covariance matrix, \bar{t} and S
 b) calculate the Mahalanobis distances for all objects using the estimates of the data mean and covariance matrix obtained in step 2a
 c) sort the Mahalanobis distances and take the h objects with the smallest MD to calculate the next estimate of the mean and covariance matrix
 d) repeat steps b and c twice
3) - Take the 10 best solutions, i.e. the 10 subsets of h objects with the smallest determinants, and for each of them repeat steps b and c until two subsequent determinants are equal
4) - Report the best solution, i.e. the subset with the smallest determinant

The procedure starts with many very small data subsets (containing only p+1 objects) to increase the probability that these subsets do not contain outliers. Only two iterations are performed for all 500 subsets (steps 2b and 2c) to speed up the MCD procedure; as demonstrated by P. Rousseeuw [12], a small number of iterations is sufficient to find good candidate clean subsets. Only for the 10 best subsets are the calculations repeated until convergence of the algorithm.

2.5 - Algorithm for the robust Tucker3 model

To find possible multiple outliers in the first mode of X, the following algorithm is proposed:

0) - Initialize loadings B and C
1) - Calculate X(I x JK)(C ⊗ B) and determine the clean subset (using MVT or MCD)
2) - [A*,v,d] = svd(X(I* x JK)(C ⊗ B), L)
3) - [B*,v,d] = svd(X(J x I*K)(C ⊗ A*), M)
4) - [C*,v,d] = svd(X(K x I*J)(B* ⊗ A*), N)
5) - Z = A*^T X*(C* ⊗ B*)
6) - Predict the loadings A for all objects
7) - Reconstruct X(I x JK): X̂(I x JK) = A Z(L x MN)(C ⊗ B)^T
8) - Calculate the sum of squared residuals for the I objects in the first mode as the differences between the original and the reconstructed data: residuals = sum(((X(I x JK) - X̂(I x JK))^2)^T)
9) - Sort the residuals along the first mode
10) - Find the h objects with the smallest residuals; they constitute the clean subset
11) - Go to step 2 until the relative change in fit is small

A*, X*, etc. are the matrices A, X, etc. limited to the clean subset of objects, and the notation X(I* x JK) means that the unfolded data set contains only the objects of the clean subset I*. h is the number of objects in the clean subset. In each iteration of the ALS subroutine, the loadings A*, B* and C* are calculated for the clean subset of objects only. In step 6 the loadings A are predicted for all objects and the set X(I x JK) is reconstructed with the predefined number of factors. The residuals between the initial X(I x JK) and the reconstructed X̂(I x JK) are calculated and sorted, and the 51% of objects with the smallest residuals are selected to form the clean subset for the next ALS iteration. The objective function F to be minimized is the sum of squared residuals for the h clean objects of the first mode:

F = \sum \sum (X^* - \hat{X}^*)^2   (4)

There is no guarantee that the selected clean subset is optimal, but convergence of the ALS approach is secured. In this algorithm, the outliers are identified in the first mode only, but as all modes are treated symmetrically, one can look for outliers in any mode. This can be done simply by inputting the X matrix with the dimension of interest in the first mode.
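As an illustration of steps 0) to 11), a minimal NumPy sketch of this robust ALS loop is given below for a 3-way array, with outliers searched in the first mode. The helper and function names, the fixed number of iterations and the random initial clean subset are simplifications made for this sketch; in the algorithm above, the initial clean subset comes from MVT or MCD applied to X(I x JK)(C ⊗ B), and iteration stops when the relative change in fit is small.

```python
import numpy as np

def unfold(X, mode):
    """Unfold a 3-way array along `mode`, with a column ordering consistent
    with the Kronecker products used below (e.g. mode 0 pairs with C kron B)."""
    Y = np.moveaxis(X, mode, 0)
    return Y.transpose(0, 2, 1).reshape(Y.shape[0], -1)

def robust_tucker3(X, ranks, clean_frac=0.51, n_iter=50, seed=0):
    I, J, K = X.shape
    L, M, N = ranks
    rng = np.random.default_rng(seed)
    # step 0: random orthogonal initial loadings B and C
    B = np.linalg.qr(rng.standard_normal((J, M)))[0]
    C = np.linalg.qr(rng.standard_normal((K, N)))[0]
    h = int(np.ceil(clean_frac * I))
    # step 1 (simplified): random initial clean subset instead of MVT/MCD
    clean = rng.choice(I, size=h, replace=False)
    for _ in range(n_iter):
        Xc = X[clean]                                   # objects of the clean subset
        # steps 2-4: SVD updates of the loadings on the clean subset
        A_c = np.linalg.svd(unfold(Xc, 0) @ np.kron(C, B), full_matrices=False)[0][:, :L]
        B = np.linalg.svd(unfold(Xc, 1) @ np.kron(C, A_c), full_matrices=False)[0][:, :M]
        C = np.linalg.svd(unfold(Xc, 2) @ np.kron(B, A_c), full_matrices=False)[0][:, :N]
        # step 5: core matrix (L x MN) for the clean subset
        Z = A_c.T @ unfold(Xc, 0) @ np.kron(C, B)
        # steps 6-7: predict A for all objects and reconstruct X(I x JK)
        A = unfold(X, 0) @ np.kron(C, B) @ np.linalg.pinv(Z)
        X_hat = A @ Z @ np.kron(C, B).T
        # steps 8-10: per-object residuals; keep the h best as the new clean subset
        res = ((unfold(X, 0) - X_hat) ** 2).sum(axis=1)
        clean = np.argsort(res)[:h]
    return A, B, C, Z, res
```

The per-object residuals returned as `res` are the quantities from which the standardized residuals of Section 2.5.1 would be computed in order to flag and remove the actual outliers before fitting the final least squares Tucker3 model.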
2.5.1 - Outlier identification

Once the robust Tucker3 model is constructed, the standardized residuals from that model are calculated for all objects of the first mode according to the following equation [10]:

rs_i = res_i / [3 \cdot 1.48 \cdot median((res_i - median(res_i))^2)]   (5)

where

res_i = \sum_j (X_{ij} - \hat{X}_{ij})^2   (6)

for i = 1, …, I and j = 1, …, JK. In eq. 5, the residuals are divided by a robust version of the standard deviation: the quantity 1.48 · median((res_i − median(res_i))^2), computed from the residuals of the 51% of objects which fit the model best, corresponds to a robust estimate of the standard deviation of the data residuals. Objects with standardized residuals higher than 3 times the robust standard deviation are considered as outlying and are removed from the data set. This is equivalent to using the ratio presented in eq. 5 with a cut-off equal to one. The final Tucker3 model is constructed as the least squares model for the data after outlier elimination.

3 - Data

3.1 - Simulated data set

A systematic Monte Carlo study was performed to evaluate the performance of the algorithm. A data set of dimensionality (50 x 10 x 10) was simulated with 2 factors in all modes. Two Tucker3 models were constructed, explaining 90% (X1) and 60% (X2) of the data variance. The initial data sets were then contaminated with different types (T1-T4) and different percentages (20% and 40%) of outliers. The different types of outliers can be characterized as follows:

T1: a data set constructed according to the same model as the initial data, but with a certain percentage of randomly permuted variables
T2: a data set with the same dimensionality and the same level of noise, but constructed according to a different tri-linear model
T3: a data set with the same level of noise but with a higher dimensionality than the initial data set
T4: a data set with the same level of noise but with a lower dimensionality than the initial data set

The simulation of the tri-linear data structure was performed as follows: first, orthogonal loading matrices A, B and C with predefined dimensions were randomly initialized. For the selected structure and core matrix Z, the X matrix was constructed as X(I x JK) = A Z(L x MN) (C ⊗ B)^T. Then, a Tucker3 model was built and a new X was reconstructed with the chosen number of factors in each mode; this reconstruction was used as the initial data set with tri-linear structure. Finally, white Gaussian noise was added to X. In this way, models which differ in the percentage of explained variance, the data complexity and the structure of the core matrix can be constructed.

The following two types of calculations were performed for the 2 data models (X1 and X2), each with the 4 types of outliers (T1-T4) and two percentages of contamination (20 and 40%):

1) One contaminated data set was constructed, and the Tucker3 and robust Tucker3 models were built 100 times with random initialization of the loadings B and C
2) The construction of the Tucker3 and robust Tucker3 models was repeated 100 times for the predefined type and percentage of outliers, but this time the outliers were simulated randomly, according to the chosen type, in each run

The performance of the algorithms is presented in the form of the percentage of unexplained variance of the constructed final models. In the case of the robust Tucker3 approach, the final model is considered to be the Tucker3 model obtained after outlier removal. The MVT procedure was applied in the Monte Carlo study to speed up the calculations.
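The simulation scheme used in this Monte Carlo study can be sketched as follows in NumPy. The random seed, the noise level and the choice of 20% type-T1 contamination are illustrative values for this sketch, not the exact settings used to generate the X1 and X2 sets.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K = 50, 10, 10            # dimensions used in the Monte Carlo study
L, M, N = 2, 2, 2               # 2 factors in each mode

# random orthogonal loadings and a random core matrix
A = np.linalg.qr(rng.standard_normal((I, L)))[0]
B = np.linalg.qr(rng.standard_normal((J, M)))[0]
C = np.linalg.qr(rng.standard_normal((K, N)))[0]
Z = rng.standard_normal((L, M * N))

# noise-free tri-linear structure: X(I x JK) = A Z(L x MN) (C kron B)^T
X = A @ Z @ np.kron(C, B).T

# add white Gaussian noise; its scale controls the percentage of variance
# explained by the 2-factor model (high for X1, lower for X2)
noise_level = 0.3
X = X + noise_level * X.std() * rng.standard_normal(X.shape)

# contaminate 20 % of the objects with type-T1 outliers: same model,
# but with the variables of the selected objects randomly permuted
n_out = int(0.2 * I)
for i in rng.choice(I, size=n_out, replace=False):
    X[i] = rng.permutation(X[i])

# fold back to a 3-way (I x J x K) array before fitting the Tucker3
# and robust Tucker3 models
X3 = X.reshape(I, K, J).transpose(0, 2, 1)
```

The T2 to T4 contaminations would be generated analogously, by drawing the outlying objects from a different tri-linear model or from models of higher or lower dimensionality.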
3.2 - Real data set

An electroencephalographic (EEG) data set was used. The principle of electroencephalography is to give a representation of the electrical activity of the brain [13]. This activity is measured using metal electrodes placed on the scalp. The data were acquired during the testing phase of a new antidepressant drug. The effect of the drug was followed in time over a two-day period (12 measurements). The EEGs were measured on 28 leads located on the patient's scalp. Each EEG was decomposed using the Fast Fourier Transform into 7 energy bands commonly used in neurosciences [14]. Only the numerical values corresponding to the average energy of the specific frequency bands are taken into account. This leads, for each patient, to a 3-way array with dimensions (28 x 7 x 12). The study was performed on 12 patients. Only the results corresponding to two patients are shown here: patient #6 shows a very typical behaviour, while patient #9 has aberrant results for electrode #12.

4 – Results and discussion

4.1 - Monte Carlo study

Let us consider the data set X1 contaminated with 20% of outliers of type T2. The Tucker3 model for this data set is presented in figure 3. As one can notice, there are ten objects far away from the remaining ones, and the Tucker3 model is highly influenced by them. For the same data set the robust Tucker3 model was constructed, and the object residuals from that model are presented in figure 4. The 10 outlying objects are correctly identified. After their removal, the final Tucker3 model is constructed; its results are presented in figure 5.

Fig. 3. Tucker3 model for data set X1 (90 % of explained variance) with 20 % of outliers (type T1).

Fig. 4. Residuals from the robust Tucker3 model, data set X1, 20 % contamination, type T1.

Fig. 5. Final Tucker3 model after elimination of the identified outliers.

For each studied data set, the Tucker3 and robust Tucker3 algorithms were run 100 times with random initialization of the loadings. The results for the discussed data set, expressed as the percentage of explained variance, are presented in bar form in figure 6-a.

Fig. 6. Monte Carlo study for the data set X1, type of outliers T2 and 20 % contamination: a) robust Tucker3 and b) Tucker3 models with random initialization of the loadings; c) robust Tucker3 and d) Tucker3 models with outliers generated randomly in each run.

The observed results show that the robust Tucker3 algorithm always converges to the proper solution, and that the outlying objects do not influence the final Tucker3 model. Analogous results for the (non-robust) Tucker3 model are presented in figure 6-b. They indicate that the Tucker3 algorithm is highly influenced by outliers and that, depending on the initialization of the loadings, it converges to different solutions. In the next step of our study, both algorithms, i.e. Tucker3 and robust Tucker3, were run 100 times, each time on a different data set contaminated randomly with 20 % of outliers constructed according to the chosen model (type T2). The results are presented in figure 6-c,d.
The robust Tucker3 algorithm always leads to the proper model, which is not influenced by the outlying objects, whereas the Tucker3 models are highly influenced by them. The calculations described above were performed for data sets contaminated with different percentages of outliers of different types. The final results, presented in figure 7, reveal that the proposed robust version of the Tucker3 model works properly for data sets containing no more than 20% of outlying samples. The robust models constructed for data sets X1 and X2 with 20% of outliers, i.e. data sets with a different percentage of explained variance, are not influenced by the outliers.

Fig. 7. Final results of the Monte Carlo study for 20 % contamination (data sets X1 'good model' and X2 'bad model'; types of outliers T1-T4).

The final results for data sets X1 and X2 with 40% of outliers are presented in figure 8. The robust model performed properly only for two types of outliers (T2 and T4). The results for types T1 and T3 were strongly influenced by the procedure used for the selection of the clean subset. Here the MVT results are presented; those obtained with MCD are somewhat better.

Fig. 8. Final results of the Monte Carlo study for 40 % contamination (data sets X1 'good model' and X2 'bad model'; types of outliers T1-T4).

Analogous calculations were performed for data sets with a clustering tendency. The results of the Monte Carlo study for these data sets lead to the same conclusions. While working with the highly contaminated data sets (40%), it was noticed that there is an essential difference depending on the method used to select the clean subset. In figure 9 the results for X1 (40% of outliers of type T1; simulation type 2) obtained with MVT and MCD are presented for illustrative purposes.

Fig. 9-a. Comparison of two algorithms for finding a clean subset: Multivariate Trimming (MVT).
Fig. 9-b. Comparison of two algorithms for finding a clean subset: Minimum Covariance Determinant (MCD).

The observed differences in MVT and MCD performance for highly contaminated data (40%) are associated with the different breakdown points of these methods. MCD, with a breakdown point of 50%, performs better, but due to the relatively long computation time required, it was not used in the Monte Carlo study.

4.2 - Real data set

The classical and robust Tucker3 algorithms were applied to the real data set. The results obtained for patient #6 (the one without an outlying object) show (Fig. 10-a,b) that the classical and the robust Tucker3 models are equivalent for this normal patient.

Fig. 10-a. A, B and C loading matrices and convergence times for patient #6. Tucker3 model.
Fig. 10-b. A, B and C loading matrices and convergence times for patient #6. Robust Tucker3 model.

Moreover, convergence is equally fast in both cases. For patient #9, the classical Tucker3 model (Fig. 11-a) already spots object #12 as an outlier on the A loading plot (corresponding to the electrodes dimension). This is even more obvious when using the robust version of the algorithm (Fig. 11-b), as the scale is different.
Fig. 11-a. A, B and C loading matrices and convergence times for patient #9. Tucker3 model.
Fig. 11-b. A, B and C loading matrices and convergence times for patient #9. Robust Tucker3 model.

In the case of the robust Tucker3 model, the loadings on B and C are no longer influenced by electrode #12, as the corresponding slice of the matrix is not used in the model construction. For patient #6, the residuals obtained for the 1st mode (electrodes dimension) with the classical method (Fig. 12-a) and the robust method (Fig. 12-b) show the same pattern. The situation is very different for patient #9. For the classical Tucker3 model, the residuals for electrode #12 (Fig. 12-c) are not higher than the residuals of the other points corresponding to good electrodes. The outlying electrode is therefore not revealed by the model residuals. For the robust Tucker3 model, the residuals for electrode #12 (Fig. 12-d) are extremely high and the outlier can be found and eliminated. In the robust Tucker3 approach, the loadings on A, B and C are truly robust. The reconstruction is good for all of the points except electrode #12.

Fig. 12. Residuals obtained for the reconstruction of the objects on the 1st mode (12 electrodes): a) patient #6, Tucker3 model; b) patient #6, robust Tucker3 model; c) patient #9, Tucker3 model; d) patient #9, robust Tucker3 model.

5 - Conclusion

The performed study shows that the robust version of the Tucker3 model always converges to a good solution when the data are contaminated by up to 20% of outliers. For 40% contamination, the algorithm converges to a good solution only for two types of outliers (T2 and T4). It can be concluded that MCD is a better algorithm than MVT for finding the clean subset. The robust Tucker3 algorithm also gives good results for the real data set.

ACKNOWLEDGEMENT

Professor Massart thanks the FWO project (G.0171.98) and the EU project NWAYQUAL (G6RD-CT-1999-00039) for funding this research.

REFERENCES

[1] Y. L. Xie, J. H. Wang, Y. Z. Liang, L. X. Sun, X. H. Song, R. Q. Yu, J. Chemom., 7 (1993) 527-541.
[2] B. Walczak, D. L. Massart, Chemom. Intell. Lab. Syst., 27 (1995) 354-362.
[3] I. N. Wakeling, H. J. H. Macfie, J. Chemom., 6 (1992) 189-198.
[4] J. D. Carroll, J. J. Chang, Psychometrika, 35 (1970) 283-319.
[5] R. A. Harshman, UCLA Working Papers in Phonetics, 16 (1970) 1-84.
[6] L. R. Tucker, Problems in Measuring Change, The University of Wisconsin Press, Madison, (1963) 122-137.
[7] L. R. Tucker, Psychometrika, 31 (1966) 279-311.
[8] C. A. Andersson, R. Bro, Chemom. Intell. Lab. Syst., 42 (1998) 93.
[9] L. P. Ammann, J. Am. Stat. Assoc., 88 (1994) 505-514.
[10] P. J. Rousseeuw, A. M. Leroy, Robust Regression and Outlier Detection, Wiley, New York, 1987.
[11] R. Gnanadesikan, J. R. Kettenring, Biometrics, 28 (1972) 81-124.
[12] P. J. Rousseeuw, K. Van Driessen, Technometrics, 41 (1999) 212.
[13] M. J. Aminoff, Electrodiagnosis in Clinical Neurology, second edition, Churchill Livingstone.
[14] H. H. Jasper, Electroencephalogr. Clin. Neurophysiol., 10 (1958) 370.

NEW TRENDS IN MULTIVARIATE ANALYSIS AND CALIBRATION

CONCLUSION

Chemometrics is by definition a discipline at the interface of several branches of science (chemistry, statistics, process engineering, etc.).
Chemometricians often have very different backgrounds, and the discipline has over time been enriched with many techniques from their respective original fields of research. The most common chemometrical modelling methods, together with some more advanced ones, in particular methods applying to data with complex structure, were presented in Chapter 1. Even from this necessarily non-exhaustive introduction, it can be seen that a very wide range of methods is available. The profusion of available options for the resolution of a given problem is usually the first issue encountered by chemometricians during a typical study. The choice of the best method to be used is very often made on the basis of subjective considerations such as personal preferences or software availability. The second chapter of this thesis was an attempt to rationalise this method selection step in the process of building a multivariate calibration model. Part of this work had already been done, covering the simplest and somewhat ideal case where the robustness of the calibration methods is not challenged. A very frequently occurring difficulty is extrapolation: a prediction has to be made for a new sample that lies outside the space covered by the calibration samples. From a purely statistical point of view, the answer to the problem is simple: no model should be used to predict an object outside the calibration domain. However, this problem can very often not be avoided when models are used in real-life industrial applications. All possible sources of variance cannot be foreseen when the model is constructed, and some are therefore not taken into account. The robustness of 14 methods toward extrapolation was studied using 5 reference data sets presenting challenging characteristics often found in industrial data (non-linearity, inhomogeneity). Some important conclusions were drawn from this study. First of all, it illustrated that the inevitable problem of extrapolation can indeed be dealt with in industrial applications. Some general recommendations and guidelines could also be made about the best method to be used depending on the expected level of extrapolation and the structure of the data set.

Another problem frequently occurs in real-life industrial conditions. Modifications in measurement conditions, aging, maintenance, or replacement of an instrument can induce drift and changes in the instrumental response. Most of the time it is not possible to take these perturbations into account in the calibration step. The quality of prediction for new samples can therefore be expected to degrade over time. The second study presented in Chapter 2 aimed at evaluating the robustness of calibration methods in the case of instrumental perturbations. It was performed on 12 multivariate calibration methods, using the same 5 industrial data sets as the previous study, and by simulating 6 different instrumental perturbations on the response obtained for the samples to be predicted. Some general recommendations could be made, in particular about the type of model, in terms of complexity or pretreatment, that has to be avoided in order to increase robustness toward instrumental perturbations.

The third and final part of Chapter 2 follows naturally from the comparative studies presented above. It aims at explaining, step by step, from data pre-processing to the prediction of new samples, how to develop a calibration model.
Even though this tutorial describes the construction of a Multiple Linear Regression model on spectroscopic data, most of the strategy can be applied to other calibration methods and/or data sets of a different nature.

The third chapter of this thesis presents some specific case studies. The aim of this chapter is twofold. First of all, the strategy and guidelines developed in Chapter 2 are applied to industrial data. The whole chapter illustrates how an industrial process can be improved by the proper use of chemometrical tools. It also gives another illustration of the importance of the method selection step. Using a different instrument for data acquisition can have a dramatic influence on the multivariate calibration model building process. Even though the studied process was the same and the nature of the spectroscopic technique remained unchanged, the fact that an instrument with better resolution was used implied that the best results were achieved by a different calibration method. The second important aspect of this study is that it was performed on Raman spectroscopic data. Sophisticated data treatment is usually not considered necessary for Raman data: specialists in the field mostly employ direct calibration, as opposed to the inverse calibration methods used by chemometricians. It was demonstrated that chemometrical tools can not only match the results obtained by the methods classically used on Raman data, but can even outperform them. While classical methods could only predict relative concentrations for the monitored chemical process, using inverse calibration it was possible for the first time to evaluate absolute concentrations, moreover with a much better level of precision.

The fourth chapter of this thesis continues the effort of broadening the field of applicability of chemometrics. This chapter is devoted to methodologies used to deal with data that can be considered original because of their structure and/or size. The first study in this chapter shows that data sets with a very high number of variables can be treated very efficiently by new algorithms designed specifically for such computationally intensive cases. Even though current computers have enough power and speed to deal with very big matrices in relatively short amounts of time, the existence of such methods can be very important in situations where the time factor is critical, for instance for on-line analysis or Statistical Process Control (SPC). The rest of this chapter was devoted to rather new techniques in the field of chemometrics: N-way methods. These methods take into account data with a more complex structure than the traditional tables (2 dimensions). It is important to realise that the N-way structure is not only dealt with, it is actually used to achieve a better understanding of the data structure and a more efficient extraction of the information contained in the data. A case study on a pharmaceutical data set with very high dimensionality (6 dimensions, over 225000 data points) showed that these methods (in particular the Tucker 3 model) are unmatched for the exploration of such data sets. The study of this data set, however, confirmed that N-way models are just as sensitive to outliers or extreme samples as classical methods. It was therefore investigated how the Tucker 3 model could be made more robust, and a methodology was proposed to this end.
This methodology proved efficient both on synthetic data sets and on the 6-way pharmaceutical data set.

Overall, this thesis confirmed that chemometrical methods can be applied to data coming from spectroscopic techniques other than NIR, and of course also to non-spectroscopic data. As was illustrated by the study of electro-encephalographic data with N-way models, new methods can help chemometrics set foot in new fields of science. Another example of this phenomenon is the current merging of chemometrics and Quantitative Structure-Activity Relationship (QSAR) modelling, which hopefully represents a step forward in the direction of the unification of all branches of computational chemistry.

PUBLICATION LIST

"A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part III : Predictive Ability under Instrumental Perturbation Conditions." F. Estienne, L. Pasti, V. Centner, B. Walczak, F. Despagne, D. Jouan-Rimbaud, O.E. de Noord, D.L. Massart. Submitted for publication.

"Chemometrics and Modelling." F. Estienne, Y. Vander Heyden, D.L. Massart. Chimia, 55 (2001) 70-80.

"Multi-way Modelling of Electro-encephalographic Data." F. Estienne, Ph. Ricoux, D. Leibovici, D.L. Massart. Chemometrics and Intelligent Laboratory Systems, 58 (2001) 59-72.

"Multivariate Calibration with Raman Data using Fast PCR and PLS Methods." F. Estienne, D.L. Massart. Analytica Chimica Acta, 450 (2001) 123-129.

"A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part II : Predictive Ability under Extrapolation Conditions." F. Estienne, L. Pasti, V. Centner, B. Walczak, F. Despagne, D. Jouan-Rimbaud, O.E. de Noord, D.L. Massart. Chemometrics and Intelligent Laboratory Systems, 58 (2001) 195-211.

"Robust Version of Tucker 3 Model." V. Pravdova, F. Estienne, B. Walczak, D.L. Massart. Chemometrics and Intelligent Laboratory Systems, 59 (2001) 75-88.

"Multivariate Calibration with Raman Spectroscopic Data : A Case Study." F. Estienne, N. Zanier, P. Marteau, D.L. Massart. Analytica Chimica Acta, 424 (2000) 185-201.

"The Development of Calibration Models for Spectroscopic Data Using Principal Component Regression." R. De Maesschalck, F. Estienne, J. Verdú-Andrés, A. Candolfi, V. Centner, F. Despagne, D. Jouan-Rimbaud, B. Walczak, D.L. Massart, S. de Jong, O.E. de Noord, C. Puel, B.M.G. Vandeginste. Internet Journal of Chemistry, 2 (1999) 19.