Chemometric Software For Multivariate Data Analysis Based On Matlab
Article info
Article history:
Received 17 November 2011
Received in revised form 1 March 2012
Accepted 5 March 2012
Available online 2 May 2012
Keywords:
Chemometrics software
Matlab
Multivariate analysis
Metabolomics/metabonomics
Multi-model comparison
Abstract
Multivariate data analysis (MultiDA), a chemometric software package with a user-friendly interface, has been developed for routine metabolomics/metabonomics data analysis. MultiDA has two main advantages. First, it simultaneously provides multiple methods for data preprocessing and multivariate analysis. The main chemometric methods in MultiDA include k-means cluster analysis, k-medoid cluster analysis, hierarchical cluster analysis (HCA), principal component analysis (PCA), robust principal component analysis (ROPCA), non-linear PCA (NLPCA), non-linear iterative partial least squares (NIPALS), SIMPLS, discriminant analysis (DA), canonical discriminant analysis (CDA), stepwise discriminant analysis (SDA) and uncorrelated linear discriminant analysis (ULDA), together with data preprocessing methods such as standardization, outlier detection, genetic algorithm for feature selection (GAFS), orthogonal signal correction (OSC) and weight analysis (Weight). Second, multi-model comparison can be conducted to obtain the best outcome. Moreover, the software is available for free.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction

Chemometrics is defined as "a chemical discipline that uses statistical and mathematical methods to design or select optimum procedures and experiments, and to provide maximum chemical information by analyzing chemical data" [1].
With the emergence and development of systems biology, including genomics, translatomics, proteomics and metabolomics, massive amounts of data are produced by instruments, and the subsequent data processing has become a challenge for the development of omics. Moreover, many algorithms address the same problem, such as the nine PLS1 algorithms compared in [2], which can confuse users without a statistical background. At the same time, no single method is best for all data: the choice of chemometric method depends on the data at hand. Thus, it is necessary to compare models built by different methods on the same data.
Matlab is a high-level technical computing language and interactive platform for algorithm development, data visualization, data analysis and numeric computation [3]. With the help of the graphical user interface (GUI) tools in Matlab, it is possible to develop user-friendly software; in this study, MultiDA was created on top of the Matlab GUI. Recently, several excellent Matlab toolboxes have been developed for multivariate data processing, such as ParLes [4] and TOMCAT [5]. Both are popular: ParLes focuses on spectroscopic data processing with limited multivariate calibration functionality, whereas TOMCAT emphasizes multivariate calibration, including many algorithms for PCA, robust PCA, PLS and robust PLS.
Fig. 1. Outline of MultiDA: the software is composed of seven compartments, i.e., data input, data preprocessing, cluster analysis, PCA, classification, PLS and figure processing. Each compartment is a separate panel.
Note that the data input compartment is a Microsoft spreadsheet ActiveX control, separated into an independent-variable input block and a group-variable (dependent-variable) input block, marked X data and Y data at the top of the spreadsheet (X for independent-variable data, Y for group-variable data), to reduce the complexity of subsequent data processing.
Fig. 2 displays the structure and data stream of the software. There are two ways to input data: the [ReadData] button reads data from the spreadsheet and [ImportData] reads from a file in a supported format. After the data are read, several preprocessing methods are available, such as standardization, weight analysis [6], outlier detection [7,8], GAFS [9–11] and OSC [12].
If the class information of the samples is unknown, cluster analysis is the natural choice. MultiDA provides three clustering methods: k-means cluster analysis, k-medoid cluster analysis and HCA. The commonly used unsupervised methods in MultiDA are PCA, ROPCA [13,14] and NLPCA [15,16], while supervised learning methods are also available when the sample classes are given, including linear DA (LDA), quadratic DA (QDA), Mahalanobis DA (MDA), CDA [17], SDA [18], ULDA [19,20], Kernel-PLS [21], NIPALS [22,23] and SIMPLS [24]. In actual data processing, one or more methods can be employed according to the requirements of the analysis, and the results from different methods can be compared with each other.
To show the results of these methods clearly, MultiDA provides many figures, such as a bar plot of the weight coefficients from weight analysis and a stem-and-leaf plot of the Hotelling T-square statistic of each sample in PCA. Matrices and structures are the formats used for data storage in MultiDA. All useful figures and data, especially intermediate variables, can be saved with the [Save Plot] or [Save Data] button, respectively.
Fig. 2. Structure and data stream of MultiDA: rectangles represent data in different formats. Ellipses show the chemometric capabilities. Round rectangles stand for the data input buttons in the software. Double-line arrows display the main direction of data transfer. Dashed-line arrows indicate complementary data streams. Brackets mean that all data in processing can be saved in figure or mat format.
2.4. Others
2.4.1. Transparent data analysis
MultiDA provides a relatively transparent data analysis environment. All the useful data produced by MultiDA are stored in the handle object and subsequently saved in the workspace under the name of the algorithm tag, including intermediate data for error checking. Taking ULDA as an example, a structure variable named ULDA is exported to the Matlab workspace when the ULDA run finishes. The ULDA structure contains the following fields:
NumberOfEachGroup: number of samples in each group
MeanXByGroup: mean of each variable in each group
WithInDeviation: within-class deviation
TotalDeviation: total deviation
BetweenClassScatter: between-class scatter
Transmits: transformation matrix
UDV: uncorrelated discriminant vectors
CrossValidation: recognition, prediction, correct and five-fold recognition rates of ULDA
Sometimes it is preferable to draw a plot directly from the UDVs in the Matlab command window rather than by sample number. Because it is usually difficult to obtain more than one UDV, the default figure uses the sample number to avoid errors. Fig. 3 shows the scatter plot obtained from UDV1 and UDV2; it is far more informative than the plot by sample number.
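For example, assuming the exported ULDA.UDV field holds the discriminant vectors as columns (a minimal sketch under that assumption, not MultiDA's own plotting code), a Fig. 3 style plot can be drawn from the command window:

```matlab
% Project the data onto the first two uncorrelated discriminant vectors
% and plot the scores by group. X: n-by-p data matrix, Y: class labels.
scores = X * ULDA.UDV;                   % n-by-k matrix of UDV scores
gscatter(scores(:, 1), scores(:, 2), Y); % grouped scatter (Statistics Toolbox)
xlabel('UDV1'); ylabel('UDV2');
```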
3. Applications

3.1. Data acquisition

The wine data set [25] from the UCI Machine Learning Repository is employed to test the functions of MultiDA and to demonstrate the multi-model comparison approach. The wine data set contains the quantities of 13 constituents found in 3 types of wine across 178 wine samples. The 13 constituents are alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines and proline.
There are two ways to input data into the software: the [ReadData] and [ImportData] buttons. [ReadData] gets the data in both the X and Y spreadsheets. [ImportData] imports data from a specified file.
The file types MultiDA can recognize are listed in Table 1. The [XdataLabel] field in the [ImportData] dialog stores labels for the samples or variables of the X data, which can be invoked with the [Label] button in the scores plot, loadings plot and elsewhere.
3.2. Methods of data preprocessing

MultiDA provides many data preprocessing methods, including descriptive analysis, data standardization, outlier detection and OSC.
3.2.1. Descriptive analysis and standardization

Descriptive analysis describes the main features of a collection of data quantitatively. MultiDA provides a fundamental analysis of the within-class data and the whole data set, including the mean, median, standard deviation, variance, maximum, minimum, kurtosis, skewness and coefficient of variation. Descriptive analysis thus gives an overview of the data.
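As a minimal sketch (not MultiDA's own code), these summaries are one-liners in Matlab; kurtosis and skewness assume the Statistics Toolbox, and X is the n-by-p data matrix:

```matlab
% Per-variable descriptive summary of the whole data set; MultiDA also
% reports the same statistics within each class.
summaryStats = [mean(X); median(X); std(X); var(X); max(X); min(X); ...
                kurtosis(X); skewness(X); std(X) ./ mean(X)];
% Rows: mean, median, std, variance, max, min, kurtosis, skewness,
% coefficient of variation; columns: the 13 wine variables.
```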
The descriptive analysis results indicate that the magnitudes of the wine data vary from 0.1 to 1000 and the variances from 0.01 to 100,000. This means that almost 99.9% of the variance is concentrated in a few variables, which dramatically masks the effect of the other variables. The prediction errors of PCA-DA and Zscores-PCA-DA show a prominent improvement of predictive ability after standardization (data not shown). In MultiDA, the [Standerlize] button provides five methods for standardization: log-transform, centering, Z-scores, min-max normalization and decimal scaling.
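A minimal sketch of the Z-scores option (the other four follow the same pattern); the implicit expansion here needs R2016b or later:

```matlab
% Z-scores standardization: center each column by its mean and scale by
% its standard deviation, so no variable dominates through sheer magnitude.
Xz = (X - mean(X)) ./ std(X);
% Equivalent, with the Statistics Toolbox: Xz = zscore(X);
```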
3.2.2. Outlier detection

As defined by Grubbs [7], an outlier is one that "appears to deviate markedly from other members of the sample in which it occurs." MultiDA provides two outlier detection methods (the Grubbs test [7] and Wilk's method [8,26]) and three graphical descriptions of outliers (the PCA boxplot, the stem-and-leaf plot of Hotelling's T-square and the ROPCA distance scatter plot). The [Outlier] button invokes a dialog for choosing between the Grubbs test and Wilk's method. MultiDA does not offer a direct control interface for the graphical outlier descriptions: the PCA boxplot is a partial outcome of outlier analysis, the stem-and-leaf plot of Hotelling's T-square is output by PCA, and the ROPCA score diagnostic plot is produced by ROPCA.
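A minimal sketch of a two-sided Grubbs test on a single variable (the textbook test, not MultiDA's own routine; the 0.05 significance level is an illustrative choice):

```matlab
% Grubbs test: compare the largest standardized deviation from the mean
% against a critical value derived from the t distribution.
x = X(:, 1);                               % test one variable at a time
n = numel(x);
[G, idx] = max(abs(x - mean(x)) / std(x)); % Grubbs statistic and suspect
alpha = 0.05;
t2 = tinv(alpha / (2*n), n - 2)^2;         % squared critical t-quantile
Gcrit = (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2));
if G > Gcrit
    fprintf('Sample %d flagged as an outlier (G = %.3f > %.3f)\n', idx, G, Gcrit);
end
```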
3.2.3. Multi-comparison of outlier detection

Table 2 displays the outliers recognized by the different methods. The results indicate that different outlier detection methods tend to select different outliers. However, the 70th and 96th samples were selected by all methods, so multi-model comparison suggests that these samples can be classified as extreme outliers. Fig. 4 presents a visual summary of the three graphical outlier descriptions. Interestingly, all outliers detected above belong to group 2 (Fig. 4a and b), which might suggest that samples in group 2 show greater variation. An overview of outlier detection based on different methods thus gives a more comprehensive insight into the wine data.
Table 1
Files that MultiDA can recognize: text files, Excel files, image files, sound files and others.
Fig. 3. Scatter plot obtained from UDV1 and UDV2 of ULDA. The x and y axes represent uncorrelated discriminant vectors (UDV) 1 and 2. Star, circle and square represent the three different groups. The two UDVs give a perfect separation of the three groups in the wine data.
Different variables may affect different groups. For example, Fig. 5c indicates that color intensity is the key factor for discriminating groups 1 and 3, while flavanoids influence the classification of groups 2 and 3 according to Fig. 5d.
Table 2
Outliers detected by different outlier detection methods and graphical outlier descriptions.

Method                      Outliers detected
Grubbs test                 60, 70, 96
Wilk's method               159, 70, 96, 74, 122
Box plot                    96, 70, 97, 74, 122
Hotelling T-square plot     122, 70, 96, 74, 159, 111
Distance plot               74, 96, 122, 70, 79, 158, 160
3.3. Feature selection

To select the variables most relevant to discrimination, MultiDA provides weight analysis, a genetic algorithm, CART, PLS weight and stepwise DA ([Stepwise]).
3.3.1. Weight analysis

The purpose of weight analysis is to find the best variable set for discriminating between two groups. A variance weight method based on Liang et al. [6] is introduced to achieve this aim:
$$w_{j}=\frac{\dfrac{1}{n_{C}n_{T}}\sum_{c=1}^{n_{C}}\sum_{t=1}^{n_{T}}\left(x_{cj}-x_{tj}\right)^{2}}{\dfrac{1}{n_{C}^{2}}\sum_{c'=1}^{n_{C}}\sum_{c=1}^{n_{C}}\left(x_{c'j}-x_{cj}\right)^{2}+\dfrac{1}{n_{T}^{2}}\sum_{t'=1}^{n_{T}}\sum_{t=1}^{n_{T}}\left(x_{t'j}-x_{tj}\right)^{2}}$$

where $n_C$ and $n_T$ are the numbers of samples in groups $C$ and $T$, and $x_{cj}$ denotes the value of variable $j$ for sample $c$. A large $w_j$ means that variable $j$ separates the two groups well relative to its within-group variation.
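A minimal Matlab sketch of this weight for a single variable (a transcription of the formula above, not MultiDA's own code; the implicit expansion in xc - xt.' needs R2016b or later):

```matlab
% Variance weight of one variable between two groups: mean squared
% cross-group difference over the mean squared within-group differences.
% xc, xt: column vectors with the variable's values in groups C and T.
function w = varianceWeight(xc, xt)
    nC = numel(xc); nT = numel(xt);
    between = sum(sum((xc - xt.').^2)) / (nC * nT);  % all C-T pairs
    withinC = sum(sum((xc - xc.').^2)) / nC^2;       % all C-C pairs
    withinT = sum(sum((xt - xt.').^2)) / nT^2;       % all T-T pairs
    w = between / (withinC + withinT);
end
```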
Fig. 4. Three graphical descriptions of outliers: a) PCA boxplot of the three groups at the first component, b) PCA boxplot at the second component, c) score diagnostic plot, where each sample has a score distance within the PC space and an orthogonal distance to the PCA space, and d) stem-and-leaf plot of the Hotelling T-square statistic of each sample. Outliers are marked with their sample numbers in all four figures.
Fig. 5. Weight analysis GUI and bar plots of the weight coefficient of each variable between different group pairs: a) layout of the weight analysis GUI, b) weight coefficients between groups 1 and 2, c) groups 1 and 3, d) groups 2 and 3. In the weight coefficient bar plots, the x axis represents the variables and the y axis indicates the discriminating ability of each variable for each group pair.
A low BIC of 14.4213 indicates a robust and concise model. Variables 2, 3, 4, 5, 9, 10, 11 and 12 were selected by only some of the methods, and a small RMSE could be obtained when they were involved in the model, so they can be classified as assistant variables. Similarly, total phenols (variable 6) and nonflavanoid phenols (variable 8) were never selected by any method, implying that they contribute little to discriminating the different types of wine.
Fig. 6. Genetic algorithm for feature selection: a) outline of the GAFS GUI, b) fitness of the population over generations based on NIPALS. Circles represent the mean fitness of each generation and stars the maximum fitness in each generation; the x axis is the generation and the y axis the fitness. The number denotes the best fitness over all generations. If the goal is met, the run stops prematurely.
Thus, multi-model comparison of the various feature selection methods gives deeper insight into each variable and a more accurate conclusion.
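As a rough illustration of how such GA-based selection can be wired up in Matlab (a hypothetical sketch, not MultiDA's GAFS code; it assumes the Global Optimization Toolbox for ga and the Statistics Toolbox for crossval and classify):

```matlab
% GA feature selection: each chromosome is a bit mask over the variables;
% the fitness to minimize is the cross-validated LDA misclassification rate.
opts = optimoptions('ga', 'PopulationType', 'bitstring', ...
                    'PopulationSize', 50, 'MaxGenerations', 100);
nVar = size(X, 2);
mask = ga(@(m) maskError(m, X, Y), nVar, [], [], [], [], [], [], [], opts);
selected = find(mask)                     % indices of the retained variables

function err = maskError(mask, X, Y)
    mask = logical(mask);
    if ~any(mask), err = 1; return; end   % penalize the empty subset
    predfun = @(Xtr, Ytr, Xte) classify(Xte, Xtr, Ytr, 'linear');
    err = crossval('mcr', X(:, mask), Y, 'Predfun', predfun, 'KFold', 5);
end
```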
3.4. Supervised learning methods

MultiDA provides the following algorithms for supervised analysis: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), Mahalanobis discriminant analysis (MDA), CDA, ULDA and PLS.
3.4.1. Discriminant analysis

LDA, QDA and MDA are invoked with the [Classification] button. A prior probability must be selected before the classification analysis; MultiDA supplies two kinds, "All Groups Equal" and "Weight by Group Size". After the classification analysis, the territorial map and tree plot can be output. CDA is related to PCA and canonical correlation. The original CDA program by Trujillo-Ortiz was obtained from the Matlab File Exchange [17] and revised.
Leave-one-out (LOO) and k-fold cross-validation are available for evaluating the discriminant algorithms. PCA-DA is also available for tackling high-dimensional data.
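A minimal sketch of PCA-DA with leave-one-out cross-validation (assuming the Statistics Toolbox; the five retained components are an arbitrary illustration, and the implicit expansion needs R2016b or later):

```matlab
% PCA-DA with LOO: project standardized data onto the first k principal
% components, then classify with LDA, leaving one sample out at a time.
k = 5;                                  % number of PCs (illustrative)
n = size(X, 1);
pred = zeros(n, 1);
for i = 1:n
    train = setdiff(1:n, i);
    mu = mean(X(train, :));  sd = std(X(train, :));
    Ztr = (X(train, :) - mu) ./ sd;     % standardize on the training set only
    [coeff, score] = pca(Ztr);
    Zte = ((X(i, :) - mu) ./ sd) * coeff(:, 1:k);
    pred(i) = classify(Zte, score(:, 1:k), Y(train), 'linear');
end
looError = mean(pred ~= Y);             % LOO misclassification rate
```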
3.4.2. Partial least squares analysis

Partial least squares (PLS) is a commonly used method for modeling the relations between multiple independent variables and one (PLS1) or several (PLS2) dependent variables. MultiDA provides the Kernel-PLS, NIPALS and SIMPLS algorithms for PLS. [NIPALS] and [SIMPLS] invoke a dialog for choosing the number of components; at the same time a figure is displayed showing how AIC and RMSE vary with the number of components. As with PCA, the scores and loadings plots are easily output for PLS. LOO and k-fold cross-validation are also available, as in discriminant analysis.
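For reference, Matlab's own plsregress (Statistics Toolbox) implements SIMPLS; a minimal sketch of picking the number of components from the 10-fold cross-validated error (the 15-component cap is an arbitrary illustration):

```matlab
% Choose the number of PLS components by cross-validated RMSE.
Ydummy = dummyvar(Y);                        % PLS2 coding of the class labels
[~, ~, ~, ~, ~, ~, MSE] = plsregress(zscore(X), Ydummy, 15, 'cv', 10);
rmse = sqrt(MSE(2, 2:end));                  % row 2: response error; col 1 is 0 components
[bestRmse, nComp] = min(rmse);
fprintf('Best: %d components, RMSE = %.4f\n', nComp, bestRmse);
plot(1:numel(rmse), rmse, '-o'); xlabel('Components'); ylabel('RMSECV');
```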
3.4.3. Multi-comparison of supervised learning methods

Table 4 displays the multi-model comparison of the different supervised learning methods in terms of 10-fold cross-validated RMSEP and RMSER. Both CDA and ULDA gave perfect recognition and prediction (zero error). Multi-model comparison therefore shows that ULDA and CDA are the most suitable methods for analyzing this wine data set.
3.5. Unsupervised learning methods

Cluster analysis and PCA are the two most important unsupervised learning methods. MultiDA offers three methods for cluster analysis, namely k-means ([KMeans]), k-medoid ([KMedoid]) and hierarchical clustering ([HierachicalClustering]), and three methods for PCA: classical PCA, robust PCA and NLPCA.
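A minimal sketch of this kind of comparison in plain Matlab (assuming the Statistics Toolbox): cluster the standardized data with k-means and display the cluster labels on the PCA scores plot, as in Fig. 7.

```matlab
% k-means clustering of the standardized wine data, visualized on the
% first two principal component scores (cf. Fig. 7).
Z = zscore(X);                          % standardize first (see 3.2.1)
idx = kmeans(Z, 3, 'Replicates', 10);   % 3 clusters, restarts avoid local minima
[~, score, ~, ~, explained] = pca(Z);
gscatter(score(:, 1), score(:, 2), idx);
xlabel(sprintf('PC1 (%.1f%%)', explained(1)));
ylabel(sprintf('PC2 (%.1f%%)', explained(2)));
```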
Acknowledgements

This work is supported by the National Natural Science Foundation of China (90709014).
Table 3
Multi-model comparison of different feature selection methods.

Method            Co-variables        RMSEP     RMSER     BIC
Normal            1–13                0.0618    0.0546    1.9601
Weight analysis   1, 7, 10–13         0.0278    0.0305    15.0633
GA NIPALS         1, 2, 4, 7, 9–13    0.0562    0.0431    9.1189
CART              5, 7, 11–13         0.0781    0.0614    10.8625
NIPALS weight     1, 4, 5, 13         0.2245    0.1327    6.1949
                  1–4, 7, 10–13       0.0451    0.0377    8.2655
Stepwise          1, 7, 13            0.0614    0.0584    14.4213
Table 4
Comparison of different supervised learning methods.

Method    RMSEP-10CV    RMSER-10CV
LDA       0.0056        0
QDA       0.0056        0.0047
MDA       0.0222        0
CDA       0             0
ULDA      0             0
NIPALS    0.0448        0.0322
SIMPLS    0.0448        0.0273

10CV: 10-fold cross-validation.
Fig. 7. PCA scores plots grouped by the three cluster methods and by the raw groups: a) PCA scores plot grouped by k-means cluster analysis, b) PCA scores plot grouped by the raw groups, c) PCA scores plot grouped by hierarchical cluster analysis, d) PCA scores plot grouped by k-medoid cluster analysis. The x and y axes indicate principal component scores 1 and 2 with the explained variance in all four figures; star, circle and square represent the three different groups.
Fig. 8. Outline of the non-linear PCA GUI: a) NLPCA GUI profile. The number of nodes in the three hidden layers, the number of epochs, the error, the learning rate and the display interval are adjustable. b) Scores plot of the wine data processed by NLPCA with a 20-3-20 hidden layer structure. The x and y axes are the outputs of the bottleneck-layer nodes, respectively; star, circle and square represent the three different groups.
References
[1] K. Varmuza, P. Filzmoser, Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press, Boca Raton, 2009.
[2] M. Andersson, A comparison of nine PLS1 algorithms, Journal of Chemometrics 23 (2009) 518–529.
[3] Matlab, The MathWorks, Inc., Natick, MA (USA), http://www.mathworks.com.
[4] R.A. Viscarra Rossel, ParLeS: software for chemometric analysis of spectroscopic data, Chemometrics and Intelligent Laboratory Systems 90 (2008) 72–83.
[18] H.W. Wang, Partial Least-Squares Regression Method and Applications, National Defense Industry Press, Beijing, 1994.
[19] D. Yuan, Y. Liang, L. Yi, Q. Xu, O. Kvalheim, Uncorrelated linear discriminant analysis (ULDA): a powerful tool for exploration of metabolomics data, Chemometrics and Intelligent Laboratory Systems 93 (2008) 70–79.
[20] Y. Xu, J.Y. Yang, Z. Jin, A novel method for Fisher discriminant analysis, Pattern Recognition 37 (2004) 381–384.
[21] F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS, Journal of Chemometrics 7 (1993) 45–49.
[22] H. Wold, Nonlinear estimation by iterative least squares procedures, in: Research Papers in Statistics, Wiley, New York, 1966, pp. 411–444.
[23] P. Geladi, B.R. Kowalski, Partial least squares regression: a tutorial, Analytica Chimica Acta 185 (1986) 1–17.
[24] S. de Jong, SIMPLS: an alternative approach to partial least squares regression, Chemometrics and Intelligent Laboratory Systems 18 (1993) 251–263.
[25] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, 2010.
[26] S. Yang, Y. Lee, Identification of a multivariate outlier, Annual Meeting of the American Statistical Association, San Francisco, 1987.