Chemometric Software For Multivariate Data Analysis Based On Matlab
Article info
Article history:
Received 17 November 2011
Received in revised form 1 March 2012
Accepted 5 March 2012
Available online 2 May 2012
Keywords:
Chemometrics software
Matlab
Multivariate analysis
Metabolomics/metabonomics
Multi-model comparison
Abstract
Multivariate data analysis (MultiDA), a chemometric software package with a user-friendly interface, has been developed for routine metabolomics/metabonomics data analysis. MultiDA has two main advantages. First, it simultaneously provides multiple methods for data preprocessing and multivariate analysis. The main chemometric methods in MultiDA include k-means cluster analysis, k-medoid cluster analysis, hierarchical cluster analysis (HCA), principal component analysis (PCA), robust principal component analysis (ROPCA), non-linear PCA (NLPCA), non-linear iterative partial least squares (NIPALS), SIMPLS, discriminant analysis (DA), canonical discriminant analysis (CDA), stepwise discriminant analysis (SDA) and uncorrelated linear discriminant analysis (ULDA), together with data preprocessing methods such as standardization, outlier detection, genetic algorithm for feature selection (GAFS), orthogonal signal correction (OSC) and weight analysis (Weight). Second, multi-model comparison can be conducted to obtain the best outcome. Moreover, the software is available for free.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction

Chemometrics is defined as "a chemical discipline that uses statistical and mathematical methods to design or select optimum procedures and experiments, and to provide maximum chemical information by analyzing chemical data" [1].
With the emergence and development of systems biology, including genomics, translatomics, proteomics and metabolomics, massive amounts of data are produced by instruments, and the subsequent data processing has become a challenge for the development of omics. Moreover, many algorithms address the same problem, such as the nine PLS1 algorithms compared in [2], which can confuse users without a statistical background. At the same time, no single method is best for all data: the choice of chemometric method depends on the data at hand. Thus, it is necessary to compare models built by different methods on the same data.
Matlab is a high-level technical computing language and interactive platform for algorithm development, data visualization, data analysis and numeric computation [3]. With the help of the graphical user interface (GUI) tools in Matlab, it is possible to develop user-friendly software; in this study, MultiDA was created on top of the Matlab GUI. Recently, several excellent Matlab toolboxes have been developed for multivariate data processing, such as ParLes [4] and TOMCAT [5]. Both are popular: ParLes focuses on spectroscopic data processing with limited multivariate calibration functionality, whereas TOMCAT emphasizes multivariate calibration, including many algorithms for PCA, robust PCA, PLS and robust PLS.
Fig. 1. Outline of MultiDA: the software is composed of seven compartments, i.e., data input, data preprocessing, cluster analysis, PCA, classification, PLS and figure processing. Each compartment is a separate panel.
Note that the data input compartment is a Microsoft spreadsheet ActiveX control, separated into an independent-variable input block and a group-variable (dependent-variable) input block, marked X data and Y data at the top of the spreadsheet (X for independent-variable data, Y for group-variable data), to reduce the complexity of subsequent data processing.
Fig. 2 displays the structure and data stream of the software. There are two ways to input data: the [ReadData] button reads data from the spreadsheet and [ImportData] reads from a file in a supported format. After the data are read, several preprocessing methods are available, such as standardization, weight analysis [6], outlier detection [7,8], GAFS [9–11] and OSC [12].
If the class information of the samples is unknown, cluster analysis is the natural choice. MultiDA provides three clustering methods: k-means cluster analysis, k-medoid cluster analysis and HCA. The commonly used unsupervised methods in MultiDA are PCA, ROPCA [13,14] and NLPCA [15,16], while supervised learning methods are also available when the sample classes are given, including linear DA (LDA), quadratic DA (QDA), Mahalanobis DA (MDA), CDA [17], SDA [18], ULDA [19,20], Kernel-PLS [21], NIPALS [22,23] and SIMPLS [24]. In actual data processing, one or more methods can be employed according to the requirements of the analysis, and the results from different methods can be compared with each other.
To show the results of these methods clearly, MultiDA provides many figures, such as a bar plot of the weight coefficients from weight analysis and a stem-and-leaf plot of the Hotelling T-square statistic of each sample in PCA. Matrices and structures are the formats used for data storage in MultiDA. All useful figures and data, especially intermediate variables, can be saved with the [Save Plot] or [Save Data] button, respectively.
Fig. 2. Structure and data stream of MultiDA: rectangles represent data in different formats. Ellipses show the chemometric capabilities. Round rectangles stand for the data input buttons in the software. Double-line arrows display the main direction of data transfer. Dashed-line arrows indicate complementary data streams. Brackets mean that all data in processing can be saved in figure or mat format.
2.4. Others
2.4.1. Transparent data analysis
MultiDA provides a relatively transparent data analysis environment. All the useful data produced by MultiDA are stored in the handle object and subsequently saved in the workspace under the name of the algorithm tag, including intermediate data for error checking. Taking ULDA as an example, a structure variable named ULDA is exported to the Matlab workspace when the ULDA run finishes. The ULDA structure contains the following fields:
NumberOfEachGroup: number of samples in each group
MeanXByGroup: mean of each variable in each group
WithInDeviation: within-class deviation
TotalDeviation: total deviation
BetweenClassScatter: between-class scatter
Transmits: transformation matrix
UDV: uncorrelated discriminant vectors
CrossValidation: recognition, prediction, correct and five-fold recognition rates of ULDA
Sometimes it is preferable to draw a plot directly from the UDVs in the Matlab command window rather than by sample number. Because it is usually difficult to obtain more than one UDV, the default figure uses the sample number to avoid errors. Fig. 3 shows the scatter plot obtained from UDV1 and UDV2; it is far more informative than the plot by sample number.
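For example, assuming the exported ULDA.UDV field holds the discriminant vectors as columns (a minimal sketch under that assumption, not MultiDA's own plotting code), a Fig. 3 style plot can be drawn from the command window:

```matlab
% Project the data onto the first two uncorrelated discriminant vectors
% and plot the scores by group. X: n-by-p data matrix, Y: class labels.
scores = X * ULDA.UDV;                   % n-by-k matrix of UDV scores
gscatter(scores(:, 1), scores(:, 2), Y); % grouped scatter (Statistics Toolbox)
xlabel('UDV1'); ylabel('UDV2');
```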
3. Applications

3.1. Data acquisition

The wine data set [25] from the UCI Machine Learning Repository is employed to test the functions of MultiDA and to demonstrate the multi-model comparison approach. The wine data set contains the quantities of 13 constituents found in 3 types of wine across 178 wine samples. The 13 constituents are alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines and proline.
There are two ways to input data into the software: the [ReadData] and [ImportData] buttons. [ReadData] gets the data in both the X and Y spreadsheets. [ImportData] imports data from a specified file.
The file types MultiDA can recognize are listed in Table 1. The [XdataLabel] field in the [ImportData] dialog stores labels for the samples or variables of the X data, which can be invoked with the [Label] button in the scores plot, loadings plot and elsewhere.
3.2. Methods of data preprocessing

MultiDA provides many data preprocessing methods, including descriptive analysis, data standardization, outlier detection and OSC.
3.2.1. Descriptive analysis and standardization

Descriptive analysis describes the main features of a collection of data quantitatively. MultiDA provides a fundamental analysis of the within-class data and the whole data set, including the mean, median, standard deviation, variance, maximum, minimum, kurtosis, skewness and coefficient of variation. Descriptive analysis thus gives an overview of the data.
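As a minimal sketch (not MultiDA's own code), these summaries are one-liners in Matlab; kurtosis and skewness assume the Statistics Toolbox, and X is the n-by-p data matrix:

```matlab
% Per-variable descriptive summary of the whole data set; MultiDA also
% reports the same statistics within each class.
summaryStats = [mean(X); median(X); std(X); var(X); max(X); min(X); ...
                kurtosis(X); skewness(X); std(X) ./ mean(X)];
% Rows: mean, median, std, variance, max, min, kurtosis, skewness,
% coefficient of variation; columns: the 13 wine variables.
```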
The descriptive analysis results indicate that the magnitudes of the wine data vary from 0.1 to 1000 and the variances from 0.01 to 100,000. This means that almost 99.9% of the variance is concentrated in a few variables, which dramatically masks the effect of the other variables. The prediction errors of PCA-DA and Zscores-PCA-DA show a prominent improvement of predictive ability after standardization (data not shown). In MultiDA, the [Standerlize] button provides five methods for standardization: log-transform, centering, Z-scores, min-max normalization and decimal scaling.
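A minimal sketch of the Z-scores option (the other four follow the same pattern); the implicit expansion here needs R2016b or later:

```matlab
% Z-scores standardization: center each column by its mean and scale by
% its standard deviation, so no variable dominates through sheer magnitude.
Xz = (X - mean(X)) ./ std(X);
% Equivalent, with the Statistics Toolbox: Xz = zscore(X);
```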
3.2.2. Outlier detection

As defined by Grubbs [7], an outlier is one that "appears to deviate markedly from other members of the sample in which it occurs." MultiDA provides two outlier detection methods (the Grubbs test [7] and Wilk's method [8,26]) and three graphical descriptions of outliers (the PCA boxplot, the stem-and-leaf plot of Hotelling's T-square and the ROPCA distance scatter plot). The [Outlier] button invokes a dialog for choosing between the Grubbs test and Wilk's method. MultiDA does not offer a direct control interface for the graphical outlier descriptions: the PCA boxplot is a partial outcome of outlier analysis, the stem-and-leaf plot of Hotelling's T-square is output by PCA, and the ROPCA score diagnostic plot is produced by ROPCA.
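A minimal sketch of a two-sided Grubbs test on a single variable (the textbook test, not MultiDA's own routine; the 0.05 significance level is an illustrative choice):

```matlab
% Grubbs test: compare the largest standardized deviation from the mean
% against a critical value derived from the t distribution.
x = X(:, 1);                               % test one variable at a time
n = numel(x);
[G, idx] = max(abs(x - mean(x)) / std(x)); % Grubbs statistic and suspect
alpha = 0.05;
t2 = tinv(alpha / (2*n), n - 2)^2;         % squared critical t-quantile
Gcrit = (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2));
if G > Gcrit
    fprintf('Sample %d flagged as an outlier (G = %.3f > %.3f)\n', idx, G, Gcrit);
end
```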
3.2.3. Multi-comparison of outlier detection

Table 2 displays the outliers recognized by the different methods. The results indicate that different outlier detection methods tend to select different outliers. However, the 70th and 96th samples were selected by all methods, so multi-model comparison suggests that these samples can be classified as extreme outliers. Fig. 4 presents a visual summary of the three graphical outlier descriptions. Interestingly, all outliers detected above belong to group 2 (Fig. 4a and b), which might suggest that samples in group 2 show greater variation. An overview of outlier detection based on different methods thus gives a more comprehensive insight into the wine data.
Table 1
Files that MultiDA can recognize: text files, Excel files, image files, sound files and others.
Fig. 3. Scatter plot obtained from UDV1 and UDV2 of ULDA. The x and y axes represent uncorrelated discriminant vectors (UDV) 1 and 2. Star, circle and square represent the three different groups. The two UDVs give a perfect separation of the three groups in the wine data.
Different variables may affect different groups. For example, Fig. 5c indicates that color intensity is the key factor for discriminating groups 1 and 3, while flavanoids influence the classification of groups 2 and 3 according to Fig. 5d.
Table 2
Outliers detected by different outlier detection methods and graphical outlier descriptions.

Method                      Outliers detected
Grubbs test                 60, 70, 96
Wilk's method               159, 70, 96, 74, 122
Box plot                    96, 70, 97, 74, 122
Hotelling T-square plot     122, 70, 96, 74, 159, 111
Distance plot               74, 96, 122, 70, 79, 158, 160
3.3. Feature selection

To select the variables most relevant to discrimination, MultiDA provides weight analysis, a genetic algorithm, CART, PLS weight and stepwise DA ([Stepwise]).
3.3.1. Weight analysis

The purpose of weight analysis is to find the best variable set for discriminating between two groups. A variance weight method based on Liang et al. [6] is introduced to achieve this aim:
$$w_{j}=\frac{\dfrac{1}{n_{C}n_{T}}\sum_{c=1}^{n_{C}}\sum_{t=1}^{n_{T}}\left(x_{cj}-x_{tj}\right)^{2}}{\dfrac{1}{n_{C}^{2}}\sum_{c'=1}^{n_{C}}\sum_{c=1}^{n_{C}}\left(x_{c'j}-x_{cj}\right)^{2}+\dfrac{1}{n_{T}^{2}}\sum_{t'=1}^{n_{T}}\sum_{t=1}^{n_{T}}\left(x_{t'j}-x_{tj}\right)^{2}}$$

where $n_C$ and $n_T$ are the numbers of samples in groups $C$ and $T$, and $x_{cj}$ denotes the value of variable $j$ for sample $c$. A large $w_j$ means that variable $j$ separates the two groups well relative to its within-group variation.
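A minimal Matlab sketch of this weight for a single variable (a transcription of the formula above, not MultiDA's own code; the implicit expansion in xc - xt.' needs R2016b or later):

```matlab
% Variance weight of one variable between two groups: mean squared
% cross-group difference over the mean squared within-group differences.
% xc, xt: column vectors with the variable's values in groups C and T.
function w = varianceWeight(xc, xt)
    nC = numel(xc); nT = numel(xt);
    between = sum(sum((xc - xt.').^2)) / (nC * nT);  % all C-T pairs
    withinC = sum(sum((xc - xc.').^2)) / nC^2;       % all C-C pairs
    withinT = sum(sum((xt - xt.').^2)) / nT^2;       % all T-T pairs
    w = between / (withinC + withinT);
end
```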
Fig. 4. Three graphical descriptions of outliers: a) PCA boxplot of the three groups at the first component, b) PCA boxplot at the second component, c) score diagnostic plot, where each sample has a score distance within the PC space and an orthogonal distance to the PCA space, and d) stem-and-leaf plot of the Hotelling T-square statistic of each sample. Outliers are marked with their sample numbers in all four figures.
Fig. 5. Weight analysis GUI and bar plots of the weight coefficient of each variable between different group pairs: a) layout of the weight analysis GUI, b) weight coefficients between groups 1 and 2, c) groups 1 and 3, d) groups 2 and 3. In the weight coefficient bar plots, the x axis represents the variables and the y axis indicates the discriminating ability of each variable for each group pair.
A low BIC of 14.4213 indicates a robust and concise model. Variables 2, 3, 4, 5, 9, 10, 11 and 12 were selected by only some of the methods, and a small RMSE could be obtained when they were involved in the model, so they can be classified as assistant variables. Similarly, total phenols (variable 6) and nonflavanoid phenols (variable 8) were never selected by any method, implying that they contribute little to discriminating the different types of wine.
Fig. 6. Genetic algorithm for feature selection: a) outline of the GAFS GUI, b) fitness of the population over generations based on NIPALS. Circles represent the mean fitness of each generation and stars the maximum fitness in each generation; the x axis is the generation and the y axis the fitness. The number denotes the best fitness over all generations. If the goal is met, the run stops prematurely.
Thus, multi-model comparison of the various feature selection methods gives deeper insight into each variable and a more accurate conclusion.
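As a rough illustration of how such GA-based selection can be wired up in Matlab (a hypothetical sketch, not MultiDA's GAFS code; it assumes the Global Optimization Toolbox for ga and the Statistics Toolbox for crossval and classify):

```matlab
% GA feature selection: each chromosome is a bit mask over the variables;
% the fitness to minimize is the cross-validated LDA misclassification rate.
opts = optimoptions('ga', 'PopulationType', 'bitstring', ...
                    'PopulationSize', 50, 'MaxGenerations', 100);
nVar = size(X, 2);
mask = ga(@(m) maskError(m, X, Y), nVar, [], [], [], [], [], [], [], opts);
selected = find(mask)                     % indices of the retained variables

function err = maskError(mask, X, Y)
    mask = logical(mask);
    if ~any(mask), err = 1; return; end   % penalize the empty subset
    predfun = @(Xtr, Ytr, Xte) classify(Xte, Xtr, Ytr, 'linear');
    err = crossval('mcr', X(:, mask), Y, 'Predfun', predfun, 'KFold', 5);
end
```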
3.4. Supervised learning methods

MultiDA provides the following algorithms for supervised analysis: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), Mahalanobis discriminant analysis (MDA), CDA, ULDA and PLS.
3.4.1. Discriminant analysis

LDA, QDA and MDA are invoked with the [Classification] button. A prior probability must be selected before the classification analysis; MultiDA supplies two kinds, "All Groups Equal" and "Weight by Group Size". After the classification analysis, the territorial map and tree plot can be output. CDA is related to PCA and canonical correlation. The original CDA program by Trujillo-Ortiz was obtained from the Matlab File Exchange [17] and revised.
Leave-one-out (LOO) and k-fold cross-validation are available for evaluating the discriminant algorithms. PCA-DA is also available for tackling high-dimensional data.
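A minimal sketch of PCA-DA with leave-one-out cross-validation (assuming the Statistics Toolbox; the five retained components are an arbitrary illustration, and the implicit expansion needs R2016b or later):

```matlab
% PCA-DA with LOO: project standardized data onto the first k principal
% components, then classify with LDA, leaving one sample out at a time.
k = 5;                                  % number of PCs (illustrative)
n = size(X, 1);
pred = zeros(n, 1);
for i = 1:n
    train = setdiff(1:n, i);
    mu = mean(X(train, :));  sd = std(X(train, :));
    Ztr = (X(train, :) - mu) ./ sd;     % standardize on the training set only
    [coeff, score] = pca(Ztr);
    Zte = ((X(i, :) - mu) ./ sd) * coeff(:, 1:k);
    pred(i) = classify(Zte, score(:, 1:k), Y(train), 'linear');
end
looError = mean(pred ~= Y);             % LOO misclassification rate
```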
3.4.2. Partial least squares analysis

Partial least squares (PLS) is a commonly used method for modeling the relations between multiple independent variables and one (PLS1) or several (PLS2) dependent variables. MultiDA provides the Kernel-PLS, NIPALS and SIMPLS algorithms for PLS. [NIPALS] and [SIMPLS] invoke a dialog for choosing the number of components; at the same time a figure is displayed showing how AIC and RMSE vary with the number of components. As with PCA, the scores and loadings plots are easily output for PLS. LOO and k-fold cross-validation are also available, as in discriminant analysis.
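For reference, Matlab's own plsregress (Statistics Toolbox) implements SIMPLS; a minimal sketch of picking the number of components from the 10-fold cross-validated error (the 15-component cap is an arbitrary illustration):

```matlab
% Choose the number of PLS components by cross-validated RMSE.
Ydummy = dummyvar(Y);                        % PLS2 coding of the class labels
[~, ~, ~, ~, ~, ~, MSE] = plsregress(zscore(X), Ydummy, 15, 'cv', 10);
rmse = sqrt(MSE(2, 2:end));                  % row 2: response error; col 1 is 0 components
[bestRmse, nComp] = min(rmse);
fprintf('Best: %d components, RMSE = %.4f\n', nComp, bestRmse);
plot(1:numel(rmse), rmse, '-o'); xlabel('Components'); ylabel('RMSECV');
```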
3.4.3. Multi-comparison of supervised learning methods

Table 4 displays the multi-model comparison of the different supervised learning methods in terms of 10-fold cross-validated RMSEP and RMSER. Both CDA and ULDA gave perfect recognition and prediction (zero error). Multi-model comparison therefore shows that ULDA and CDA are the most suitable methods for analyzing this wine data set.
3.5. Unsupervised learning methods

Cluster analysis and PCA are the two most important unsupervised learning methods. MultiDA offers three methods for cluster analysis, namely k-means ([KMeans]), k-medoid ([KMedoid]) and hierarchical clustering ([HierachicalClustering]), and three methods for PCA: classical PCA, robust PCA and NLPCA.
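A minimal sketch of this kind of comparison in plain Matlab (assuming the Statistics Toolbox): cluster the standardized data with k-means and display the cluster labels on the PCA scores plot, as in Fig. 7.

```matlab
% k-means clustering of the standardized wine data, visualized on the
% first two principal component scores (cf. Fig. 7).
Z = zscore(X);                          % standardize first (see 3.2.1)
idx = kmeans(Z, 3, 'Replicates', 10);   % 3 clusters, restarts avoid local minima
[~, score, ~, ~, explained] = pca(Z);
gscatter(score(:, 1), score(:, 2), idx);
xlabel(sprintf('PC1 (%.1f%%)', explained(1)));
ylabel(sprintf('PC2 (%.1f%%)', explained(2)));
```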
Acknowledgements

This work is supported by the National Natural Science Foundation of China (90709014).
Table 3
Multi-model comparison of different feature selection methods.

Method            Co-variables        RMSEP     RMSER     BIC
Normal            1–13                0.0618    0.0546    1.9601
Weight analysis   1, 7, 10–13         0.0278    0.0305    15.0633
GA NIPALS         1, 2, 4, 7, 9–13    0.0562    0.0431    9.1189
CART              5, 7, 11–13         0.0781    0.0614    10.8625
NIPALS weight     1, 4, 5, 13         0.2245    0.1327    6.1949
                  1–4, 7, 10–13       0.0451    0.0377    8.2655
Stepwise          1, 7, 13            0.0614    0.0584    14.4213
Table 4
Comparison of different supervised learning methods.

Method    RMSEP-10CV    RMSER-10CV
LDA       0.0056        0
QDA       0.0056        0.0047
MDA       0.0222        0
CDA       0             0
ULDA      0             0
NIPALS    0.0448        0.0322
SIMPLS    0.0448        0.0273

10CV: 10-fold cross-validation.
Fig. 7. PCA scores plots grouped by the three cluster methods and by the raw groups: a) PCA scores plot grouped by k-means cluster analysis, b) PCA scores plot grouped by the raw groups, c) PCA scores plot grouped by hierarchical cluster analysis, d) PCA scores plot grouped by k-medoid cluster analysis. The x and y axes indicate principal component scores 1 and 2 with the explained variance in all four figures; star, circle and square represent the three different groups.
Fig. 8. Outline of the non-linear PCA GUI: a) NLPCA GUI profile. The number of nodes in the three hidden layers, the number of epochs, the error, the learning rate and the display interval are adjustable. b) Scores plot of the wine data processed by NLPCA with a 20-3-20 hidden layer structure. The x and y axes are the outputs of the bottleneck-layer nodes, respectively; star, circle and square represent the three different groups.
References
[1] K. Varmuza, P. Filzmoser, Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press, Boca Raton, 2009.
[2] M. Andersson, A comparison of nine PLS1 algorithms, Journal of Chemometrics 23 (2009) 518–529.
[3] Matlab, The MathWorks, Inc., Natick, MA (USA), http://www.mathworks.com.
[4] R.A. Viscarra Rossel, ParLeS: software for chemometric analysis of spectroscopic data, Chemometrics and Intelligent Laboratory Systems 90 (2008) 72–83.
[18] H.W. Wang, Partial Least-Squares Regression Method and Applications, National Defense Industry Press, Beijing, 1994.
[19] D. Yuan, Y. Liang, L. Yi, Q. Xu, O. Kvalheim, Uncorrelated linear discriminant analysis (ULDA): a powerful tool for exploration of metabolomics data, Chemometrics and Intelligent Laboratory Systems 93 (2008) 70–79.
[20] Y. Xu, J.Y. Yang, Z. Jin, A novel method for Fisher discriminant analysis, Pattern Recognition 37 (2004) 381–384.
[21] F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS, Journal of Chemometrics 7 (1993) 45–49.
[22] H. Wold, Nonlinear estimation by iterative least squares procedures, in: Research Papers in Statistics, Wiley, New York, 1966, pp. 411–444.
[23] P. Geladi, B.R. Kowalski, Partial least squares regression: a tutorial, Analytica Chimica Acta 185 (1986) 1–17.
[24] S. de Jong, SIMPLS: an alternative approach to partial least squares regression, Chemometrics and Intelligent Laboratory Systems 18 (1993) 251–263.
[25] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, 2010.
[26] S. Yang, Y. Lee, Identification of a multivariate outlier, Annual Meeting of the American Statistical Association, San Francisco, 1987.