\hat{x}(f) = \int_{-\infty}^{\infty} x(t) e^{-i 2\pi f t} dt    (1)
This requires the integral to be calculated over all time in a continuous manner. Out of practical necessity, however, one can only obtain the amplitude at discrete intervals for a finite period of time.
Table 2 Digital Communication Interfaces

Interface               Maximum transmission speed   Maximum cable length                        Maximum number of devices
Serial                  64 kbps                      10 ft                                       2
Parallel                50-100 kbytes/s              9-12 ft                                     2
USB 2.0                 480 Mbits/s                  5 m segments, with a maximum of six         127
                                                     segments between device and host
IEEE-488                1 Mbyte/s                    20 m (2 m per device)                       15
Twisted-pair Ethernet   1000 Mbps                    82 ft (329 ft at 100 Mbps)                  254 per subnet using TCP/IP
SCSI                    160 Mbytes/s                 6 m(a)                                      16

(a) Single-ended cable; differential cables can be up to 25 m long.
Several approximations are made in order to calculate the transform digitally, yielding the discrete Fourier transform:

\hat{x}(k \Delta f) = \frac{1}{N} \sum_{n=0}^{N-1} x(n) e^{-i 2\pi n k / N}    (2)
where N is the number of points sampled. This allows the signal to be represented in the frequency domain, as in Fig. 3.
The function assumes that every component of the input signal is periodic, that is, every component of the signal has an exact whole number of periods within the timeframe being studied. If not, discontinuities develop at the beginning and ending border conditions, resulting in a distortion of the frequency response known as leakage. In this phenomenon, part of the response from the true frequency band is attributed to neighboring frequency bands. This artificially broadens the signal response over larger frequency bands, which can obscure smaller-amplitude frequency signals. A technique known as windowing is used to reduce the signal amplitude to zero at the beginning and end of the time record (band-pass filter). This eliminates the discontinuities at the boundary time points, thus greatly reducing the leakage.
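As an illustration of Eq. (2) and the effect of windowing, the following sketch (assuming Python with NumPy and a synthetic sine wave whose period does not fit the record exactly) compares the out-of-band leakage of a plain DFT with that of a Hann-windowed DFT; the signal, sampling rate, and window choice are illustrative assumptions only.

```python
import numpy as np

# Synthetic record: a 52.3 Hz sine sampled at 1 kHz does not contain a whole
# number of periods in 1024 points, so the plain DFT leaks into nearby bands.
fs = 1000.0
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 52.3 * t)

X_raw = np.fft.rfft(x) / len(x)                        # discrete transform, Eq. (2)
X_win = np.fft.rfft(x * np.hanning(len(x))) / len(x)   # Hann-windowed transform

freqs = np.fft.rfftfreq(len(x), d=1 / fs)
far = np.abs(freqs - 52.3) > 20                        # bands far from the true peak
print(np.sum(np.abs(X_raw[far]) ** 2), np.sum(np.abs(X_win[far]) ** 2))
```

The energy printed for the windowed spectrum far from the true frequency is much smaller, which is the reduction in leakage described above.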
V. DATA ANALYSIS/CHEMOMETRICS
Besides their roles in controlling instruments and collecting and storing data, computers play a critical role in the computations and data processing needed for solving chemical problems. A good example is multivariate data analysis in analytical chemistry. The power of multivariate data analysis combined with modern analytical instruments is best demonstrated in areas where samples have to be analyzed as is and relevant information must be extracted from interference-laden data. These cases include characterization of chemical reactions, exploration of the relationship between the properties of a chemical and its structural and functional groups, fast identification of chemical and biological agents, monitoring and control of chemical processes, and much more. The multivariate data analysis technique for chemical data, also known as chemometrics, relies heavily on the capabilities of computers, because many chemometrics algorithms are computationally intensive and the data they are designed to analyze are usually very large. Fortunately, advances in computer technology have largely eliminated the performance issues that chemometrics applications faced in the early days due to limitations in computer speed and memory.
Chemometrics is the term given to a discipline that uses mathematical, statistical, and other logic-based methods to find and validate relationships between chemical data sets, to provide maximum relevant chemical information, and to design or select optimal measurement procedures and experiments. It covers many areas, from traditional statistical data evaluation to multivariate calibration, multivariate curve resolution, pattern recognition, experimental design, signal processing, neural networks, and more. As chemical problems get more complicated and more sophisticated mathematical tools become available to chemists, this list will certainly grow. It is not possible to cover all these areas in a short chapter like this. Therefore we chose to focus on core areas. The first is multivariate calibration, considered the centerpiece of chemometrics for its vast applications in modern analytical chemistry and the great amount of research done since the coinage of the discipline. The second area is pattern recognition. If multivariate calibration deals with predominantly quantitative problems, pattern recognition represents the other side of chemometrics: qualitative techniques that answer questions such as "Are the samples different?" and "How are they related?" by separating, clustering, and categorizing data. We hope that in this way readers can get a relatively full picture of chemometrics from such a short article.
A. Multivariate Calibration
1. General Introduction
In science and technology, establishing quantitative relationships between two or more measurement data sets is a basic activity. This activity, also known as calibration, is a process of finding the transformation that relates one data set to others that carry explicit information. A simple example is the calibration of a pH meter. After reading three standard solutions, the electrical voltages from the electrode are compared with the pH values of the standard solutions and a mathematical relation is defined. This mathematical relationship, often referred to as a calibration curve, is used to transform the electrode voltages into pH values when measuring new samples. This type of calibration is univariate in nature: it simply relates a single variable, voltage, to the pH values. One of two conditions must be met in order to have accurate predictions with a univariate calibration curve. Either the measurement must be highly selective, that is, the electrode responds to pH changes and nothing else, or the interferences that can cause changes in the electrode response must be removed from the sample matrix and/or the measurement process. The latter approach is found in chromatographic analyses, where the components in a sample are separated and individually detected. At each point of interest within a chromatographic analysis, the sample is pure and the detector is responding to the component of interest only. Univariate calibration works
perfectly here to relate the chromatographic peak heights or areas to the concentrations of the samples. On the other side of the measurement world, measurement objects are not preprocessed or purified to eliminate things that can interfere with the measurement. Univariate calibration can then suffer from erroneous instrument data, and, worse yet, there is no way for the analyst to tell whether he or she is getting correct results. This shortcoming of univariate calibration, which is referred to as zero-order calibration in tensor analysis because of its scalar data nature, has seriously limited its use in modern analytical chemistry.
One way to address the problem with univariate calibration is to use more measurement information in establishing the transformation that relates measurement data with the reference data (data that have more explicit information). Modern analytical techniques such as optical spectroscopy, mass spectrometry, and NMR deliver multiple outputs with a single measurement, providing an opportunity to overcome the shortcomings of univariate calibration. Calibration involving multiple variables is called multivariate calibration, which has been at the core of chemometrics from the very beginning. The major part of this section focuses on the discussion of multivariate calibration techniques.
The capability of multivariate calibration to deal with interferences rests on two bases: (1) the unique pattern in the measurement data (i.e., spectra) for each component of interest and (2) independent concentration variation of the components in the calibration standard set. Let us consider the data from a multivariate instrument, say, an optical spectrometer. The spectrum from each measurement is represented by a vector x = [x_1, x_2, ..., x_n], which carries unique spectral responses for the component(s) of interest and interference from the sample matrix. Measurement of a calibration set with m standards generates a data matrix X with m rows and n columns, with each row representing a spectrum. The concentration data matrix Y for p components of interest will have m rows and p columns, with each row containing the concentrations of the components of interest in a particular sample. The relationship between X (measurement) and Y (known values) can be described by the following equation:
Y = XB + E    (3)
The purpose of calibration is to find the transformation matrix B and evaluate the error matrix E. Using linear regression, the transformation matrix B, or the regression coefficient matrix as it is also called, can be found as:

B = (X'X)^{-1} X'Y    (4)
where X' represents the transpose of X. The inversion of the square matrix X'X is a critical step in multivariate calibration, and the method of inversion essentially differentiates the techniques of multivariate calibration. The following sections discuss the most commonly used methods.
2. Multiple Linear Regression
Multiple linear regression (MLR) is the simplest multivariate calibration method. In this method, the transformation matrix B [Eq. (4)] is calculated by direct inversion of X'X from the measurement matrix X. Doing so requires the matrix X to be of full rank; in other words, the variables in X must be independent of each other. If this condition is met, the transformation matrix B can be calculated using Eq. (4) without carrying over significant errors. In the prediction step, B is applied to the spectrum of an unknown sample to calculate the properties of interest:

\hat{y}_i = x_i'(X'X)^{-1} X'Y    (5)
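The following minimal sketch illustrates Eqs. (4) and (5) with NumPy on simulated data; the matrix names and the random calibration set are illustrative assumptions, not part of the original text. Solving the normal equations with a linear solver rather than forming an explicit inverse is numerically preferable but mathematically equivalent to Eq. (4).

```python
import numpy as np

# Simulated calibration set: X (20 standards x 3 independent variables) and
# Y (20 standards x 2 component concentrations); names are illustrative.
rng = np.random.default_rng(0)
B_true = rng.normal(size=(3, 2))
X = rng.normal(size=(20, 3))
Y = X @ B_true + 0.01 * rng.normal(size=(20, 2))

# Calibration step, Eq. (4): B = (X'X)^-1 X'Y, solved without forming the inverse
B = np.linalg.solve(X.T @ X, X.T @ Y)

# Prediction step, Eq. (5): apply B to the spectrum of an "unknown" sample
x_unknown = rng.normal(size=(1, 3))
print(x_unknown @ B)
```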
The most important consideration in MLR is to ensure the independence of the variables in X. If the variables are linearly dependent, that is, at least one of the columns can be written as an approximate or exact linear combination of the others, the matrix X is called collinear. In the case of collinearity, some elements of B from the least squares fit have large variance and the whole transformation matrix loses its stability. Therefore, collinearity in X is a serious problem for MLR and limits its use.
To better understand the problem, consider an example in optical spectroscopy. If one wants to measure the concentration of two species in a sample, one has to hope that the two species have distinct spectral features so that it is possible to find a spectral peak corresponding to each of them. In some cases, one would also like to add a peak for correcting variations such as spectral baseline drift. As a result, after measuring n standard samples one would have an X matrix with n rows and three columns (variables). To ensure independence of the X variables, the spectral peaks used for measuring the components of interest have to be reasonably separated. In some cases, such as in FTIR and Raman spectroscopy, whose fingerprint regions are powerful in differentiating chemical compounds, this might be possible. MLR is accurate and reliable when the no-collinearity requirement is met. In other cases, however, finding reasonably separated peaks is not possible. This is often true in near-infrared (NIR) and UV/Vis spectroscopy, which lack the capability to differentiate between chemicals. The NIR and UV/Vis peaks are broad, often overlapping, and the spectra of different chemicals can look very similar. This makes choosing independent variables (peaks) very difficult. In cases where the variables are collinear, using MLR can be problematic. This is probably one of the reasons that MLR is less used in NIR and UV/Vis applications.
For the same collinearity reason, one should also be aware of the problems brought about by redundancy in selecting variables for the X matrix. Having more variables to describe a property of interest (e.g., concentration) may generate a better fit in the calibration step. However, this can be misleading. The large variance in the transformation matrix caused by collinearity will ultimately harm the prediction performance of the calibration model. It is not uncommon to see the mistake of using an excessive number of variables to establish a calibration model. Such a calibration model can be inaccurate in predicting unknown samples and is sensitive to minor variations.
3. Factor Analysis Based Calibration
In their book, Martens and Naes listed the problems encountered in dealing with complex chemical analysis data using traditional calibration methods such as univariate calibration:
1. Lack of selectivity: No single X-variable is sufficient to predict Y (the property matrix). To attain selectivity one must use several X-variables.
2. Collinearity: There may be redundancy and hence collinearity in X. A method that transforms correlated variables into independent ones is needed.
3. Lack of knowledge: Our a priori understanding of the mechanisms behind the data may be incomplete or wrong. Calibration models will fail when new variations or constituents unaccounted for by the calibration occur in samples. One wishes at least to have a method to detect outliers, and further to improve the calibration technology so that this kind of problem can be solved.
Factor analysis based calibration methods have been developed to deal with these problems. Due to space limits, we discuss only the two most popular methods, principal component regression (PCR) and partial least squares (PLS).
Principal Component Regression
The basis of PCR is principal component analysis (PCA), which computes so-called principal components to describe the variation in the matrix X. In PCA, the main variation in X = {x_k, k = 1, 2, ..., K} is represented by a smaller number of variables T = {t_1, ..., t_A} (A < K). T represents the principal components computed from X. The principal components are calculated by finding the first loading vector u_1 that maximizes the variance of u_1'x and satisfies u_1'X'Xu_1 = t_1't_1, where t_1 = Xu_1 is the corresponding score vector. The next principal component is calculated in the same way, but with the restriction that t_1 and t_2 are orthogonal (t_1't_2 = 0). The procedure continues under this restriction until it reaches the dimension limit of the matrix X. Figure 4 may help in understanding the relationship between the original variables x and the principal components t.
Consider a sample set measured with three variables x_1, x_2, and x_3. Each sample is represented by a dot in the coordinate system formed by x_1, x_2, and x_3. What PCA does is to find the first principal component (t_1), which points in the direction of largest variation in the data set; then the second principal component (t_2), capturing the second largest variation and orthogonal to t_1; and finally the third principal component (t_3), which describes the remaining variation and is orthogonal to t_1 and t_2. From the figure it is clear that the principal components replace x_1, x_2, and x_3 to form a new coordinate system, that these principal components are independent of each other, and that they are arranged in descending order in terms of the amount of variance they describe. In any data set gathered from reasonably well-designed and well-measured experiments, useful information is stronger than noise. Therefore it is fair to expect that the first several principal components mainly contain the useful information, and that the later ones are dominated by noise. Users can conveniently keep the first several principal components for use in calibration and discard the rest. Thus, useful information is kept while the noise is thrown out. Through PCA, the original matrix X is decomposed into three matrices: V consists of the normalized score vectors, U is the loading matrix, and S is a diagonal matrix containing the singular values that result from normalizing the score vectors.
X = VSU'    (6)
As just mentioned, X can be approximated by using the first several significant principal components:

\tilde{X} = V_A S_A U_A'    (7)
where V_A, S_A, and U_A are subsets of V, S, and U, respectively, formed by the first A principal components. \tilde{X} is a close approximation of X, with the minor variance removed by discarding the principal components after A.
Figure 4 PCA illustrated for three x-variables with three principal components (factors).
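A minimal sketch of Eqs. (6) and (7) using NumPy's singular value decomposition is given below; the synthetic rank-two data set is an assumption for illustration, and mean-centering (normally done first) is omitted for brevity. Note that NumPy's output ordering matches the notation here: its left singular vectors play the role of V (normalized scores) and its right singular vectors the role of U (loadings).

```python
import numpy as np

# Synthetic data: 30 samples, 100 variables, driven by two underlying factors.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 100))
X += 0.01 * rng.normal(size=X.shape)              # small noise

# Eq. (6): X = V S U'.  NumPy returns V, the singular values s, and U'.
V, s, Ut = np.linalg.svd(X, full_matrices=False)

A = 2                                             # factors retained
X_tilde = V[:, :A] @ np.diag(s[:A]) @ Ut[:A, :]   # Eq. (7)
print(np.linalg.norm(X - X_tilde) / np.linalg.norm(X))   # small relative residual
```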
Principal components are used both in qualitative interpretation of data and in regression to establish quantitative calibration models. In qualitative data interpretation, the so-called score plot is often used. The element v_ij in V is the projection of the ith sample on the jth principal component. Therefore each sample will have a unique position in a space defined by the score vectors, if it is unique in the measured data X. Figure 5 illustrates the use of a score plot to visually identify amorphous samples from crystalline samples measured with NIR spectroscopy. The original NIR spectra show some differences between amorphous and crystalline samples, but they are subtle and complex to the human eye. PCA and the score plot present the differences in a much simpler and more straightforward way. In the figure, the circular dots are samples used as standards to establish the score space. New samples (triangular and star dots) are projected into the space and grouped according to their crystallinity. It is clear that the new samples differ: some are crystalline, so they fall into the crystalline circle. The amorphous samples are left outside of the circle because of the abnormalities that show up in their NIR spectra. Based on the deviations of the crystalline sample dots on each principal component axis, it is possible to calculate a statistical boundary to automatically detect amorphous samples. This kind of scheme is the basis of outlier detection by factor analysis based calibration methods such as PCA and PLS.
Figure 5 PCA applied to identify sample crystallinity.
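A hedged sketch of such a statistical boundary is shown below, assuming NumPy and synthetic calibration data: scores of new samples are compared against a simple three-standard-deviation limit on each retained principal component axis. A production implementation would typically use a statistically derived limit (for example, Hotelling's T-squared); the fixed 3-sigma rule and sample values here are illustrative only.

```python
import numpy as np

# Calibration ("crystalline") samples living mostly in a two-factor subspace.
rng = np.random.default_rng(2)
loadings = rng.normal(size=(2, 40))
X_cal = rng.normal(size=(50, 2)) @ loadings + 0.02 * rng.normal(size=(50, 40))

mean = X_cal.mean(axis=0)
_, _, Ut = np.linalg.svd(X_cal - mean, full_matrices=False)
P = Ut[:2].T                                    # loadings of the first two PCs
scores_cal = (X_cal - mean) @ P
limits = 3 * scores_cal.std(axis=0)             # simple per-axis boundary

X_new = np.vstack([mean,                        # a perfectly "typical" sample
                   np.array([5.0, 5.0]) @ loadings])   # an extreme sample
scores_new = (X_new - mean) @ P
print(np.any(np.abs(scores_new) > limits, axis=1))     # the extreme sample is flagged
```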
PCA combined with a regression step forms PCR. PCR has been widely used by chemists for the simplicity of interpreting the data through the loading matrix and score matrix. Equation (8) shows how the regression is performed with the principal components:

b = y V_A S_A^{-1} U_A'    (8)

In this equation, y is the property vector of the calibration standard samples and b is the regression coefficient vector used in predicting the measured data x of an unknown sample:

\hat{y} = b x'    (9)
where \hat{y} is the predicted property. A very important aspect of PCR is determining the number of factors used in Eqs. (7) and (8). The optimum is to use as much of the information in X as possible while keeping the noise out. That means one needs to decide the last factor (principal component) that has useful information and discard all factors after that one. A common mistake in multivariate calibration is to use too many factors and thereby overfit the data. The extra factors can make a calibration curve look unrealistically good (the noise also gets fitted) but unstable and inaccurate when used in prediction. A rigorous validation step is necessary to avoid these kinds of mistakes. When the calibration sample set is sufficiently large, the validation samples can be randomly selected from the sample pool. If the calibration samples are limited in number, a widely used method is cross-validation within the calibration sample set. In cross-validation, a sample (or several samples) is taken out of the sample set and predicted by the calibration built on the remaining samples. The prediction errors corresponding to the number of factors used in calibration are recorded. Then the sample is put back into the sample set and another one is taken out in order to repeat the same procedure. The process continues until each sample has been left out once and predicted. The average error is calculated as a function of the number of principal components used. The formulas for the standard error of prediction, SEP (using a separate validation sample set), and the standard error of cross-validation, SECV, are slightly different:
SEP = \sqrt{ \frac{\sum_i (y_i - \hat{y}_i)^2}{n} }    (10)

SECV = \sqrt{ \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - A} }    (11)
where \hat{y}_i is the model-predicted value, y_i is the reference value for sample i, n is the total number of samples used in calibration, and A is the number of principal components used. When the errors are plotted against the number of factors used in calibration, they typically look like the curve illustrated in Fig. 6. As the principal components (factors) are added into the calibration one at a time, SEP or SECV decreases, hits a minimum, and then bounces back. The reason is that the first several principal components contain information about the samples and are needed to improve the accuracy of the calibration model. The later principal components, on the other hand, are dominated by noise. Using them makes the model sensitive to irrelevant variations in the data, thus becoming less accurate and potentially more vulnerable to process variations and instrument drift. There is clearly an optimal number of principal components for each calibration model. One of the major tasks in multivariate calibration is to find that optimum, which keeps the calibration model simple (a small number of principal components) while achieving the highest accuracy possible.
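A leave-one-out cross-validation sketch in the spirit of Eq. (11) is shown below, assuming NumPy and PCR models built via the singular value decomposition; the data are synthetic and mean-centering is omitted for brevity, so the function names and numbers are illustrative only.

```python
import numpy as np

def pcr_b(X, y, A):
    """Regression vector for A principal components, per Eq. (8)."""
    V, s, Ut = np.linalg.svd(X, full_matrices=False)
    return y @ V[:, :A] @ np.diag(1.0 / s[:A]) @ Ut[:A, :]

def secv(X, y, A):
    """Leave-one-out standard error of cross-validation, per Eq. (11)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                 # leave sample i out
        b = pcr_b(X[keep], y[keep], A)
        press += (y[i] - X[i] @ b) ** 2
    return np.sqrt(press / (n - A))

# Synthetic example: SECV versus number of factors; pick the minimum.
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 50))
y = X @ (0.1 * rng.normal(size=50)) + 0.05 * rng.normal(size=25)
errors = {A: secv(X, y, A) for A in range(1, 11)}
print(min(errors, key=errors.get), errors)
```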
Partial Least Squares
PLS is another multivariate calibration method that uses principal components rather than the original X-variables. It differs from PCR by using the y-variables actively during the decomposition of X. In PLS, the principal components are not calculated along the direction of largest variation in X at each iteration step. They are calculated by balancing the information from the X and y matrices to best describe the information in y. The rationale behind PLS is that, in some cases, some variations in X, although significant, may not be related to y at all. Thus, it makes sense to calculate principal components more relevant to y, not just to X. Because of this, PLS may yield simpler models than PCR.
Unlike PCR, the PLS decomposition of the measurement data X involves the property vector y. A loading weight vector is calculated for each loading of X to ensure that the loadings are related to the property data y. Furthermore, the property data y are not directly used in the calibration. Instead, their loadings are also calculated and used together with the X loadings to obtain the transformation vector b:

b = W(P'W)^{-1} q    (12)

where W is the loading weight matrix, P is the loading matrix for X, and q is the loading vector for y. Martens and Naes give detailed procedures for PLS in their book.
In many cases, PCR and PLS yield similar results. However, because the PLS factors are calculated utilizing both the X and y data, PLS can sometimes give useful results from low-precision X data where PCR may fail. For the same reason, PLS has a stronger tendency than PCR to overfit noisy y data.
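For illustration, a compact PLS1 sketch (one y variable) in the NIPALS style is given below, ending with the regression vector of Eq. (12). It assumes NumPy, mean-centered synthetic data, and a single response; it is not the exact procedure of Martens and Naes, only a commonly used simplified form.

```python
import numpy as np

def pls1_nipals(X, y, A):
    """Compact PLS1 sketch (single y, A factors); X and y must be mean-centered."""
    n, p = X.shape
    W, P, q = np.zeros((p, A)), np.zeros((p, A)), np.zeros(A)
    Xr, yr = X.copy(), y.copy()
    for a in range(A):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)           # loading weight vector
        t = Xr @ w                       # score vector
        tt = t @ t
        P[:, a] = Xr.T @ t / tt          # X loading
        q[a] = yr @ t / tt               # y loading
        Xr = Xr - np.outer(t, P[:, a])   # deflate X
        yr = yr - t * q[a]               # deflate y
        W[:, a] = w
    return W @ np.linalg.solve(P.T @ W, q)   # regression vector b, Eq. (12)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 80))
y = X[:, :5].sum(axis=1) + 0.05 * rng.normal(size=30)
Xc, yc = X - X.mean(axis=0), y - y.mean()
b = pls1_nipals(Xc, yc, A=3)
print(np.corrcoef(y, Xc @ b + y.mean())[0, 1])   # correlation of fit
```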
B. Pattern Recognition
Modern analytical chemistry is data rich. Instruments such as mass spectrometers, optical spectrometers, NMR spectrometers, and many hyphenated instruments generate a lot of data for each sample. However, data rich does not mean information rich. Converting the data into useful information is the task of chemometrics. Here, pattern recognition plays an especially important role in exploring, interpreting, and understanding the complex nature of multivariate relationships. Since the tool was first used on chemical data by Jurs et al. in 1969, many new applications have been published, including several books containing articles on this subject. Based on Lavine's review, we will briefly discuss here four main subdivisions of pattern recognition methodology: (1) mapping and display, (2) clustering, (3) discriminant development, and (4) modeling.
1. Mapping and Display
When there are two to three types of samples, mapping and display is an easy way to visually inspect the relationships between the samples. For example, samples can be plotted in a 2D or 3D coordinate system formed by the variables describing the samples. Each sample is represented by a dot on the plot. The distribution and grouping of the samples reveal the relationships between them. The frequently encountered problem with modern analytical data is that the number of variables needed to describe a sample is often far too large for this simple approach. An ordinary person cannot handle a coordinate system with more than three dimensions. The data from an instrument, however, can comprise hundreds or even thousands of variables. To utilize all of the information carried by so many variables, factor analysis methods can be used to compress the dimensionality of the data set and eliminate collinearity between the variables. In the last section, we discussed the use of principal components in multivariate data analysis (PCA). The plot generated by the principal components (factors) is exactly the same as the plots used in the mapping and display method. The orthogonal nature of principal components allows convenient evaluation of the factors affecting samples based on their positions in the principal component plot.
Figure 6 SECV plotted against the number of principal components (factors) used in calibration.
The distance between samples, or from a sample to the centroid of a group, provides a quantitative measure of the degree of similarity of the sample to others. The most frequently used measures are the Euclidean distance and the Mahalanobis distance. The Euclidean distance is expressed as:
D_E = \sqrt{ \sum_{j=1}^{n} (x_{Kj} - x_{Lj})^2 }    (13)
where x_{Kj} and x_{Lj} are the jth coordinates of samples K and L, respectively, and n is the total number of coordinates. The Mahalanobis distance is calculated by the following equation:
D_M^2 = (x_L - \bar{x}_K)' C_K^{-1} (x_L - \bar{x}_K)    (14)
where x_L and \bar{x}_K are, respectively, the data vector of sample L and the mean data vector for class K, and C_K is the covariance matrix of class K. The Euclidean distance is simply the geometric distance between samples. It does not consider the collinearity between the variables that form the coordinate system. If variables x_1 and x_2 are independent, the Euclidean distance is not affected by the position of the sample in the coordinate system and truly reflects the similarity between the samples or sample groups. When the variables are correlated, this may not be true. The Mahalanobis distance takes this problem into account by including a factor of correlation (or covariance).
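The two distance measures can be computed directly, as in the following NumPy sketch; the class data and the sample vector are hypothetical.

```python
import numpy as np

def euclidean_distance(x_k, x_l):
    """Eq. (13): geometric distance between two sample vectors."""
    return np.sqrt(np.sum((x_k - x_l) ** 2))

def mahalanobis_distance_sq(x_l, class_K):
    """Eq. (14): squared distance of x_l from class K (rows of class_K = samples)."""
    x_bar = class_K.mean(axis=0)
    C_inv = np.linalg.inv(np.cov(class_K, rowvar=False))
    d = x_l - x_bar
    return d @ C_inv @ d

# Hypothetical class with two correlated variables, and one new sample.
rng = np.random.default_rng(5)
class_K = rng.normal(size=(50, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])
sample = np.array([1.5, 1.5])
print(euclidean_distance(sample, class_K.mean(axis=0)),
      mahalanobis_distance_sq(sample, class_K))
```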
2. Clustering
Clustering methods are based on the principle that the distance between pairs of points (i.e., samples) in the measurement space is inversely related to their degree of similarity. There are several types of clustering algorithms using distance measurements. The most popular one is called hierarchical clustering. The first step in this algorithm is to calculate the distances between all pairs of points (samples). The two points having the smallest distance are paired and replaced by a new point located midway between the two original points. Then the distance calculation starts again with the new data set. Another new point is generated between the two data points having the minimal distance and replaces the original data points. This process continues until all data points have been linked.
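A naive sketch of this midpoint-merging procedure is shown below, assuming NumPy and a small synthetic data set; practical implementations use more efficient linkage algorithms, so this is only a literal illustration of the steps described above.

```python
import numpy as np

def hierarchical_clustering(points):
    """Literal sketch: repeatedly merge the closest pair of points into their
    midpoint, recording which original samples are linked at what distance."""
    pts = [p.astype(float) for p in points]
    members = [[i] for i in range(len(points))]
    links = []
    while len(pts) > 1:
        best, best_d = (0, 1), np.inf
        for i in range(len(pts)):                 # closest pair by Euclidean distance
            for j in range(i + 1, len(pts)):
                d = np.linalg.norm(pts[i] - pts[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        links.append((members[i], members[j], best_d))
        new_pt = (pts[i] + pts[j]) / 2.0          # replacement point at the midpoint
        new_members = members[i] + members[j]
        pts = [p for k, p in enumerate(pts) if k not in (i, j)] + [new_pt]
        members = [m for k, m in enumerate(members) if k not in (i, j)] + [new_members]
    return links

rng = np.random.default_rng(6)
data = np.vstack([rng.normal(0, 0.1, (3, 2)), rng.normal(2, 0.1, (3, 2))])
for left, right, dist in hierarchical_clustering(data):
    print(left, right, round(dist, 3))
```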
3. Classification
Both mapping and display and clustering belong to the unsupervised pattern recognition techniques: no information about the samples other than the measured data is used in the analyses. In chemistry, there are cases where a classification rule has to be developed to predict unknown samples. Development of such a rule needs training datasets whose class memberships are known. This is called supervised pattern recognition, because the knowledge of the class membership of the training sets is used in the development of discriminant functions. The most popular methods used in solving chemistry problems include the linear learning machine and the adaptive least squares (ALS) algorithm. For two classes separated in a symmetric manner, a linear line (or surface) can be found to divide the two classes (Fig. 7). Such a discriminant function can be expressed as:
D = wx'    (15)
where w is called the weight vector, w = {w_1, w_2, ..., w_{n+1}}, and x = {x_1, x_2, ..., x_{n+1}} is the pattern vector, whose elements can be the measurement variables or principal component scores. Establishing the discriminant function amounts to determining the weight vector w under the restraint that it provides the best classification (the most correct Ds) for the two classes. The method is usually iterative: error correction, or negative feedback, is used to adjust w until it gives the best separation of the classes. The samples in the training set are checked one at a time by the discriminant function. If the classification is correct, w is kept unchanged and the program moves to the next sample. If the classification is incorrect, w is altered so that correct classification is obtained. The altered w is then used in the subsequent steps until the program has gone through all samples in the training set. The altered w is defined as:

w_{new} = w - \frac{2 s_i x_i}{x_i' x_i}    (16)
Figure 7 Example of a linear discriminant function separating two classes.
where w_{new} is the altered weight vector, s_i is the discriminant score for the misclassified sample i, and x_i is the pattern vector of sample i. In situations where separation cannot be achieved well by a simple linear function, ALS can be used. In ALS, w is obtained using least squares:

w = (X'X)^{-1} X' f    (17)
where f is called the forcing vector, containing a forcing factor f_i for each sample i. When the classification of sample i is correct, f_i = s_i, where s_i is the discriminant score for sample i. If the classification is incorrect, f_i is modified according to the following equation:
f_i = s_i + \frac{0.1}{(a + d_i)^2} + b(a + d_i)    (18)
where a and b are empirically determined constants and d_i is the distance between the pattern vector and the classification surface (i.e., the discriminant score). With the corrected forcing factor, an improved weight vector w is calculated using Eq. (17) and used in the next round of discriminant score calculation. The procedure continues until favorable classification results are obtained or a preselected number of feedback iterations has been reached.
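The error-correction scheme of Eqs. (15) and (16) can be sketched as follows, assuming NumPy, two synthetic classes coded as +1/-1, and an appended bias term; the starting weights, data, and stopping rule are illustrative assumptions rather than the exact published procedure.

```python
import numpy as np

def train_linear_learning_machine(patterns, classes, max_passes=100):
    """Error-correction training per Eqs. (15)-(16); classes are coded +1/-1."""
    X = np.hstack([patterns, np.ones((len(patterns), 1))])   # append a bias term
    w = np.ones(X.shape[1])                                  # arbitrary starting weights
    for _ in range(max_passes):
        wrong = 0
        for x_i, c_i in zip(X, classes):
            s_i = w @ x_i                                    # discriminant score, Eq. (15)
            if np.sign(s_i) != c_i:                          # misclassified sample
                w = w - (2 * s_i / (x_i @ x_i)) * x_i        # reflection update, Eq. (16)
                wrong += 1
        if wrong == 0:                                       # separable: training done
            break
    return w

rng = np.random.default_rng(7)
patterns = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
                      rng.normal([3, 3], 0.5, (20, 2))])
classes = np.array([-1] * 20 + [1] * 20)
w = train_linear_learning_machine(patterns, classes)
print(np.sign(np.hstack([patterns, np.ones((40, 1))]) @ w))
```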
The nonparametric linear discriminant functions discussed earlier have limitations when dealing with classes separated in asymmetric manners. One could imagine a situation where a group of samples is surrounded by samples that do not belong to that class. There is no way for a linear classification algorithm to find a linear discriminant function that separates the class from these samples. There is apparently a need for algorithms that have enough flexibility to deal with this type of situation.
4. SIMCA
SIMCA stands for soft independent modeling of class analogy. It was developed by Wold and coworkers for dealing with asymmetric separation problems and is based on PCA. In SIMCA, PCA is performed separately on each class in the dataset. Each class is then approximated by its own principal components:

X_i = \bar{X}_i + T_{iA} P_{iA}' + E_i    (19)
where X_i (N × P) is the data matrix of class i and \bar{X}_i is the mean matrix of X_i, with each row being the mean of X_i. T_{iA} and P_{iA} are the score matrix and loading matrix, respectively, using A principal components. E_i is the residual matrix between the original data and the approximation by the principal component model.
The residual variance for the class is defined by:

S_0^2 = \frac{\sum_{i=1}^{N} \sum_{j=1}^{P} e_{ij}^2}{(P - A)(N - A - 1)}    (20)
where e_{ij} is an element of the residual matrix E_i and S_0 is the residual variance, which is a measure of the tightness of the class. A smaller S_0 indicates a more tightly distributed class.
In the classification of unknown samples, the sample data are projected onto the principal component space of each class, with the score vector calculated as:

t_{ik} = x_i P_k    (21)
where t_{ik} is the score vector of sample i in the principal component space of class k and P_k is the loading matrix of class k. With the score vector t_{ik} and loading matrix P_k, the residual vector of sample i fitting into class k can be calculated similarly to Eq. (19). The residual variance of fit for sample i is then:
S_i^2 = \frac{\sum_{j=1}^{P} e_{ij}^2}{P - A}    (22)
The residual variance of fit is compared with the residual variance of each class. If S_i is significantly larger than S_0, sample i does not belong to that class. If S_i is not significantly larger than S_0, sample i is considered a member of that class. An F-test is employed to determine whether S_i is significantly larger than S_0.
The number of principal components (factors) used for each class to calculate S_i and S_0 is determined through cross-validation. Similar to what we discussed for multivariate calibration, cross-validation in SIMCA takes one (or several) samples out of a class at a time and uses the remaining samples to calculate the residual variance of fit with different numbers of principal components. After all samples have been taken out once, the overall residual variance of fit as a function of the number of principal components used is calculated. The optimal number of principal components is the one that gives the smallest classification error.
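A minimal per-class model in the spirit of Eqs. (19)-(22) is sketched below, assuming NumPy and synthetic data; the fixed acceptance ratio stands in for a proper F-test critical value, and the class and variable names are hypothetical.

```python
import numpy as np

class SimcaClass:
    """Per-class PCA model in the spirit of Eqs. (19)-(22), with A factors."""
    def __init__(self, X_class, A):
        self.mean = X_class.mean(axis=0)
        Xc = X_class - self.mean
        _, _, Ut = np.linalg.svd(Xc, full_matrices=False)
        self.P = Ut[:A].T                              # loading matrix P_k
        N, P_vars = X_class.shape
        E = Xc - (Xc @ self.P) @ self.P.T              # residual matrix E_i, Eq. (19)
        self.s0_sq = (E ** 2).sum() / ((P_vars - A) * (N - A - 1))   # Eq. (20)
        self.A, self.P_vars = A, P_vars

    def fit_variance(self, x):
        """Residual variance of fit S_i^2 for one sample, Eqs. (21)-(22)."""
        xc = x - self.mean
        e = xc - (xc @ self.P) @ self.P.T
        return (e ** 2).sum() / (self.P_vars - self.A)

# Hypothetical class built on three factors, then one member and one outlier.
rng = np.random.default_rng(8)
loadings = rng.normal(size=(3, 20))
model = SimcaClass(rng.normal(size=(30, 3)) @ loadings
                   + 0.05 * rng.normal(size=(30, 20)), A=3)
member = rng.normal(size=3) @ loadings + 0.05 * rng.normal(size=20)
outlier = 2.0 * rng.normal(size=20)
for x in (member, outlier):
    print(model.fit_variance(x) / model.s0_sq < 3.0)   # 3.0 stands in for an F cutoff
```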
SIMCA is a powerful method for the classification of complex multivariate data. It does not require a mathematical function to define a separation line or surface. Each sample is compared with a class within that class's principal component subspace. Therefore, it is very flexible in dealing with asymmetrically separated data and with classes of different degrees of complexity. Conceptually, it is also easy to understand if one has a basic knowledge of PCA. These are the main reasons why SIMCA has become very popular among chemists.
VI. DATA ORGANIZATION AND STORAGE
The laboratory and the individual scientist can easily be overwhelmed by the sheer volume of data produced today. Very rarely can an analytical problem be answered with a single sample, let alone a single analysis. Compound
this by the number of problems or experiments that a scientist must address, and the amount of time spent organizing and summarizing the data can eclipse the time spent acquiring it. Scientific data also tend to be spread out among several different storage systems. The scientist's conclusions based on a series of experiments are often documented in formal reports. Instrument data are typically contained on printouts or in electronic files. The results of individual experiments tend to be documented in laboratory notebooks or on official forms designed for that purpose.
It is important that all of the data relevant to an experiment be captured: the sample preparation, standard preparation, and instrument parameters, as well as the significance of the sample itself. This metadata must be cross-referenced to the raw data and the final results so that they can be reproduced if necessary. It is often written in the notebook, or in many cases it is captured by the analytical instrument, stored in the data file, and printed on the report, where it cannot be easily searched. Without this information, the actual data collected by an instrument can be useless, as this information may be crucial to its interpretation.
Scientists have taken advantage of various personal productivity tools such as electronic spreadsheets, personal databases, and file storage schemes to organize and store their data. While such tools may be adequate for a single scientist, such as a graduate student working on a single project, they fail for laboratories performing large numbers of tests. It is also very difficult to use such highly configurable, nonaudited software in regulated environments. In such cases, a highly organized system of storing data, one that requires compliance with the established procedures by all of the scientific staff, is required to ensure an efficient operation.
A. Automated Data Storage
Ideally, all of the scientific data files of a laboratory would be cataloged (indexed) and stored in a central data repository. There are several commercial data management systems designed to do just this. Ideally these systems will automatically catalog the files using indexing data available in the data files themselves and then upload the files without manual intervention from the scientist. In reality, this is more difficult than it would first appear. The scientist must enter the indexing data into the scientific application, and the scientific application must support its entry. Another potential problem is the proprietary nature of most instrument vendors' data files. Even when the instrument vendors are willing to share their data formats, the sheer number of different instrument file formats makes this a daunting task. Still, with some standardization, these systems can greatly decrease the time scientists spend on mundane filing-type activities and provide a reliable archive for the laboratory's data. These systems also have the added benefit of providing the file security and audit trail functionality required in regulated laboratories on an enterprise-wide scale instead of a system-by-system basis.
However, storing the data files in a database solves only part of the archiving problem. Despite the existence of a few industry-standard file formats, most vendors use a proprietary file format, as already discussed. If the data files are saved in their native file format, they are useful only for as long as the originating application is available or a suitable viewer is developed. Rendering the data files in a neutral file format such as XML mitigates the obsolescence problem but once again requires that the file format be known. It will also generally preclude reanalyzing the data after the conversion.
B. Laboratory Information Management Systems
Analytical laboratories, especially quality control, clinical testing, and central research labs, produce large amounts of data that need to be accessed by several different groups such as customers, submitters, analysts, managers, and quality assurance personnel. Paper files require a necessarily manual process for searching results, demanding both personnel and significant amounts of time. Electronic databases are the obvious solution for storing the data so that they can be quickly retrieved as needed. As long as sufficient order is imposed on the storage of the data, large amounts of data can be retrieved and summarized almost instantaneously by all interested parties.
A database by itself, however, does not address the workflow issues that arise between the involved parties. Laboratories under regulatory oversight, such as pharmaceutical quality control, clinical, environmental control, pathology, and forensic labs, must follow strict procedures with regard to sample custody and testing reviews. Laboratory information management systems (LIMS) were developed to enforce the laboratory's workflow rules as well as store the analytical results for convenient retrieval. Everything from sample logging, workload assignments, data entry, quality assurance review, managerial approval, and report generation to invoice processing can be carefully controlled and tracked. The scope of a LIMS can vary greatly, from a simple database that stores final results and prints reports to a comprehensive data management system that includes raw data files, notebook-type entries, and standard operating procedures. The degree to which this can be done will depend upon the ability and willingness of all concerned parties to standardize their procedures. The LIMS
functions are often also event and time driven. If a sample fails to meet specifications, the system can be programmed to automatically e-mail the supervisor or log additional samples. It can also be programmed to automatically log the required water monitoring samples every morning and print the corresponding labels.
It was mentioned earlier that paper-based filing systems are undesirable because of the relatively large effort required to search for and obtain data. The LIMS database addresses this issue. However, if the laboratory manually enters data via keyboard into its LIMS database, the laboratory can be paying a large up-front price to place the data in the database so that it can be easily retrieved. Practically from the inception of LIMS, direct instrument interfaces were envisioned whereby the LIMS would control the instrumentation and the instrument would automatically upload its data. Certainly this has been successfully implemented in some cases, but once again the proprietary nature of instrument control codes and data file structures makes this a monumental task for laboratories. Third-party parsing and interfacing software has been very useful in extracting information from instrument data files and uploading the data to LIMS. Once properly programmed and validated, these systems can bring about very large productivity gains in terms of the time saved entering and reviewing the data as well as resolving issues related to incorrect data entry. Progress will undoubtedly continue to be made on this front, since computers are uniquely qualified to perform such tedious, repetitive tasks, leaving the scientist to draw conclusions based on the summarized data.
BIBLIOGRAPHY
Bigelow, S. J. (2003). PC Hardware Desk Reference. Berkeley:
McGraw Hill.
Crecraft, D., Gergely, S. (2002). Analog Electronics: Circuits, Systems
and Signal Processing. Oxford: Butterworth-Heinemann.
Horowitz, P., Hill, W. (1989). The Art of Electronics. 2nd. ed.
New York: Cambridge University Press.
Isenhour, T. L., Jurs, P. C. (1971). Anal. Chem. 43: 20A.
Jurs, P. C., Kowalski, B. R., Isenhour, T. L. (1969). Anal. Chem.
41: 21.
Lai, E. (2004). Practical Digital Signal Processing for Engineers
and Technicians. Oxford: Newnes.
Lavine, B. K. (1992). Signal processing and data analysis. In: Haswell, S. J., ed. Practical Guide to Chemometrics. New York: Marcel Dekker, Inc.
Massart, D. L., Vandeginste, B. G. M., Deming, S. N.,
Michotte, Y., Kaufman, L. (1988). Chemometrics: A Text-
book. Amsterdam: Elsevier.
Martens, H., Naes, T. (1989). Multivariate Calibration. New York: John Wiley and Sons Ltd.
Moriguchi, I., Komatsu, K., Matsushita, Y. (1980). J. Med. Chem. 23: 20.
Mueller, S. (2003). Upgrading and Repairing PCs. 15th ed.
Indianapolis: Que.
Paszko, C., Turner, E. (2001). Laboratory Information Manage-
ment Systems. 2nd ed. New York: Marcel Dekker, Inc.
Van Swaay, M. (1997). The laboratory use of computers. In:
Ewing, G. W., ed. Analytical Instrumentation Handbook.
2nd ed. New York: Marcel Dekker, Inc.
Wold, S., Sjostrom, M. (1977). SIMCA: a method for analyzing chemical data in terms of similarity and analogy. In: Kowalski, B. R., ed. Chemometrics: Theory and Practice. ACS Symposium Series No. 52. Washington, D.C.: American Chemical Society.