\hat{x}(f) = \int_{-\infty}^{\infty} x(t) e^{-i 2\pi f t} dt    (1)
This requires the integral to be calculated over all time in a continuous manner. Out of practical necessity, however, one can only obtain the amplitude at discrete intervals for a finite period of time.
Table 2 Digital Communication Interfaces

Interface               Maximum transmission speed   Maximum cable length                        Maximum number of devices
Serial                  64 kbps                      10 ft                                       2
Parallel                50-100 kbytes/s              9-12 ft                                     2
USB 2.0                 480 Mbits/s                  5 m segments, with a maximum of six         127
                                                     segments between device and host
IEEE-488                1 Mbyte/s                    20 m (2 m per device)                       15
Twisted-pair Ethernet   1000 Mbps                    82 ft (329 ft at 100 Mbps)                  254 per subnet using TCP/IP
SCSI                    160 Mbytes/s                 6 m(a)                                      16

(a) Single-ended cable; differential cables can be up to 25 m long.
Several approximations are made in order to calculate the transform digitally, yielding the discrete Fourier transform:

\hat{x}(k \Delta f) = \frac{1}{N} \sum_{n=0}^{N-1} x(n) e^{-i 2\pi n k / N}    (2)
where N is the number of points sampled. This allows the signal to be represented in the frequency domain, as in Fig. 3.
The function assumes that every component of the input signal is periodic, that is, every component of the signal has an exact whole number of periods within the timeframe being studied. If not, discontinuities develop at the beginning and ending border conditions, resulting in a distortion of the frequency response known as leakage. In this phenomenon, part of the response from the true frequency band is attributed to neighboring frequency bands. This artificially broadens the signal response over larger frequency bands, which can obscure smaller-amplitude frequency signals. A technique known as windowing is used to reduce the signal amplitude to zero at the beginning and end of the time record (band-pass filter). This eliminates the discontinuities at the boundary time points, thus greatly reducing the leakage.
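As an illustration of Eq. (2) and the effect of windowing, the following sketch (assuming Python with NumPy and a synthetic sine wave whose period does not fit the record exactly) compares the out-of-band leakage of a plain DFT with that of a Hann-windowed DFT; the signal, sampling rate, and window choice are illustrative assumptions only.

```python
import numpy as np

# Synthetic record: a 52.3 Hz sine sampled at 1 kHz does not contain a whole
# number of periods in 1024 points, so the plain DFT leaks into nearby bands.
fs = 1000.0
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 52.3 * t)

X_raw = np.fft.rfft(x) / len(x)                        # discrete transform, Eq. (2)
X_win = np.fft.rfft(x * np.hanning(len(x))) / len(x)   # Hann-windowed transform

freqs = np.fft.rfftfreq(len(x), d=1 / fs)
far = np.abs(freqs - 52.3) > 20                        # bands far from the true peak
print(np.sum(np.abs(X_raw[far]) ** 2), np.sum(np.abs(X_win[far]) ** 2))
```

The energy printed for the windowed spectrum far from the true frequency is much smaller, which is the reduction in leakage described above.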
V. DATA ANALYSIS/CHEMOMETRICS
Besides their roles in controlling instruments and collecting and storing data, computers play a critical role in the computations and data processing needed for solving chemical problems. A good example is multivariate data analysis in analytical chemistry. The power of multivariate data analysis combined with modern analytical instruments is best demonstrated in areas where samples have to be analyzed as is and relevant information must be extracted from interference-laden data. These cases include characterization of chemical reactions, exploration of the relationship between the properties of a chemical and its structural and functional groups, fast identification of chemical and biological agents, monitoring and control of chemical processes, and much more. The multivariate data analysis technique for chemical data, also known as chemometrics, relies heavily on the capabilities of computers, because many chemometrics algorithms are computationally intensive and the data they are designed to analyze are usually very large. Fortunately, advances in computer technology have largely eliminated the performance issues that chemometrics applications faced in the early days due to limitations in computer speed and memory.
Chemometrics is the term given to a discipline that uses mathematical, statistical, and other logic-based methods to find and validate relationships between chemical data sets, to provide maximum relevant chemical information, and to design or select optimal measurement procedures and experiments. It covers many areas, from traditional statistical data evaluation to multivariate calibration, multivariate curve resolution, pattern recognition, experimental design, signal processing, neural networks, and more. As chemical problems get more complicated and more sophisticated mathematical tools become available to chemists, this list will certainly grow. It is not possible to cover all these areas in a short chapter like this. Therefore we chose to focus on core areas. The first is multivariate calibration, considered the centerpiece of chemometrics for its vast applications in modern analytical chemistry and the great amount of research done since the coinage of the discipline. The second area is pattern recognition. If multivariate calibration deals with predominantly quantitative problems, pattern recognition represents the other side of chemometrics: qualitative techniques that answer questions such as "Are the samples different?" and "How are they related?" by separating, clustering, and categorizing data. We hope that in this way readers can get a relatively full picture of chemometrics from such a short article.
A. Multivariate Calibration
1. General Introduction
In science and technology, establishing quantitative relationships between two or more measurement data sets is a basic activity. This activity, also known as calibration, is a process of finding the transformation that relates one data set to others that carry explicit information. A simple example is the calibration of a pH meter. After reading three standard solutions, the electrical voltages from the electrode are compared with the pH values of the standard solutions and a mathematical relation is defined. This mathematical relationship, often referred to as a calibration curve, is used to transform the electrode voltages into pH values when measuring new samples. This type of calibration is univariate in nature: it simply relates a single variable, voltage, to the pH values. One of two conditions must be met in order to have accurate predictions with a univariate calibration curve. Either the measurement must be highly selective, that is, the electrode responds to pH changes and nothing else, or the interferences that can cause changes in the electrode response must be removed from the sample matrix and/or the measurement process. The latter approach is found in chromatographic analyses, where the components in a sample are separated and individually detected. At each point of interest within a chromatographic analysis, the sample is pure and the detector is responding to the component of interest only. Univariate calibration works
perfectly here to relate the chromatographic peak heights or areas to the concentrations of the samples. On the other side of the measurement world, measurement objects are not preprocessed or purified to eliminate things that can interfere with the measurement. Univariate calibration can then suffer from erroneous instrument data, and, worse yet, there is no way for the analyst to tell whether he or she is getting correct results. This shortcoming of univariate calibration, which is referred to as zero-order calibration in tensor analysis because of its scalar data nature, has seriously limited its use in modern analytical chemistry.
One way to address the problem with univariate calibration is to use more measurement information in establishing the transformation that relates measurement data with the reference data (data that have more explicit information). Modern analytical techniques such as optical spectroscopy, mass spectrometry, and NMR deliver multiple outputs with a single measurement, providing an opportunity to overcome the shortcomings of univariate calibration. Calibration involving multiple variables is called multivariate calibration, which has been at the core of chemometrics from the very beginning. The major part of this section focuses on the discussion of multivariate calibration techniques.
The capability of multivariate calibration to deal with interferences rests on two bases: (1) the unique pattern in the measurement data (i.e., spectra) for each component of interest and (2) independent concentration variation of the components in the calibration standard set. Let us consider the data from a multivariate instrument, say, an optical spectrometer. The spectrum from each measurement is represented by a vector x = [x_1, x_2, ..., x_n], which carries unique spectral responses for the component(s) of interest and interference from the sample matrix. Measurement of a calibration set with m standards generates a data matrix X with m rows and n columns, with each row representing a spectrum. The concentration data matrix Y for p components of interest will have m rows and p columns, with each row containing the concentrations of the components of interest in a particular sample. The relationship between X (measurement) and Y (known values) can be described by the following equation:
Y = XB + E    (3)
The purpose of calibration is to find the transformation matrix B and evaluate the error matrix E. Using linear regression, the transformation matrix B, or the regression coefficient matrix as it is also called, can be found as:

B = (X'X)^{-1} X'Y    (4)
where X' represents the transpose of X. The inversion of the square matrix X'X is a critical step in multivariate calibration, and the method of inversion essentially differentiates the techniques of multivariate calibration. The following sections discuss the most commonly used methods.
2. Multiple Linear Regression
Multiple linear regression (MLR) is the simplest multivariate calibration method. In this method, the transformation matrix B [Eq. (4)] is calculated by direct inversion of X'X from the measurement matrix X. Doing so requires the matrix X to be of full rank; in other words, the variables in X must be independent of each other. If this condition is met, the transformation matrix B can be calculated using Eq. (4) without carrying over significant errors. In the prediction step, B is applied to the spectrum of an unknown sample to calculate the properties of interest:

\hat{y}_i = x_i'(X'X)^{-1} X'Y    (5)
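The following minimal sketch illustrates Eqs. (4) and (5) with NumPy on simulated data; the matrix names and the random calibration set are illustrative assumptions, not part of the original text. Solving the normal equations with a linear solver rather than forming an explicit inverse is numerically preferable but mathematically equivalent to Eq. (4).

```python
import numpy as np

# Simulated calibration set: X (20 standards x 3 independent variables) and
# Y (20 standards x 2 component concentrations); names are illustrative.
rng = np.random.default_rng(0)
B_true = rng.normal(size=(3, 2))
X = rng.normal(size=(20, 3))
Y = X @ B_true + 0.01 * rng.normal(size=(20, 2))

# Calibration step, Eq. (4): B = (X'X)^-1 X'Y, solved without forming the inverse
B = np.linalg.solve(X.T @ X, X.T @ Y)

# Prediction step, Eq. (5): apply B to the spectrum of an "unknown" sample
x_unknown = rng.normal(size=(1, 3))
print(x_unknown @ B)
```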
The most important consideration in MLR is to ensure the independence of the variables in X. If the variables are linearly dependent, that is, at least one of the columns can be written as an approximate or exact linear combination of the others, the matrix X is called collinear. In the case of collinearity, some elements of B from the least squares fit have large variance and the whole transformation matrix loses its stability. Therefore, collinearity in X is a serious problem for MLR and limits its use.
To better understand the problem, consider an example in optical spectroscopy. If one wants to measure the concentration of two species in a sample, one has to hope that the two species have distinct spectral features so that it is possible to find a spectral peak corresponding to each of them. In some cases, one would also like to add a peak for correcting variations such as spectral baseline drift. As a result, after measuring n standard samples one would have an X matrix with n rows and three columns (variables). To ensure independence of the X variables, the spectral peaks used for measuring the components of interest have to be reasonably separated. In some cases, such as in FTIR and Raman spectroscopy, whose fingerprint regions are powerful in differentiating chemical compounds, this might be possible. MLR is accurate and reliable when the no-collinearity requirement is met. In other cases, however, finding reasonably separated peaks is not possible. This is often true in near-infrared (NIR) and UV/Vis spectroscopy, which lack the capability to differentiate between chemicals. The NIR and UV/Vis peaks are broad, often overlapping, and the spectra of different chemicals can look very similar. This makes choosing independent variables (peaks) very difficult. In cases where the variables are collinear, using MLR can be problematic. This is probably one of the reasons that MLR is less used in NIR and UV/Vis applications.
For the same collinearity reason, one should also be aware of the problems brought about by redundancy in selecting variables for the X matrix. Having more variables to describe a property of interest (e.g., concentration) may generate a better fit in the calibration step. However, this can be misleading. The large variance in the transformation matrix caused by collinearity will ultimately harm the prediction performance of the calibration model. It is not uncommon to see the mistake of using an excessive number of variables to establish a calibration model. Such a calibration model can be inaccurate in predicting unknown samples and is sensitive to minor variations.
3. Factor Analysis Based Calibration
In their book, Martens and Naes listed the problems encountered in dealing with complex chemical analysis data using traditional calibration methods such as univariate calibration:
1. Lack of selectivity: No single X-variable is sufficient to predict Y (the property matrix). To attain selectivity one must use several X-variables.
2. Collinearity: There may be redundancy and hence collinearity in X. A method that transforms correlated variables into independent ones is needed.
3. Lack of knowledge: Our a priori understanding of the mechanisms behind the data may be incomplete or wrong. Calibration models will fail when new variations or constituents unaccounted for by the calibration occur in samples. One wishes at least to have a method to detect outliers, and further to improve the calibration technology so that this kind of problem can be solved.
Factor analysis based calibration methods have been developed to deal with these problems. Due to space limits, we discuss only the two most popular methods, principal component regression (PCR) and partial least squares (PLS).
Principal Component Regression
The basis of PCR is principal component analysis (PCA), which computes so-called principal components to describe the variation in the matrix X. In PCA, the main variation in X = {x_k, k = 1, 2, ..., K} is represented by a smaller number of variables T = {t_1, ..., t_A} (A < K). T represents the principal components computed from X. The principal components are calculated by finding the first loading vector u_1 that maximizes the variance of u_1'x and satisfies u_1'X'Xu_1 = t_1't_1, where t_1 = Xu_1 is the corresponding score vector. The next principal component is calculated in the same way, but with the restriction that t_1 and t_2 are orthogonal (t_1't_2 = 0). The procedure continues under this restriction until it reaches the dimension limit of the matrix X. Figure 4 may help in understanding the relationship between the original variables x and the principal components t.
Consider a sample set measured with three variables x_1, x_2, and x_3. Each sample is represented by a dot in the coordinate system formed by x_1, x_2, and x_3. What PCA does is to find the first principal component (t_1), which points in the direction of largest variation in the data set; then the second principal component (t_2), capturing the second largest variation and orthogonal to t_1; and finally the third principal component (t_3), which describes the remaining variation and is orthogonal to t_1 and t_2. From the figure it is clear that the principal components replace x_1, x_2, and x_3 to form a new coordinate system, that these principal components are independent of each other, and that they are arranged in descending order in terms of the amount of variance they describe. In any data set gathered from reasonably well-designed and well-measured experiments, useful information is stronger than noise. Therefore it is fair to expect that the first several principal components mainly contain the useful information, and that the later ones are dominated by noise. Users can conveniently keep the first several principal components for use in calibration and discard the rest. Thus, useful information is kept while the noise is thrown out. Through PCA, the original matrix X is decomposed into three matrices: V consists of the normalized score vectors, U is the loading matrix, and S is a diagonal matrix containing the singular values that result from normalizing the score vectors.
X = VSU'    (6)
As just mentioned, X can be approximated by using the first several significant principal components:

\tilde{X} = V_A S_A U_A'    (7)
where V_A, S_A, and U_A are subsets of V, S, and U, respectively, formed by the first A principal components. \tilde{X} is a close approximation of X, with the minor variance removed by discarding the principal components after A.
Figure 4 PCA illustrated for three x-variables with three principal components (factors).
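A minimal sketch of Eqs. (6) and (7) using NumPy's singular value decomposition is given below; the synthetic rank-two data set is an assumption for illustration, and mean-centering (normally done first) is omitted for brevity. Note that NumPy's output ordering matches the notation here: its left singular vectors play the role of V (normalized scores) and its right singular vectors the role of U (loadings).

```python
import numpy as np

# Synthetic data: 30 samples, 100 variables, driven by two underlying factors.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 100))
X += 0.01 * rng.normal(size=X.shape)              # small noise

# Eq. (6): X = V S U'.  NumPy returns V, the singular values s, and U'.
V, s, Ut = np.linalg.svd(X, full_matrices=False)

A = 2                                             # factors retained
X_tilde = V[:, :A] @ np.diag(s[:A]) @ Ut[:A, :]   # Eq. (7)
print(np.linalg.norm(X - X_tilde) / np.linalg.norm(X))   # small relative residual
```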
Principal components are used both in qualitative interpretation of data and in regression to establish quantitative calibration models. In qualitative data interpretation, the so-called score plot is often used. The element v_ij in V is the projection of the ith sample on the jth principal component. Therefore each sample will have a unique position in a space defined by the score vectors, if it is unique in the measured data X. Figure 5 illustrates the use of a score plot to visually identify amorphous samples from crystalline samples measured with NIR spectroscopy. The original NIR spectra show some differences between amorphous and crystalline samples, but they are subtle and complex to the human eye. PCA and the score plot present the differences in a much simpler and more straightforward way. In the figure, the circular dots are samples used as standards to establish the score space. New samples (triangular and star dots) are projected into the space and grouped according to their crystallinity. It is clear that the new samples differ: some are crystalline, so they fall into the crystalline circle. The amorphous samples are left outside of the circle because of the abnormalities that show up in their NIR spectra. Based on the deviations of the crystalline sample dots on each principal component axis, it is possible to calculate a statistical boundary to automatically detect amorphous samples. This kind of scheme is the basis of outlier detection by factor analysis based calibration methods such as PCA and PLS.
Figure 5 PCA applied to identify sample crystallinity.
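A hedged sketch of such a statistical boundary is shown below, assuming NumPy and synthetic calibration data: scores of new samples are compared against a simple three-standard-deviation limit on each retained principal component axis. A production implementation would typically use a statistically derived limit (for example, Hotelling's T-squared); the fixed 3-sigma rule and sample values here are illustrative only.

```python
import numpy as np

# Calibration ("crystalline") samples living mostly in a two-factor subspace.
rng = np.random.default_rng(2)
loadings = rng.normal(size=(2, 40))
X_cal = rng.normal(size=(50, 2)) @ loadings + 0.02 * rng.normal(size=(50, 40))

mean = X_cal.mean(axis=0)
_, _, Ut = np.linalg.svd(X_cal - mean, full_matrices=False)
P = Ut[:2].T                                    # loadings of the first two PCs
scores_cal = (X_cal - mean) @ P
limits = 3 * scores_cal.std(axis=0)             # simple per-axis boundary

X_new = np.vstack([mean,                        # a perfectly "typical" sample
                   np.array([5.0, 5.0]) @ loadings])   # an extreme sample
scores_new = (X_new - mean) @ P
print(np.any(np.abs(scores_new) > limits, axis=1))     # the extreme sample is flagged
```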
PCA combined with a regression step forms PCR. PCR has been widely used by chemists for the simplicity of interpreting the data through the loading matrix and score matrix. Equation (8) shows how the regression is performed with the principal components:

b = y V_A S_A^{-1} U_A'    (8)

In this equation, y is the property vector of the calibration standard samples and b is the regression coefficient vector used in predicting the measured data x of an unknown sample:

\hat{y} = b x'    (9)
where \hat{y} is the predicted property. A very important aspect of PCR is determining the number of factors used in Eqs. (7) and (8). The optimum is to use as much of the information in X as possible while keeping the noise out. That means one needs to decide the last factor (principal component) that has useful information and discard all factors after that one. A common mistake in multivariate calibration is to use too many factors and thereby overfit the data. The extra factors can make a calibration curve look unrealistically good (the noise also gets fitted) but unstable and inaccurate when used in prediction. A rigorous validation step is necessary to avoid these kinds of mistakes. When the calibration sample set is sufficiently large, the validation samples can be randomly selected from the sample pool. If the calibration samples are limited in number, a widely used method is cross-validation within the calibration sample set. In cross-validation, a sample (or several samples) is taken out of the sample set and predicted by the calibration built on the remaining samples. The prediction errors corresponding to the number of factors used in calibration are recorded. Then the sample is put back into the sample set and another one is taken out in order to repeat the same procedure. The process continues until each sample has been left out once and predicted. The average error is calculated as a function of the number of principal components used. The formulas for the standard error of prediction, SEP (using a separate validation sample set), and the standard error of cross-validation, SECV, are slightly different:
SEP = \sqrt{ \frac{\sum_i (y_i - \hat{y}_i)^2}{n} }    (10)

SECV = \sqrt{ \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - A} }    (11)
where \hat{y}_i is the model-predicted value, y_i is the reference value for sample i, n is the total number of samples used in calibration, and A is the number of principal components used. When the errors are plotted against the number of factors used in calibration, they typically look like the curve illustrated in Fig. 6. As the principal components (factors) are added into the calibration one at a time, SEP or SECV decreases, hits a minimum, and then bounces back. The reason is that the first several principal components contain information about the samples and are needed to improve the accuracy of the calibration model. The later principal components, on the other hand, are dominated by noise. Using them makes the model sensitive to irrelevant variations in the data, thus becoming less accurate and potentially more vulnerable to process variations and instrument drift. There is clearly an optimal number of principal components for each calibration model. One of the major tasks in multivariate calibration is to find that optimum, which keeps the calibration model simple (a small number of principal components) while achieving the highest accuracy possible.
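A leave-one-out cross-validation sketch in the spirit of Eq. (11) is shown below, assuming NumPy and PCR models built via the singular value decomposition; the data are synthetic and mean-centering is omitted for brevity, so the function names and numbers are illustrative only.

```python
import numpy as np

def pcr_b(X, y, A):
    """Regression vector for A principal components, per Eq. (8)."""
    V, s, Ut = np.linalg.svd(X, full_matrices=False)
    return y @ V[:, :A] @ np.diag(1.0 / s[:A]) @ Ut[:A, :]

def secv(X, y, A):
    """Leave-one-out standard error of cross-validation, per Eq. (11)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                 # leave sample i out
        b = pcr_b(X[keep], y[keep], A)
        press += (y[i] - X[i] @ b) ** 2
    return np.sqrt(press / (n - A))

# Synthetic example: SECV versus number of factors; pick the minimum.
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 50))
y = X @ (0.1 * rng.normal(size=50)) + 0.05 * rng.normal(size=25)
errors = {A: secv(X, y, A) for A in range(1, 11)}
print(min(errors, key=errors.get), errors)
```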
Partial Least Squares
PLS is another multivariate calibration method that uses principal components rather than the original X-variables. It differs from PCR by using the y-variables actively during the decomposition of X. In PLS, the principal components are not calculated along the direction of largest variation in X at each iteration step. They are calculated by balancing the information from the X and y matrices to best describe the information in y. The rationale behind PLS is that, in some cases, some variations in X, although significant, may not be related to y at all. Thus, it makes sense to calculate principal components more relevant to y, not just to X. Because of this, PLS may yield simpler models than PCR.
Unlike PCR, the PLS decomposition of the measurement data X involves the property vector y. A loading weight vector is calculated for each loading of X to ensure that the loadings are related to the property data y. Furthermore, the property data y are not directly used in the calibration. Instead, their loadings are also calculated and used together with the X loadings to obtain the transformation vector b:

b = W(P'W)^{-1} q    (12)

where W is the loading weight matrix, P is the loading matrix for X, and q is the loading vector for y. Martens and Naes give detailed procedures for PLS in their book.
In many cases, PCR and PLS yield similar results. However, because the PLS factors are calculated utilizing both the X and y data, PLS can sometimes give useful results from low-precision X data where PCR may fail. For the same reason, PLS has a stronger tendency than PCR to overfit noisy y data.
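For illustration, a compact PLS1 sketch (one y variable) in the NIPALS style is given below, ending with the regression vector of Eq. (12). It assumes NumPy, mean-centered synthetic data, and a single response; it is not the exact procedure of Martens and Naes, only a commonly used simplified form.

```python
import numpy as np

def pls1_nipals(X, y, A):
    """Compact PLS1 sketch (single y, A factors); X and y must be mean-centered."""
    n, p = X.shape
    W, P, q = np.zeros((p, A)), np.zeros((p, A)), np.zeros(A)
    Xr, yr = X.copy(), y.copy()
    for a in range(A):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)           # loading weight vector
        t = Xr @ w                       # score vector
        tt = t @ t
        P[:, a] = Xr.T @ t / tt          # X loading
        q[a] = yr @ t / tt               # y loading
        Xr = Xr - np.outer(t, P[:, a])   # deflate X
        yr = yr - t * q[a]               # deflate y
        W[:, a] = w
    return W @ np.linalg.solve(P.T @ W, q)   # regression vector b, Eq. (12)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 80))
y = X[:, :5].sum(axis=1) + 0.05 * rng.normal(size=30)
Xc, yc = X - X.mean(axis=0), y - y.mean()
b = pls1_nipals(Xc, yc, A=3)
print(np.corrcoef(y, Xc @ b + y.mean())[0, 1])   # correlation of fit
```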
B. Pattern Recognition
Modern analytical chemistry is data rich. Instruments such as mass spectrometers, optical spectrometers, NMR spectrometers, and many hyphenated instruments generate a lot of data for each sample. However, data rich does not mean information rich. Converting the data into useful information is the task of chemometrics. Here, pattern recognition plays an especially important role in exploring, interpreting, and understanding the complex nature of multivariate relationships. Since the tool was first used on chemical data by Jurs et al. in 1969, many new applications have been published, including several books containing articles on this subject. Based on Lavine's review, we will briefly discuss here four main subdivisions of pattern recognition methodology: (1) mapping and display, (2) clustering, (3) discriminant development, and (4) modeling.
1. Mapping and Display
When there are two to three types of samples, mapping and display is an easy way to visually inspect the relationships between the samples. For example, samples can be plotted in a 2D or 3D coordinate system formed by the variables describing the samples. Each sample is represented by a dot on the plot. The distribution and grouping of the samples reveal the relationships between them. The frequently encountered problem with modern analytical data is that the number of variables needed to describe a sample is often far too large for this simple approach. An ordinary person cannot handle a coordinate system with more than three dimensions. The data from an instrument, however, can comprise hundreds or even thousands of variables. To utilize all of the information carried by so many variables, factor analysis methods can be used to compress the dimensionality of the data set and eliminate collinearity between the variables. In the last section, we discussed the use of principal components in multivariate data analysis (PCA). The plot generated by the principal components (factors) is exactly the same as the plots used in the mapping and display method. The orthogonal nature of principal components allows convenient evaluation of the factors affecting samples based on their positions in the principal component plot.
Figure 6 SECV plotted against the number of principal components (factors) used in calibration.
The distance between samples, or from a sample to the centroid of a group, provides a quantitative measure of the degree of similarity of the sample to others. The most frequently used measures are the Euclidean distance and the Mahalanobis distance. The Euclidean distance is expressed as:
D_E = \sqrt{ \sum_{j=1}^{n} (x_{Kj} - x_{Lj})^2 }    (13)
where x_{Kj} and x_{Lj} are the jth coordinates of samples K and L, respectively, and n is the total number of coordinates. The Mahalanobis distance is calculated by the following equation:
D_M^2 = (x_L - \bar{x}_K)' C_K^{-1} (x_L - \bar{x}_K)    (14)
where x_L and \bar{x}_K are, respectively, the data vector of sample L and the mean data vector for class K, and C_K is the covariance matrix of class K. The Euclidean distance is simply the geometric distance between samples. It does not consider the collinearity between the variables that form the coordinate system. If variables x_1 and x_2 are independent, the Euclidean distance is not affected by the position of the sample in the coordinate system and truly reflects the similarity between the samples or sample groups. When the variables are correlated, this may not be true. The Mahalanobis distance takes this problem into account by including a factor of correlation (or covariance).
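The two distance measures can be computed directly, as in the following NumPy sketch; the class data and the sample vector are hypothetical.

```python
import numpy as np

def euclidean_distance(x_k, x_l):
    """Eq. (13): geometric distance between two sample vectors."""
    return np.sqrt(np.sum((x_k - x_l) ** 2))

def mahalanobis_distance_sq(x_l, class_K):
    """Eq. (14): squared distance of x_l from class K (rows of class_K = samples)."""
    x_bar = class_K.mean(axis=0)
    C_inv = np.linalg.inv(np.cov(class_K, rowvar=False))
    d = x_l - x_bar
    return d @ C_inv @ d

# Hypothetical class with two correlated variables, and one new sample.
rng = np.random.default_rng(5)
class_K = rng.normal(size=(50, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])
sample = np.array([1.5, 1.5])
print(euclidean_distance(sample, class_K.mean(axis=0)),
      mahalanobis_distance_sq(sample, class_K))
```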
2. Clustering
Clustering methods are based on the principle that the distance between pairs of points (i.e., samples) in the measurement space is inversely related to their degree of similarity. There are several types of clustering algorithms using distance measurements. The most popular one is called hierarchical clustering. The first step in this algorithm is to calculate the distances between all pairs of points (samples). The two points having the smallest distance are paired and replaced by a new point located midway between the two original points. Then the distance calculation starts again with the new data set. Another new point is generated between the two data points having the minimal distance and replaces the original data points. This process continues until all data points have been linked.
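A naive sketch of this midpoint-merging procedure is shown below, assuming NumPy and a small synthetic data set; practical implementations use more efficient linkage algorithms, so this is only a literal illustration of the steps described above.

```python
import numpy as np

def hierarchical_clustering(points):
    """Literal sketch: repeatedly merge the closest pair of points into their
    midpoint, recording which original samples are linked at what distance."""
    pts = [p.astype(float) for p in points]
    members = [[i] for i in range(len(points))]
    links = []
    while len(pts) > 1:
        best, best_d = (0, 1), np.inf
        for i in range(len(pts)):                 # closest pair by Euclidean distance
            for j in range(i + 1, len(pts)):
                d = np.linalg.norm(pts[i] - pts[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        links.append((members[i], members[j], best_d))
        new_pt = (pts[i] + pts[j]) / 2.0          # replacement point at the midpoint
        new_members = members[i] + members[j]
        pts = [p for k, p in enumerate(pts) if k not in (i, j)] + [new_pt]
        members = [m for k, m in enumerate(members) if k not in (i, j)] + [new_members]
    return links

rng = np.random.default_rng(6)
data = np.vstack([rng.normal(0, 0.1, (3, 2)), rng.normal(2, 0.1, (3, 2))])
for left, right, dist in hierarchical_clustering(data):
    print(left, right, round(dist, 3))
```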
3. Classification
Both mapping and display and clustering belong to the unsupervised pattern recognition techniques: no information about the samples other than the measured data is used in the analyses. In chemistry, there are cases where a classification rule has to be developed to predict unknown samples. Development of such a rule needs training datasets whose class memberships are known. This is called supervised pattern recognition, because the knowledge of the class membership of the training sets is used in the development of discriminant functions. The most popular methods used in solving chemistry problems include the linear learning machine and the adaptive least squares (ALS) algorithm. For two classes separated in a symmetric manner, a linear line (or surface) can be found to divide the two classes (Fig. 7). Such a discriminant function can be expressed as:
D = wx'    (15)
where w is called the weight vector, w = {w_1, w_2, ..., w_{n+1}}, and x = {x_1, x_2, ..., x_{n+1}} is the pattern vector, whose elements can be the measurement variables or principal component scores. Establishing the discriminant function amounts to determining the weight vector w under the restraint that it provides the best classification (the most correct Ds) for the two classes. The method is usually iterative: error correction, or negative feedback, is used to adjust w until it gives the best separation of the classes. The samples in the training set are checked one at a time by the discriminant function. If the classification is correct, w is kept unchanged and the program moves to the next sample. If the classification is incorrect, w is altered so that correct classification is obtained. The altered w is then used in the subsequent steps until the program has gone through all samples in the training set. The altered w is defined as:

w_{new} = w - \frac{2 s_i x_i}{x_i' x_i}    (16)
Figure 7 Example of a linear discriminant function separating two classes.
where w_{new} is the altered weight vector, s_i is the discriminant score for the misclassified sample i, and x_i is the pattern vector of sample i. In situations where separation cannot be achieved well by a simple linear function, ALS can be used. In ALS, w is obtained using least squares:

w = (X'X)^{-1} X' f    (17)
where f is called the forcing vector, containing a forcing factor f_i for each sample i. When the classification of sample i is correct, f_i = s_i, where s_i is the discriminant score for sample i. If the classification is incorrect, f_i is modified according to the following equation:
f_i = s_i + \frac{0.1}{(a + d_i)^2} + b(a + d_i)    (18)
where a and b are empirically determined constants and d_i is the distance between the pattern vector and the classification surface (i.e., the discriminant score). With the corrected forcing factor, an improved weight vector w is calculated using Eq. (17) and used in the next round of discriminant score calculation. The procedure continues until favorable classification results are obtained or a preselected number of feedback iterations has been reached.
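The error-correction scheme of Eqs. (15) and (16) can be sketched as follows, assuming NumPy, two synthetic classes coded as +1/-1, and an appended bias term; the starting weights, data, and stopping rule are illustrative assumptions rather than the exact published procedure.

```python
import numpy as np

def train_linear_learning_machine(patterns, classes, max_passes=100):
    """Error-correction training per Eqs. (15)-(16); classes are coded +1/-1."""
    X = np.hstack([patterns, np.ones((len(patterns), 1))])   # append a bias term
    w = np.ones(X.shape[1])                                  # arbitrary starting weights
    for _ in range(max_passes):
        wrong = 0
        for x_i, c_i in zip(X, classes):
            s_i = w @ x_i                                    # discriminant score, Eq. (15)
            if np.sign(s_i) != c_i:                          # misclassified sample
                w = w - (2 * s_i / (x_i @ x_i)) * x_i        # reflection update, Eq. (16)
                wrong += 1
        if wrong == 0:                                       # separable: training done
            break
    return w

rng = np.random.default_rng(7)
patterns = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
                      rng.normal([3, 3], 0.5, (20, 2))])
classes = np.array([-1] * 20 + [1] * 20)
w = train_linear_learning_machine(patterns, classes)
print(np.sign(np.hstack([patterns, np.ones((40, 1))]) @ w))
```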
The nonparametric linear discriminant functions discussed earlier have limitations when dealing with classes separated in asymmetric manners. One could imagine a situation where a group of samples is surrounded by samples that do not belong to that class. There is no way for a linear classification algorithm to find a linear discriminant function that separates the class from these samples. There is apparently a need for algorithms that have enough flexibility to deal with this type of situation.
4. SIMCA
SIMCA stands for soft independent modeling of class analogy. It was developed by Wold and coworkers for dealing with asymmetric separation problems and is based on PCA. In SIMCA, PCA is performed separately on each class in the dataset. Each class is then approximated by its own principal components:

X_i = \bar{X}_i + T_{iA} P_{iA}' + E_i    (19)
where X_i (N × P) is the data matrix of class i and \bar{X}_i is the mean matrix of X_i, with each row being the mean of X_i. T_{iA} and P_{iA} are the score matrix and loading matrix, respectively, using A principal components. E_i is the residual matrix between the original data and the approximation by the principal component model.
The residual variance for the class is defined by:

S_0^2 = \frac{\sum_{i=1}^{N} \sum_{j=1}^{P} e_{ij}^2}{(P - A)(N - A - 1)}    (20)
where e_{ij} is an element of the residual matrix E_i and S_0 is the residual variance, which is a measure of the tightness of the class. A smaller S_0 indicates a more tightly distributed class.
In the classification of unknown samples, the sample data are projected onto the principal component space of each class, with the score vector calculated as:

t_{ik} = x_i P_k    (21)
where t_{ik} is the score vector of sample i in the principal component space of class k and P_k is the loading matrix of class k. With the score vector t_{ik} and loading matrix P_k, the residual vector of sample i fitting into class k can be calculated similarly to Eq. (19). The residual variance of fit for sample i is then:
S_i^2 = \frac{\sum_{j=1}^{P} e_{ij}^2}{P - A}    (22)
The residual variance of fit is compared with the residual variance of each class. If S_i is significantly larger than S_0, sample i does not belong to that class. If S_i is not significantly larger than S_0, sample i is considered a member of that class. An F-test is employed to determine whether S_i is significantly larger than S_0.
The number of principal components (factors) used for each class to calculate S_i and S_0 is determined through cross-validation. Similar to what we discussed for multivariate calibration, cross-validation in SIMCA takes one (or several) samples out of a class at a time and uses the remaining samples to calculate the residual variance of fit with different numbers of principal components. After all samples have been taken out once, the overall residual variance of fit as a function of the number of principal components used is calculated. The optimal number of principal components is the one that gives the smallest classification error.
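A minimal per-class model in the spirit of Eqs. (19)-(22) is sketched below, assuming NumPy and synthetic data; the fixed acceptance ratio stands in for a proper F-test critical value, and the class and variable names are hypothetical.

```python
import numpy as np

class SimcaClass:
    """Per-class PCA model in the spirit of Eqs. (19)-(22), with A factors."""
    def __init__(self, X_class, A):
        self.mean = X_class.mean(axis=0)
        Xc = X_class - self.mean
        _, _, Ut = np.linalg.svd(Xc, full_matrices=False)
        self.P = Ut[:A].T                              # loading matrix P_k
        N, P_vars = X_class.shape
        E = Xc - (Xc @ self.P) @ self.P.T              # residual matrix E_i, Eq. (19)
        self.s0_sq = (E ** 2).sum() / ((P_vars - A) * (N - A - 1))   # Eq. (20)
        self.A, self.P_vars = A, P_vars

    def fit_variance(self, x):
        """Residual variance of fit S_i^2 for one sample, Eqs. (21)-(22)."""
        xc = x - self.mean
        e = xc - (xc @ self.P) @ self.P.T
        return (e ** 2).sum() / (self.P_vars - self.A)

# Hypothetical class built on three factors, then one member and one outlier.
rng = np.random.default_rng(8)
loadings = rng.normal(size=(3, 20))
model = SimcaClass(rng.normal(size=(30, 3)) @ loadings
                   + 0.05 * rng.normal(size=(30, 20)), A=3)
member = rng.normal(size=3) @ loadings + 0.05 * rng.normal(size=20)
outlier = 2.0 * rng.normal(size=20)
for x in (member, outlier):
    print(model.fit_variance(x) / model.s0_sq < 3.0)   # 3.0 stands in for an F cutoff
```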
SIMCA is a powerful method for the classification of complex multivariate data. It does not require a mathematical function to define a separation line or surface. Each sample is compared with a class within that class's principal component subspace. Therefore, it is very flexible in dealing with asymmetrically separated data and with classes of different degrees of complexity. Conceptually, it is also easy to understand if one has a basic knowledge of PCA. These are the main reasons why SIMCA has become very popular among chemists.
VI. DATA ORGANIZATION AND STORAGE
The laboratory and the individual scientist can easily be overwhelmed by the sheer volume of data produced today. Very rarely can an analytical problem be answered with a single sample, let alone a single analysis. Compound
this by the number of problems or experiments that a scientist must address, and the amount of time spent organizing and summarizing the data can eclipse the time spent acquiring it. Scientific data also tend to be spread out among several different storage systems. The scientist's conclusions based on a series of experiments are often documented in formal reports. Instrument data are typically contained on printouts or in electronic files. The results of individual experiments tend to be documented in laboratory notebooks or on official forms designed for that purpose.
It is important that all of the data relevant to an experiment be captured: the sample preparation, standard preparation, and instrument parameters, as well as the significance of the sample itself. This metadata must be cross-referenced to the raw data and the final results so that they can be reproduced if necessary. It is often written in the notebook, or in many cases it is captured by the analytical instrument, stored in the data file, and printed on the report, where it cannot be easily searched. Without this information, the actual data collected by an instrument can be useless, as this information may be crucial to its interpretation.
Scientists have taken advantage of various personal productivity tools such as electronic spreadsheets, personal databases, and file storage schemes to organize and store their data. While such tools may be adequate for a single scientist, such as a graduate student working on a single project, they fail for laboratories performing large numbers of tests. It is also very difficult to use such highly configurable, nonaudited software in regulated environments. In such cases, a highly organized system of storing data, one that requires compliance with the established procedures by all of the scientific staff, is required to ensure an efficient operation.
A. Automated Data Storage
Ideally, all of the scientific data files of a laboratory would be cataloged (indexed) and stored in a central data repository. There are several commercial data management systems designed to do just this. Ideally these systems will automatically catalog the files using indexing data available in the data files themselves and then upload the files without manual intervention from the scientist. In reality, this is more difficult than it would first appear. The scientist must enter the indexing data into the scientific application, and the scientific application must support its entry. Another potential problem is the proprietary nature of most instrument vendors' data files. Even when the instrument vendors are willing to share their data formats, the sheer number of different instrument file formats makes this a daunting task. Still, with some standardization, these systems can greatly decrease the time scientists spend on mundane filing-type activities and provide a reliable archive for the laboratory's data. These systems also have the added benefit of providing the file security and audit trail functionality required in regulated laboratories on an enterprise-wide scale instead of a system-by-system basis.
However, storing the data files in a database solves only part of the archiving problem. Despite the existence of a few industry-standard file formats, most vendors use a proprietary file format, as already discussed. If the data files are saved in their native file format, they are useful only for as long as the originating application is available or a suitable viewer is developed. Rendering the data files in a neutral file format such as XML mitigates the obsolescence problem but once again requires that the file format be known. It will also generally preclude reanalyzing the data after the conversion.
B. Laboratory Information Management Systems
Analytical laboratories, especially quality control, clinical testing, and central research labs, produce large amounts of data that need to be accessed by several different groups such as customers, submitters, analysts, managers, and quality assurance personnel. Paper files require a necessarily manual process for searching results, demanding both personnel and significant amounts of time. Electronic databases are the obvious solution for storing the data so that they can be quickly retrieved as needed. As long as sufficient order is imposed on the storage of the data, large amounts of data can be retrieved and summarized almost instantaneously by all interested parties.
A database by itself, however, does not address the workflow issues that arise between the involved parties. Laboratories under regulatory oversight, such as pharmaceutical quality control, clinical, environmental control, pathology, and forensic labs, must follow strict procedures with regard to sample custody and testing reviews. Laboratory information management systems (LIMS) were developed to enforce the laboratory's workflow rules as well as store the analytical results for convenient retrieval. Everything from sample logging, workload assignments, data entry, quality assurance review, managerial approval, and report generation to invoice processing can be carefully controlled and tracked. The scope of a LIMS can vary greatly, from a simple database that stores final results and prints reports to a comprehensive data management system that includes raw data files, notebook-type entries, and standard operating procedures. The degree to which this can be done will depend upon the ability and willingness of all concerned parties to standardize their procedures. The LIMS
functions are often also event and time driven. If a sample fails to meet specifications, the system can be programmed to automatically e-mail the supervisor or log additional samples. It can also be programmed to automatically log the required water monitoring samples every morning and print the corresponding labels.
It was mentioned earlier that paper-based filing systems are undesirable because of the relatively large effort required to search for and obtain data. The LIMS database addresses this issue. However, if the laboratory manually enters data via keyboard into its LIMS database, the laboratory can be paying a large up-front price to place the data in the database so that it can be easily retrieved. Practically from the inception of LIMS, direct instrument interfaces were envisioned whereby the LIMS would control the instrumentation and the instrument would automatically upload its data. Certainly this has been successfully implemented in some cases, but once again the proprietary nature of instrument control codes and data file structures makes this a monumental task for laboratories. Third-party parsing and interfacing software has been very useful in extracting information from instrument data files and uploading the data to LIMS. Once properly programmed and validated, these systems can bring about very large productivity gains in terms of the time saved entering and reviewing the data as well as resolving issues related to incorrect data entry. Progress will undoubtedly continue to be made on this front, since computers are uniquely qualified to perform such tedious, repetitive tasks, leaving the scientist to draw conclusions based on the summarized data.
BIBLIOGRAPHY
Bigelow, S. J. (2003). PC Hardware Desk Reference. Berkeley:
McGraw Hill.
Crecraft, D., Gergely, S. (2002). Analog Electronics: Circuits, Systems
and Signal Processing. Oxford: Butterworth-Heinemann.
Horowitz, P., Hill, W. (1989). The Art of Electronics. 2nd. ed.
New York: Cambridge University Press.
Isenhour, T. L., Jurs, P. C. (1971). Anal. Chem. 43: 20A.
Jurs, P. C., Kowalski, B. R., Isenhour, T. L. (1969). Anal. Chem.
41: 21.
Lai, E. (2004). Practical Digital Signal Processing for Engineers
and Technicians. Oxford: Newnes.
Lavine, B. K. (1992). Signal processing and data analysis. In: Haswell, S. J., ed. Practical Guide to Chemometrics. New York: Marcel Dekker, Inc.
Massart, D. L., Vandeginste, B. G. M., Deming, S. N.,
Michotte, Y., Kaufman, L. (1988). Chemometrics: A Text-
book. Amsterdam: Elsevier.
Martens, H., Naes, T. (1989). Multivariate Calibration. New York: John Wiley and Sons Ltd.
Moriguchi, I., Komatsu, K., Matsushita, Y. (1980). J. Med. Chem. 23: 20.
Mueller, S. (2003). Upgrading and Repairing PCs. 15th ed.
Indianapolis: Que.
Paszko, C., Turner, E. (2001). Laboratory Information Manage-
ment Systems. 2nd ed. New York: Marcel Dekker, Inc.
Van Swaay, M. (1997). The laboratory use of computers. In:
Ewing, G. W., ed. Analytical Instrumentation Handbook.
2nd ed. New York: Marcel Dekker, Inc.
Wold, S., Sjostrom, M. (1977). SIMCA: a method for analyzing chemical data in terms of similarity and analogy. In: Kowalski, B. R., ed. Chemometrics: Theory and Practice. ACS Symposium Series No. 52. Washington, D.C.: American Chemical Society.