This review covers the application of Genetic Algorithms (GAs) in Chemometrics. The first applications of GAs in
chemistry date back to the 1970s, and in the last decades, they have been more and more frequently used to solve
different kinds of problems, for example, when the objective functions do not possess properties such as continuity,
differentiability, and so on. These algorithms maintain and manipulate a family, or population, of solutions and
implement a “survival of the fittest” strategy in their search for better solutions. GAs are very useful for optimization
and for variable selection in modeling and calibration, because the presence or absence of each variable in a
calibration model strongly affects the prediction ability of the model itself. This review is not a complete
summary of the applications of GAs to chemometric problems; its goal is rather to show researchers the main
fields of application of GAs and to provide a list of references on the subject.
operator randomly selects, N times, a chromosome of the population. The probability of a particular chromosome
being selected is a function of its associated response, so that the best ones have a greater probability of
being picked than the worst ones. Following this step, a new population is obtained in which the best
chromosomes are copied more often; this leads to a better average response. In the cross-over step, the N
chromosomes forming the new population are randomly paired to form N/2 pairs. From each pair of “parents”, two
new chromosomes (the “offspring”) will be created by randomly assigning to each of them the genes of one of the
two parents. As a result, the cross-over allows the exploration of new experimental conditions by mixing values
of variables already tested, although in different combinations.

Mutation: Although the cross-over operator is active at the gene level (whole genes are involved), the mutation
takes place at bit level. To do this, for each bit of each chromosome, a random number is drawn to decide
whether it has to be affected by a mutation. If so, the bit will be flipped (it will become 0 if it was 1 and
vice versa). This operator allows the “jump” to new regions of the experimental domain and avoids the risk of
being stuck in some specific conditions (if a gene is the same in all the chromosomes of the population, without
mutations the value of the corresponding variable will stay the same forever).

After the reproductions and the mutations, the new generation replaces the previous one and the algorithm
continues from the evaluation of the response. Figure 1 shows a flowchart of a GA.

In this paper, the authors will review the applications of GAs in three different areas (optimization;
quantitative structure-activity relationship (QSAR) and molecular modeling; multivariate calibration); a list of
miscellaneous examples in chemometrics will also be given.

2. APPLICATION OF GENETIC ALGORITHMS
IN OPTIMIZATION

Stochastic optimization techniques such as GAs are gaining increasing popularity in various fields of chemistry,
and the number of papers describing successful applications continues to grow at a quick rate [26–29]. These
methods are especially beneficial when the search space is complex with many local minima (or maxima), so that
conventional techniques fail to find the global minimum (or maximum) and a full search is not feasible. Although
it is generally accepted that stochastic methods are the best choice in a complex search space, there is no
guarantee that they will find the global optimum [29].

Hibbert [30] used GAs to optimize the rate coefficients for the hydrolysis of adenosine 5′-triphosphate by
fitting a kinetic model to concentration versus time data. The fastest convergence to a good optimum is achieved
by a hybrid GA in which a steepest descent, pseudo-Newton procedure is iterated with an incest-preventing GA,
each providing a starting point for the other. In a study by Hartke [31], a GA is used to find the global
minimum energy structure for Si4 on an empirical potential energy surface. Given a suitable encoding of the
cluster geometry, and an exponential scaling of the potential energy values to obtain a fitness function, the GA
can successfully optimize all degrees of freedom. With the number of potential energy function evaluations as a
measure, the GA is more economical than either a set of traditional local minimizations or a molecular
dynamics-simulated annealing approach.

Other applications of GAs to optimization are reported in the papers by Weber et al. [32], Jiang et al. [33],
Van Kampen et al. [34], Niesse and Mayne [35], Shaffer and Small [13], Lavine et al. [36], Hanger and
Huttner [37], Smith and Gemperline [38], Kabrede and Hentschke [39], and Chen et al. [40].
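
As an illustration only, the select-copy, cross-over, and mutation operators described in the overview of the GA
can be sketched in Python as follows. This is a minimal sketch and not the implementation of any of the cited
works: the function names, the representation of a chromosome as a list of genes (each gene a list of bits), the
roulette-wheel sampling with strictly positive responses, the even population size, and the complementary
assignment of parental genes to the two offspring are all assumptions made here.

```python
import random


def select_copy(population, responses, rng):
    """Draw N chromosomes with replacement, with probability proportional
    to the response (assumption: all responses are strictly positive)."""
    picked = rng.choices(population, weights=responses, k=len(population))
    return [[gene[:] for gene in chrom] for chrom in picked]


def cross_over(population, rng):
    """Pair the chromosomes at random (assumption: N is even) and, for each
    gene position, give the gene of one parent to the first offspring and
    the gene of the other parent to the second offspring."""
    shuffled = population[:]
    rng.shuffle(shuffled)
    offspring = []
    for parent_a, parent_b in zip(shuffled[0::2], shuffled[1::2]):
        child_1, child_2 = [], []
        for gene_a, gene_b in zip(parent_a, parent_b):
            if rng.random() < 0.5:
                gene_a, gene_b = gene_b, gene_a
            child_1.append(gene_a[:])
            child_2.append(gene_b[:])
        offspring.extend([child_1, child_2])
    return offspring


def mutate(population, p_mutation, rng):
    """Flip each bit independently with probability p_mutation."""
    return [[[1 - bit if rng.random() < p_mutation else bit for bit in gene]
             for gene in chrom] for chrom in population]


def next_generation(population, responses, p_mutation, rng):
    """One cycle: select-copy, then cross-over, then mutation."""
    selected = select_copy(population, responses, rng)
    return mutate(cross_over(selected, rng), p_mutation, rng)
```

In this sketch the cross-over exchanges whole genes, whereas the mutation acts on single bits, which mirrors the
distinction between gene level and bit level made in the text. The responses would have to be recomputed by
decoding and evaluating each chromosome before the next call to next_generation.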

[Figure 1. Flowchart of a GA. Box labels: Define the architecture of the GA (coding of the variables, number of
chromosomes, probability of mutation, response, termination criteria, …); Generate initial population;
Select-copy; Cross-over; Decode the chromosomes; Evaluate the response of each chromosome; Termination criteria
satisfied? (No); End.]

In 2005, Babic et al. [41] reported a method for optimization of a thin layer chromatography separation on the
basis of the use of a GA, and in 2006, Yu et al. [42] reported an application of a GA to optimize the buffer
system of micellar electrokinetic capillary chromatography for separating the active components contained in
Chinese medicine. Chedly et al. in 2009 [43] used a GA for multiobjective optimization of molded foam
characteristics. The effects of injection process parameters on the properties of molded foams are investigated.
The input optimization parameters considered are injection temperature, mold temperature, injection speed,
plasticization back pressure, and screw rotation speed during the plasticization phase. The output optimization
parameters considered are density, shock absorption, and acoustic absorption. Finally, models are used to carry
out multiobjective optimization of injected foam characteristics in the presence of a few constraints on
decision variables. This optimization is carried out using a very robust technique, Nondominated Sorting Genetic
Algorithm II. Several two-objective functions, involving sometimes the maximization and other times the
minimization of foam characteristics, have been studied to illustrate the procedures and to explain and
interpret the results obtained.
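
The notion of nondominated solutions that underlies this kind of multiobjective optimization can be illustrated
with a small Python sketch. It only extracts the first nondominated (Pareto) front from a set of candidate
solutions; it is not an implementation of Nondominated Sorting Genetic Algorithm II, which additionally ranks
the successive fronts and uses a crowding measure. All objectives are assumed to be minimized (an objective to be
maximized can simply be negated), and the function names and the numerical values are hypothetical.

```python
def dominates(a, b):
    """True if solution a is at least as good as b in every objective
    and strictly better in at least one (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def nondominated(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective values for five candidate solutions,
# for example (density, negated shock absorption).
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.5, 3.5)]
front = nondominated(candidates)
```

Here (3.0, 4.0) and (2.5, 3.5) are both dominated by (2.0, 3.0), so the front contains the three remaining
points, none of which can be improved in one objective without worsening the other.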

Recently, several papers have described applications of GAs in optimization, such as those by Madaeni
et al. [44], Cano-Odena et al. [45], Shi and Xue [46], and Vadood et al. [47]. Bhatti et al. in 2011 [48]
described a response surface methodology and artificial neural network (ANN) approach for electrocoagulation of
copper from simulated wastewater. Multiobjective optimization for maximizing [...]
Genetic algorithms in chemometrics
consumption was carried out using GAs over the ANN model. The optimization procedure resulted in the creation of
nondominated optimal points that gave an insight regarding the optimal operating conditions of the process.

Milani and Milani [49] presented a simple closed form equation for the prediction of cross-linking of ethylene
propylene diene monomer rubber during accelerated sulfur vulcanization. To estimate numerically the degree of
cross-linking, kinetic model constants are evaluated through a simple data fitting, performed on experimental
rheometer curves. The fitting procedure is a new one and is achieved using an ad-hoc GA, provided that a few
points, strictly required to estimate model unknown constants with sufficient accuracy, are selected from the
whole experimental curve. To assess the results obtained with the model proposed, a number of different
compounds are analyzed, for which experimental or numerical data are available from the literature. The
important cases of moderate and strong reversions are also considered, experiencing a convincing convergence of
the analytical model proposed.

3. APPLICATION OF GENETIC ALGORITHMS
IN QUANTITATIVE STRUCTURE-ACTIVITY
RELATIONSHIP/MOLECULAR MODELING

Quantitative structure-activity relationship and quantitative structure–property relationship (QSPR) studies are
essentially applied to chemometrics, pharmacodynamics, pharmacokinetics, toxicity, and so on. A major step in
constructing QSAR/QSPR models is finding one or more molecular descriptors. A wide variety of descriptors have
been reported to be used in QSAR analysis. Whether by traditional methods or multivariate-based techniques, the
success of a modeling study depends also on the selection of variables (molecular descriptors) and on the
representation of information. Variables should represent the maximum information in activity variations, and
collinearity among them must be kept to a minimum. Among different variable selection strategies, GAs are an
interesting, flexible, and widely used alternative [50,51].

In 1998, Hou et al. applied a GA to the QSAR study of pyrrolobenzothiazepinones and pyrrolobenzoxazepinones,
non-nucleoside inhibitors of HIV-1 reverse transcriptase [52]. In 1999, Meusinger and Moros [53] determined the
influence of the molecular structure of organic compounds on their knocking behavior by using a nonbinary GA.
The molecular structures of 240 potential gasoline components were described by 16 different structural groups.
Partial octane numbers were calculated for the structural groups related to the substance classes paraffins,
naphthenes, olefins, aromatics, and oxygenates. The sum of the calculated partial octane numbers supplies the
octane number of the compound. Multiple linear regression (MLR), a neural network, and a GA were used to compute
the connections between the structural groups and the knock ratings. Results obtained by GA were significantly
better than those obtained by MLR.

In 1999, Hou et al. applied GAs to the structure-activity correlation study of a group of non-nucleoside HIV-1
inhibitors and some cinnamamides [54,55]. In these studies, it has been demonstrated that GAs are very useful in
data analysis and that they can be applied as a very powerful technique in QSAR. The authors developed a QSAR
program combining a GA with MLR and cross-validation.

Some studies of GAs applied to QSAR/QSPR are reported in the papers by Hoffman et al. [56], Ros et al. [57],
Hemmateenejad et al. [58–60], Niculescu [61], Fatemi et al. [62], Kompani-Zareh [63], Guo et al. [64], Niazi
et al. [65], and Wang et al. [66].

Ghasemi and Ahmadi [67] applied GAs for variable selection in a QSAR study of a series of pure nonionic
surfactants containing linear alkyl, cyclic alkyl, and alkylphenyl ethoxylates. Modeling of the cloud point of
these compounds as a function of the theoretically derived descriptors was established by MLR and partial least
squares (PLS) regression. The results indicate that GA is a very effective variable selection approach for QSPR
analysis. The comparison of the two regression methods used showed that PLS has better prediction ability than
MLR.

Jalali-Heravi and Kyani [68] applied GA-KPLS (kernel PLS) as a novel nonlinear feature selection method in a QSAR
study. This technique combines GA as a powerful optimization method with KPLS as a robust nonlinear statistical
method for variable selection. This feature selection method is combined with ANN to develop a nonlinear QSAR
model for predicting activities of a series of substituted aromatic sulfonamides as carbonic anhydrase II
inhibitors. Superiority of this method (GA-KPLS-ANN) over MLR and GA-PLS-ANN (in which a linear feature selection
method has been used) indicates that the GA-KPLS approach is a powerful method for the variable selection in
nonlinear systems.

Gharagheizi [69] reported using GA-based MLR for solubility parameter studies. Recently, several papers have been
published by Riahi et al. [70], Ghavami et al. [71], Goodarzi et al. [72], Afiuni-Zadeh and Azimi [73], and Hao
et al. [74].

4. APPLICATIONS OF GENETIC ALGORITHMS
IN MULTIVARIATE CALIBRATION

Multivariate calibration is used to develop a quantitative relationship between the predictor variables in X and
the response variable(s) in Y. Recently, multivariate calibration underwent several enhancements/extensions
[75,76] that have found widespread use in analytical science. Nowadays, spectral data are perhaps the most common
type of data to which chemometric techniques are applied. Owing to the development of new instrumentation, data
sets in which each object is described by several hundreds of variables can be easily obtained. Calibration
methods, being based on latent variables, allow taking into account the whole spectrum without having to perform
a previous feature selection. In the last decades, it has nevertheless been recognized that an efficient feature
selection can be highly beneficial both to improve the predictive ability of the model and to greatly reduce its
complexity.

One of the greatest problems in multivariate analysis is to select the combination of variables that produces
the best result. This goal is attained through the elimination of those variables that produce noise or that,
although giving good information, are strongly correlated with other already selected variables. Feature
selection is very important both in studies of correlation and in studies of classification and modeling.

Genetic algorithms have found widespread application in several fields involving multivariate calibration because
one of the most important steps in a calibration is the selection of the relevant variables. Leardi et al. [77]
published one of the very first papers about the application of GAs to variable selection. Lucasius
A. Niazi and R. Leardi
and Kateman [78] showed that a GA generally performs better than simulated annealing and stepwise regression; on
the other hand, Horchner and Kalivas [79] demonstrated that simulated annealing can give the same results. Wise
et al. [80] also developed a GA for feature selection.

Broudiscou et al. [81] described a new technique based on GAs for constructing experimental designs; also, in
1996, Jouan-Rimbaud et al. [82] studied the random correlation in variable selection using GA in multivariate
calibration. Several papers about the application of GAs in multivariate calibration were published before 2000
[83–90].

In 2001, Liu and Wang [91,92] used a GA for the quantitative analysis of overlapped spectra in Fourier transform
infrared spectroscopy (FTIR) data, and Yoshida et al. [93] used a GA for feature selection in mass data. Leardi
et al. [94] used a GA for variable selection in multivariate calibration for predicting concentrations in
polymer films from FTIR data, and several researchers [95–115] published papers in which they used GAs for
variable selection in different fields such as spectroscopy, electrochemistry, and chromatography.

Goicoechea and Olivieri [95] presented a new method for wavelength interval selection with a GA to improve the
predictive ability of PLS calibration. It involves separately labeling each of the selected sensor ranges with an
appropriate inclusion ranking. The new approach intends to alleviate overfitting without the need of preparing an
independent monitoring sample set. A theoretical example is worked out to compare the performance of the new
approach with previous implementations of GAs. Two experimental data sets are also studied: target parameters are
the concentration of glucuronic acid in complex mixtures studied by Fourier transform mid-infrared spectroscopy
and the octane number in gasolines monitored by near-infrared spectroscopy. Ghasemi et al. [98] proposed GAs for
selecting wavelengths for PLS calibration using a spectrophotometric method. The method is based on the
development of the reaction between the analytes and Zincon reagent. A series of synthetic solutions containing
different concentrations of copper and zinc were used to check the prediction ability of the GA-PLS models.

Majidi et al. [104] used GAs for potential selection in a differential pulse voltammetry method for the
simultaneous determination of cysteine, tyrosine, and tryptophan on the unmodified glassy carbon electrode. The
main difficulty in the analysis of these analytes in the same samples is the high degree of overlapping of the
voltammograms. The relationships between the currents and the concentrations are complex and highly nonlinear.
The predictive abilities of principal component regression (PCR), PLS, GA-PLS, and principal component-artificial
neural network (PC-ANN) were examined for simultaneous determination of the three amino acids. For a regression
model, everything that does not help in constructing the model may be considered as noise. PC-ANN and GA-PLS use
significant data and show superiority over the other applied multivariate methods.

Genetic algorithms were employed in curve fitting [116]. In 1995, Benedetti and Morosetti [117] reported the
application of a GA to search for optimal and suboptimal RNA secondary structures. In 1996, Dods et al. [118]
used a GA approach for fitting polyatomic spectra. Kariuki et al. [119] described the development of GAs for
[...]

Hervas et al. [120] coupled GAs and pruning computational neural networks for the selection of the number of
inputs required to correct temperature variations in kinetic-based determinations. Giro et al. [121] developed a
new methodology to design conducting polymers on the basis of the use of GAs coupled to negative factor counting
techniques. The authors showed the results for a case study of polyanilines, one of the most important families
of conducting polymers. The methodology proved capable of generating automatic solutions for the problem of
determining the optimum relative concentration for binary and ternary disordered polyaniline alloys exhibiting
metallic properties.

Maeder et al. [122] reported the application of GAs to the task of determining initial parameter estimates that
lie near the global optimum. In iterative nonlinear least squares fitting, the reliable estimation of initial
parameters that lead to convergence to the global optimum can be difficult. Irrespective of the algorithm used,
poor parameter estimates can lead to abortive divergence or, in rare cases, convergence to a local optimum. For
the determination of the parameters of complex reaction mechanisms, where often little is known about what value
these parameters should take, the task of determining good initial estimates can be time consuming and
unreliable. In this contribution, the methodology of applying a GA to this task is explained. A generalized GA
was implemented according to the methodology, and the results of its application are also given. The parameter
estimates obtained were then used as the starting parameters for a gradient search method, which quickly
converged to the global optimum. The GA was successfully applied both to simulated kinetic measurements, where
the reaction mechanism contained one equilibrium constant and two rate constants to be fitted, and to kinetic
measurements of the complexation.

Fatemi et al. [123] used GAs in kinetic modeling and reaction mechanism studies. This study is focused on the
development of a systematic computational approach that implements GA to find the optimal rigorous kinetic
models. This model consists of eight continuous parameters (e.g., Arrhenius and Van’t Hoff parameters) and six
discrete parameters representing the order of the reaction with respect to each concentration. The optimal values
of these parameters have been obtained on the basis of GA. Furthermore, the best type of genetic operators and
their corresponding parameters for this type of problem have been obtained on the basis of a comprehensive study
of the effect of these parameters on the efficiency of the GA.

Gianoli et al. [124] reported the application of GAs in kinetic modeling, and Sadi and Dabir [125] applied GAs
for the determination of kinetic parameters of free radical polymerization of vinyl acetate by a multiobjective
optimization technique. Harris [126] studied applications of GAs for obtaining structure solution from powder
X-ray diffraction data, and Guruprasad and Behera [127] applied GAs to textiles.

Acknowledgement

Financial support from the Italian Ministry of University and Research (PRIN 2008, CUP:D31J0000020001) is
gratefully
