An artificial neural network (ANN) is successfully presented for prediction of the acidity constants (pKa) of various benzoic acids and phenols with diverse chemical structures using a nonlinear quantitative structure-property relationship. A three-layered feed-forward ANN with back-propagation of error was generated using the six molecular descriptors appearing in the multi-parameter linear regression (MLR) model. The polarizability term (πI), most positive charge of the acidic hydrogen atom (q+), molecular weight (MW), most negative charge of the acidic oxygen atom (q−), the hydrogen-bond accepting ability (εB) and the partial charge weighted topological electronic (PCWTE) descriptors are the inputs, and the output is pKa. It was found that a properly selected and trained neural network with 205 compounds could fairly represent the dependence of the acidity constant on the molecular descriptors. For evaluation of the predictive power of the generated ANN, the optimized network was applied for prediction of the pKa values of 37 compounds in the prediction set, which were not used in the optimization procedure. The squared correlation coefficient (R²) and root mean square error (RMSE) of 0.9147 and 0.9388 for the prediction set by the MLR model should be compared with the values of 0.9939 and 0.2575 by the ANN model. These improvements are due to the fact that the acidity constant of benzoic acids and phenols in water shows nonlinear correlations with the molecular descriptors.

Key Words: Quantitative structure-property relationship, Artificial neural networks, Acidity constant, Phenols, Benzoic acids
Quantitative structure-property/activity relationships (QSPR/QSAR) based on theoretical descriptors are a powerful tool not only for prediction of the chemical, physical and biological properties/activities of compounds, but also for deeper understanding of the detailed mechanisms of interactions in complex systems that predetermine these properties/activities.1-10 QSPR/QSAR models are essentially calibration models in which the independent variables are molecular descriptors that describe the structure of the molecules and the dependent variable is the property or activity of interest. Since these theoretical descriptors are determined solely from computational methods, a priori predictions of the properties/activities of compounds are possible; no laboratory measurements are needed, thus saving time, space, materials and equipment and alleviating safety (toxicity) and disposal concerns. An enormous number of descriptors have been used by researchers to increase the ability to correlate biological, chemical and physical properties. To obtain a significant correlation, it is crucial that appropriate descriptors be employed.11,12 Various methods for constructing QSPR/QSAR models have been used, including multi-parameter linear regression (MLR), principal component analysis (PCA) and partial least-squares regression (PLS).13-16 In some cases, it is more convenient that a linear relationship between property/activity …

Artificial neural networks (ANNs) are programs designed to simulate the way in which the human brain processes information. ANNs gather their knowledge by detecting the patterns and relationships in data and learn (or are trained) through experience, not from programming. There are many types of neural networks designed by now, and new ones are invented every week.20 The behavior of a neural network is determined by the transfer functions of its neurons, by the learning rule, and by the architecture itself. An ANN is formed from artificial neurons or processing elements (PE), connected with coefficients (weights), which constitute the neural structure and are organized in layers. The first layer is termed the input layer, and the last layer is the output layer. The layers of neurons between the input and output layers are called hidden layers. Neural networks do not need an explicit formulation of the mathematical or physical relationships of the handled problem. This gives ANNs an advantage over traditional fitting methods for some chemical applications. For these reasons, in recent years ANNs have been applied to a wide variety of chemical problems such as simulation of mass spectra, ion interaction chromatography, aqueous solubility and partition coefficient, simulation of nuclear magnetic resonance spectra, prediction of bioconcentration factor, solvent effects on reaction rate and prediction of normalized polarity parameter in mixed solvent systems.21-36
It has been shown that the acid-base properties affect the toxicity, chromatographic retention behavior and pharmaceutical properties of organic acids and bases.37,38 On the other hand, interpretation and prediction of pKa values for chemical compounds are of general importance and usefulness for chemists.39 Although in recent years several theoretical studies have been performed for correlation of pKa values with molecular parameters, in these studies only linear equations have been used.38-46

The main aim of the present work is to develop linear and nonlinear QSPR models based on molecular descriptors for prediction of the pKa values of various benzoic acids and phenols with diverse chemical structures (including 242 compounds).
Theory

A detailed description of the theory behind a neural network has been adequately given by different researchers.17-19 There are many types of neural network architectures, but the type that has been most useful for QSAR/QSPR studies is the multilayer feed-forward network with the back-propagation (BP) learning rule.20 The number of neurons in the input and output layers is defined by the system's properties. The number of neurons in the hidden layer could be considered as an adjustable parameter, which should be optimized. The input layer receives the experimental or theoretical information. The output layer produces the calculated values of the dependent variable. The use of ANNs consists of two steps: "training" and "prediction". In the training phase the optimum structure, weight coefficients and biases are searched for. These parameters are found from the training and validation data sets. After the training phase, the trained network can be used to predict (or calculate) the outputs from a set of inputs. ANNs allow one to estimate relationships between input variables and one or several output dependent variables. The ANN reads the input and target values in the training data set and changes the values of the weighted links to reduce the difference between the calculated output and target values. The error between output and target values is minimized across many training cycles until the network reaches a specified level of accuracy. If a network is left to train for too long, however, it will overtrain and will lose the ability to generalize.22-36
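To make the forward-pass description above concrete, the following minimal Python/NumPy sketch propagates one input vector through a three-layer feed-forward network with a sigmoidal hidden layer. The layer sizes, the linear output neuron and the random weights are illustrative placeholders, not the trained network of this work:

import numpy as np

def sigmoid(z):
    # sigmoidal transfer function of the hidden-layer neurons
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hid, b_hid, w_out, b_out):
    # input layer -> hidden layer (sigmoid) -> single output neuron (linear here)
    h = sigmoid(w_hid @ x + b_hid)
    return float(w_out @ h + b_out)

rng = np.random.default_rng(0)
w_hid = rng.uniform(0.0, 1.0, size=(5, 6))   # initial weights chosen between 0 and 1
b_hid = rng.uniform(0.0, 1.0, size=5)
w_out = rng.uniform(0.0, 1.0, size=5)
b_out = rng.uniform(0.0, 1.0)

x = rng.uniform(0.1, 0.9, size=6)            # one normalized six-descriptor input vector
print(forward(x, w_hid, b_hid, w_out, b_out))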
Experimental Section

… to calculate some of the theoretical descriptors, the molecular geometries of the molecules were further optimized with the same algorithm in the MOPAC program version 6.0. The other molecular electronic descriptors were calculated by the Dragon package version 2.1.48 For this purpose, the output of the HyperChem software for each compound was fed into the Dragon program and the descriptors were calculated. As a result, a total of 18 theoretical descriptors were calculated for each compound in the data sets (242 compounds).

Linear correlations. Acidity constants of benzoic acids and phenols are literature values at 25 °C.49 An MLR model was developed for prediction of the pKa values by the molecular descriptors. The method of stepwise multi-parameter linear regression was used to select the most important descriptors and to calculate the coefficients relating the pKa to the descriptors. The MLR models were generated using the SPSS/PC software package, release 9.0.
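The stepwise selection itself was done in SPSS; as a rough, hedged illustration of the idea only, the sketch below performs a simple forward stepwise selection with ordinary least squares in Python. The R²-gain stopping criterion and all names are illustrative assumptions, not the SPSS procedure:

import numpy as np

def fit_ols(X, y):
    # least-squares fit with an intercept; returns coefficients and R^2
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return coef, r2

def forward_stepwise(X, y, min_gain=0.005):
    # add descriptors one at a time, keeping the one that raises R^2 most,
    # until the improvement falls below min_gain (illustrative criterion)
    selected, best_r2 = [], 0.0
    while True:
        gains = {}
        for j in range(X.shape[1]):
            if j in selected:
                continue
            _, r2 = fit_ols(X[:, selected + [j]], y)
            gains[j] = r2 - best_r2
        if not gains:
            break
        j_best = max(gains, key=gains.get)
        if gains[j_best] < min_gain:
            break
        selected.append(j_best)
        best_r2 += gains[j_best]
    return selected, fit_ols(X[:, selected], y)[0]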
Neural network generation. The specification of a typical neural network model requires the choice of the type of inputs, the number of hidden layers, the number of neurons in each hidden layer and the connection structure between the input and output layers. The number of input nodes in the ANNs was equal to the number of molecular descriptors in the MLR model. A three-layer network with a sigmoidal transfer function was designed. The initial weights were randomly selected between 0 and 1. Before training, the input and output values were normalized between 0.1 and 0.9. The optimization of the weights and biases was carried out according to the resilient back-propagation algorithm.50 The data set was randomly divided into three groups: a training set, a validation set and a prediction set consisting of 168, 37 and 37 molecules, respectively. The training and validation sets were used for the model generation, and the prediction set was used for evaluation of the generated model, because a prediction set is a better estimator of the ANN generalization ability than a validation (monitoring) set.51
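A minimal sketch of this data preparation and network setup is given below in Python. scikit-learn's MLPRegressor is used as a stand-in, since the original work used the Matlab Neural Network Toolbox with resilient back-propagation, which scikit-learn does not provide; the 0.1-0.9 scaling, the six inputs, the single sigmoidal hidden layer and the 168/37/37 split follow the text, while the solver, seeds and placeholder data are illustrative assumptions:

import numpy as np
from sklearn.neural_network import MLPRegressor

def scale_01_09(a, lo, hi):
    # linear scaling of values into the interval [0.1, 0.9]
    return 0.1 + 0.8 * (a - lo) / (hi - lo)

# X: (242, 6) descriptor matrix, y: (242,) experimental pKa values (placeholders here)
rng = np.random.default_rng(42)
X, y = rng.normal(size=(242, 6)), rng.normal(size=242)

lo, hi = X.min(axis=0), X.max(axis=0)
Xs = scale_01_09(X, lo, hi)
ys = scale_01_09(y, y.min(), y.max())

# random 168/37/37 split into training, validation and prediction sets
idx = rng.permutation(len(y))
train, valid, pred = idx[:168], idx[168:205], idx[205:]

# three-layer network: 6 inputs -> one sigmoidal ("logistic") hidden layer -> 1 output
# (24 hidden nodes anticipates the optimum reported later in the paper)
net = MLPRegressor(hidden_layer_sizes=(24,), activation="logistic",
                   solver="lbfgs", max_iter=5000, random_state=0)
net.fit(Xs[train], ys[train])
print("validation RMSE (scaled units):",
      np.sqrt(np.mean((net.predict(Xs[valid]) - ys[valid]) ** 2)))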
The performances of training, validation and prediction of the ANNs are evaluated by the mean percentage deviation (MPD) and root-mean-square error (RMSE), which are defined as follows:

\mathrm{MPD} = \frac{1}{N}\sum_{i=1}^{N}\frac{P_i^{\mathrm{exp}} - P_i^{\mathrm{cal}}}{P_i^{\mathrm{exp}}}    (1)

\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N}\left(P_i^{\mathrm{cal}} - P_i^{\mathrm{exp}}\right)^{2}}{N}}    (2)

where Pi(exp) and Pi(cal) are the experimental and calculated values, respectively, and N is the number of compounds in the data set. The processing of the data was carried out using Matlab 6.5.52 The neural networks were implemented using the Neural Network Toolbox Ver. 4.0 for Matlab.50
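Eqs. (1) and (2) translate directly into the following small Python functions; this is only a restatement of the formulas (the original processing was done in Matlab). The MPD values tabulated later in the paper appear to be expressed as percentages, i.e. multiplied by 100:

import numpy as np

def mpd(p_exp, p_cal):
    # mean percentage deviation, Eq. (1)
    p_exp, p_cal = np.asarray(p_exp, float), np.asarray(p_cal, float)
    return np.mean((p_exp - p_cal) / p_exp)

def rmse(p_exp, p_cal):
    # root-mean-square error, Eq. (2)
    p_exp, p_cal = np.asarray(p_exp, float), np.asarray(p_cal, float)
    return np.sqrt(np.mean((p_cal - p_exp) ** 2))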
Results and Discussion

A major challenge in the development of MLR equations is connected with the possible multicollinearity of the molecular descriptors.53 In order to decrease the redundancy existing in the descriptor data matrix, the correlation of the descriptors with each other and with the pKa of the compounds was examined and collinear descriptors were detected (r > 0.85). Among the collinear descriptors, the one with the lowest correlation with the property was removed from the data matrix. Table 1 demonstrates that all of the descriptors are strongly orthogonal, which reflects the statistical reliability of the model.
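A hedged Python sketch of this descriptor-screening step (pairwise correlation above r = 0.85, dropping from each collinear pair the descriptor less correlated with pKa) might look as follows; only the threshold comes from the text, everything else is illustrative:

import numpy as np

def drop_collinear(X, y, names, r_cut=0.85):
    # remove, from each collinear pair (|r| > r_cut), the descriptor with the
    # lower absolute correlation to the property y
    keep = list(range(X.shape[1]))
    corr_y = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in keep]
    changed = True
    while changed:
        changed = False
        for a in keep:
            for b in keep:
                if a < b and abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) > r_cut:
                    keep.remove(a if corr_y[a] < corr_y[b] else b)
                    changed = True
                    break
            if changed:
                break
    return [names[j] for j in keep]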
Multi-parameter linear correlation of the pKa values of 168 benzoic acids and phenols versus the molecular descriptors in the training set gives the results in Table 2. It can be seen from this table that six descriptors appear in the MLR model. These descriptors are: the polarizability index (πI), the most positive charge of the acidic hydrogen atom (q+), the molecular weight (MW), the most negative charge of the acidic oxygen atom (q−), the hydrogen-bond accepting ability (εB) and the partial charge weighted topological electronic (PCWTE) descriptors. The negative coefficients for the πI, q+, q− and MW descriptors indicate that the pKa of the compounds decreases as these descriptors increase. The acidity constant of the compounds decreases with … phenolic oxygen atom increases with increasing these descriptors. The effects of πI, q+ and MW on pKa are larger than those of the other descriptors, because the standardized coefficients of πI, q+ and MW are higher than those of the other descriptors.

The calculated values of pKa for the compounds in the training, validation and prediction sets using the MLR model have been plotted versus the experimental values in Figure 1.

The next step in this work was the generation of the ANN model. There are no rigorous theoretical principles for choosing the proper network topology, so different structures were tested in order to obtain the optimal numbers of hidden neurons and training cycles.36 Before training the network, the number of nodes in the hidden layer was optimized. In order to do so, several training sessions were conducted with different numbers of hidden nodes.
Table 2. Descriptors, symbols and results of the multi-parameter linear regression (MLR) modela

No.  Descriptor                                          Symbol   Coefficient   β
1    polarizability term                                 πI       −8.3610       0.080
2    most positive charge of acidic hydrogen atom        q+       −110.4710     0.521
3    molecular weight                                     MW       −0.0051       0.074
4    most negative charge of the phenolic oxygen atom    q−       −26.3940      0.321
5    the hydrogen-bond accepting ability                 εB       34.4450       0.080
6    partial charge weighted topological electronic      PCWTE    0.0902        0.101
7    constant                                                      42.2780

aThe β is the standardized coefficient of the descriptors. The polarizability term (πI) is obtained by dividing the polarizability volume by the molecular volume. The εB is equal to 0.3 − 0.01(Elw − Eh), in which Elw and Eh refer to the LUMO energy of water and the HOMO energy of the compound, respectively.
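Written out with the coefficients listed in Table 2, the MLR model corresponds to the following linear equation (a direct restatement of the tabulated values):

\mathrm{p}K_{\mathrm{a}} = 42.2780 - 8.3610\,\pi_{I} - 110.4710\,q^{+} - 0.0051\,\mathrm{MW} - 26.3940\,q^{-} + 34.4450\,\varepsilon_{B} + 0.0902\,\mathrm{PCWTE}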
The root-mean-square errors of the training (RMSET) and validation (RMSEV) sets were obtained at various iterations for different numbers of neurons in the hidden layer, and the minimum value of RMSEV was recorded as the optimum value. The plot of RMSET and RMSEV versus the number of nodes in the hidden layer is shown in Figure 2. It is clear that twenty-four nodes in the hidden layer is the optimum value.

Figure 2. Plot of RMSET and RMSEV versus the number of nodes in the hidden layer.

In optimizing the number of training cycles, the iterations are stopped when overtraining begins. To control the overtraining of the network during the training procedure, the values of RMSET and RMSEV were calculated and recorded to monitor the extent of the learning in various iterations. The results obtained showed that after 77000 iterations the value of RMSEV started to increase slightly and overfitting began (Figure 3).
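A hedged Python sketch of these two optimization loops (scanning the number of hidden nodes, then monitoring RMSET and RMSEV over successive blocks of iterations and stopping when RMSEV starts to rise) is shown below. It again uses scikit-learn's MLPRegressor on placeholder data as a stand-in for the Matlab toolbox; the node range, block size and stopping rule are illustrative assumptions:

import numpy as np
from sklearn.neural_network import MLPRegressor

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# placeholder data standing in for the scaled descriptor/pKa sets
rng = np.random.default_rng(1)
X_train, y_train = rng.random((168, 6)), rng.random(168)
X_valid, y_valid = rng.random((37, 6)), rng.random(37)

# 1) scan the number of hidden nodes and keep the one with the lowest RMSEV
best = None
for n_hidden in range(2, 31):
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation="logistic",
                       solver="adam", max_iter=2000, random_state=0)
    net.fit(X_train, y_train)
    rmsev = rmse(y_valid, net.predict(X_valid))
    if best is None or rmsev < best[1]:
        best = (n_hidden, rmsev)
print("optimum hidden nodes (on this toy data):", best[0])

# 2) monitor RMSET and RMSEV as training proceeds; stop when RMSEV starts to rise
net = MLPRegressor(hidden_layer_sizes=(best[0],), activation="logistic",
                   solver="adam", max_iter=50, warm_start=True, random_state=0)
history = []
for block in range(100):                   # each fit call adds up to 50 more iterations
    net.fit(X_train, y_train)
    history.append((rmse(y_train, net.predict(X_train)),
                    rmse(y_valid, net.predict(X_valid))))
    if len(history) > 5 and history[-1][1] > history[-2][1] > history[-3][1]:
        break                              # RMSEV rising: overtraining has begun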
The generated ANN was then trained using the training and validation sets for the optimization of the weights and biases. For the evaluation of the predictive power of the generated ANN, the optimized network was applied for prediction of the pKa values of the compounds in the prediction set. The correlation between the calculated values of pKa from the ANN model and the experimental values is as follows:

pKa(cal) = 0.99299 pKa(exp) + 0.04454    (4)

(R² = 0.9931; MPD = 4.5044; RMSE = 0.2648; F = 34295.94)

Similarly, the correlation of pKa(cal) versus pKa(exp) …
Table 3. Experimental and calculated values of pKa for various benzoic acids and phenols in water at 25 °C for the training, validation and prediction sets by the multi-parameter linear regression (MLR) and artificial neural network (ANN) models, along with the individual percent deviation (IPD)a

No.  Compound  Exp.  MLR  IPD(MLR)  ANN  IPD(ANN)
214 2,5-dimethylphenol 10.22 10.115 −1.03 10.31 0.88
215 3,4-dinitrophenol 5.424 7.121 31.29 5.319 −1.94
216 4-ethylphenol 10.0 10.064 0.64 10.293 2.93
217 2-hydroxybenzaldehyde 8.34 9.833 17.90 8.155 −2.22
218 4-hydroxybenzonitrile 7.95 7.911 −0.49 8.166 2.72
219 4-hydroxy-3-methoxybenzaldehyde 7.396 7.974 7.82 7.896 6.76
220 3-iodophenol 8.879 8.099 −8.78 8.921 0.47
221 4-methoxyphenol 10.20 9.587 −6.01 10.282 0.80
222 4-methylsulfonylphenol 7.83 7.936 1.35 7.647 −2.34
223 4-nitrophenol 7.150 7.232 1.15 7.219 0.97
224 3-phenylphenol 9.63 8.485 −11.89 9.671 0.43
225 1,3,5-trihydroxybenzene 8.45 8.929 5.67 8.107 −4.06
226 3-acetoxybenzoic acid 4.00 3.47 −13.25 3.822 −4.45
227 anthracene-2-carboxylic acid 4.18 2.148 −48.61 4.186 0.14
228 1,2,3,5-benzenetetracarboxylic acid 2.38 1.625 −31.72 2.379 −0.04
229 2-benzoylbenzoic acid 3.54 3.223 −8.95 3.185 −10.03
230 2-bromo-6-nitrobenzoic acid 1.37 2.004 46.28 0.957 −30.15
231 2-chloro-3-nitrobenzoic acid 2.02 2.266 12.18 2.536 25.54
232 4-cyanobenzoic acid 3.55 2.619 −26.23 3.873 9.10
233 2,6-dihydroxybenzoic acid 1.30 2.864 120.31 1.084 −16.62
234 3,4-dimethylbenzoic acid 4.41 4.471 1.38 4.255 −3.51
235 3,4-dinitrobenzoic acid 2.82 2.251 −20.18 2.738 −2.91
236 2-hydroxybenzoic acid 2.98 4.091 37.28 3.313 11.17
237 2-hydroxy-4-methylbenzoic acid 3.17 4.308 35.90 3.128 −1.32
238 3-iodobenzoic acid 3.86 2.771 −28.21 3.529 −8.58
239 2-methylbenzoic acid 3.90 4.357 11.72 3.749 −3.87
240 2-methyl-6-nitrobenzoic acid 1.87 4.44 137.43 1.939 3.69
241 3-nitrobenzene-1,2-dicarboxylic acid 1.88 2.334 24.15 1.872 −0.43
242 4-phenoxybenzoic acid 4.52 3.194 −29.34 3.993 −11.66
aExp. refers to the experimental values of pKa; MLR and ANN refer to the multi-parameter linear regression and artificial neural network calculated values of pKa, respectively.
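The individual percent deviation (IPD) in Table 3 appears to be the signed percent difference between the calculated and experimental pKa values; a quick check in Python against entry No. 214 (2,5-dimethylphenol) reproduces the tabulated figures:

def ipd(p_exp, p_cal):
    # individual percent deviation: signed percent difference from experiment
    return 100.0 * (p_cal - p_exp) / p_exp

print(round(ipd(10.22, 10.115), 2))  # MLR entry: -1.03
print(round(ipd(10.22, 10.31), 2))   # ANN entry:  0.88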
Figure 4. Plot of the calculated values of pKa from the ANN model versus the experimental values for the training, validation and prediction sets.

Figure 5. Plot of the residuals for the calculated values of pKa from the ANN model versus the experimental values for the prediction set.
Table 4. Comparison of statistical parameters obtained by the MLR and ANN models for correlation of the acidity constant of phenols and benzoic acids with the molecular descriptorsa

Model   R²tot    R²train   R²valid   R²pred    RMSEtot   RMSEtrain   RMSEvalid   RMSEpred
MLR     0.9266   0.9268    0.9400    0.9147    0.8610    0.8553      0.8034      0.9388
ANN     0.9931   0.9926    0.9943    0.9939    0.2648    0.2700      0.2479      0.2575
aSubscript train refers to the training set, valid refers to the validation set, pred refers to the prediction set, and tot refers to the total data set.

The values of RMSE for the training, validation and prediction sets (and the other statistical parameters in Table 4) for the MLR and ANN models show the superiority of the nonlinear model over the regression model. The root-mean-square error of 0.9388 for the prediction set by the MLR model should be compared with the value of 0.25751 for the ANN model. Since the improvement of the results obtained using the nonlinear model (ANN) is considerable, it can be concluded that the dependence of the pKa values of the compounds in water on the molecular descriptors is markedly nonlinear.

Acknowledgements. The authors wish to acknowledge the vice-presidency of research, University of Mohaghegh Ardebili, for financial support of this work.

References

23. Bunz, A. P.; Braun, B.; Janowsky, R. Fluid Phase Equilib. 1999, 158, 367.
24. Homer, J.; Generalis, S. C.; Robson, J. H. Phys. Chem. Chem. Phys. 1999, 1, 4075.
25. Goll, E. S.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 1999, 39, 974.
26. Vendrame, R.; Braga, R. S.; Takahata, Y.; Galvao, D. S. J. Chem. Inf. Comput. Sci. 1999, 39, 1094.
27. Gaspelin, M.; Tusar, L.; Smid-Korbar, J.; Zupan, J.; Kristl, J. Int. J. Pharm. 2000, 196, 37.
28. Gini, G.; Cracium, M. V.; Konig, C.; Benfenati, E. J. Chem. Inf. Comput. Sci. 2004, 44, 1897.
29. Urata, S.; Takada, A.; Uchimaru, T.; Chandra, A. K.; Sekiya, A. J. Fluorine Chem. 2002, 116, 163.
30. Koziol, J. Internet Electron. J. Mol. Des. 2003, 2, 315.
31. Wegner, J. K.; Zell, A. J. Chem. Inf. Comput. Sci. 2003, 43, 1077.
32. Valkova, I.; Vracko, M.; Basak, S. C. Anal. Chim. Acta 2004, 509, 179.
33. Sebastiao, R. C. O.; Braga, J. P.; Yoshida, M. I. Thermochimica Acta 2004, 412, 107.