Heat of Formation For DPE
2007, 8, 407-432
International Journal of
Molecular Sciences
ISSN 1422-0067
© 2007 by MDPI
www.mdpi.org/ijms/
Full Research Paper
Abstract: The standard enthalpies of formation of 1115 compounds from all chemical groups
were predicted using genetic algorithm-based multivariate linear regression (GA-MLR).
The five-descriptor multivariate linear model obtained by GA-MLR has a squared correlation
coefficient of $R^2 = 0.9830$. All molecular descriptors that entered this model are
calculated from the chemical structure of the molecule. As a result, the model is easy to
apply to any compound and is accurate.
1. Introduction
Physical and thermodynamic property data of compounds are needed in the design and operation
of industrial chemical processes. Among them, the standard enthalpy of formation (or standard heat
of formation), $\Delta H_f^{\circ}$, is an important fundamental physical property of a compound. It is
defined as the change of enthalpy that accompanies the formation of 1 mole of the compound in its
standard state from its constituent elements in their standard states (the most stable form of the
element at 1 atm of pressure and the specified temperature, usually 298 K or 25 degrees Celsius).
All elements in their standard states (such as hydrogen gas, solid carbon in the form of graphite,
etc.) have a standard enthalpy of formation of zero, as there is no change involved in their formation.
The standard enthalpy of formation is used in thermochemistry to find the standard enthalpy
change of a reaction. This is done by subtracting the sum of the standard enthalpies of formation
of the reactants from the sum of the standard enthalpies of formation of the products, as shown
in the equation below.
$$\Delta H_{reaction}^{\circ} = \sum_{p} \Delta H_f^{\circ} - \sum_{r} \Delta H_f^{\circ} \qquad (1)$$
where $\Delta H_{reaction}^{\circ}$, $\sum_{p} \Delta H_f^{\circ}$, and $\sum_{r} \Delta H_f^{\circ}$ are the standard
enthalpy change of reaction, the sum of the standard enthalpies of formation of the products, and the
sum of the standard enthalpies of formation of the reactants, respectively.
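As a worked illustration of Eq. (1), the short sketch below evaluates the combustion of methane. The formation enthalpies are commonly tabulated textbook values quoted only for illustration (they are not taken from this paper), and the sums are weighted by stoichiometric coefficients, which Eq. (1) leaves implicit.

```python
# Standard enthalpies of formation in kJ/mol at 298 K.
# Assumption: commonly tabulated textbook values, used here only for illustration.
DHF = {"CH4(g)": -74.8, "O2(g)": 0.0, "CO2(g)": -393.5, "H2O(l)": -285.8}


def reaction_enthalpy(products, reactants, dhf=DHF):
    """Eq. (1): sum over products minus sum over reactants.

    Each side maps a species name to its stoichiometric coefficient.
    """
    def side_total(side):
        return sum(n * dhf[species] for species, n in side.items())

    return side_total(products) - side_total(reactants)


# CH4 + 2 O2 -> CO2 + 2 H2O(l)
dh_combustion = reaction_enthalpy(
    products={"CO2(g)": 1, "H2O(l)": 2},
    reactants={"CH4(g)": 1, "O2(g)": 2},
)
print(f"{dh_combustion:.1f} kJ/mol")
```

Note that O2, an element in its standard state, contributes zero, as stated in the definition above.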
There are many methods for calculating $\Delta H_f^{\circ}$ in the literature, but only three are
widely used: the Benson method [1], the Joback and Reid method [2], and the Constantinou and Gani
method [3]. All of them are group contribution methods, in which the property of a compound is
estimated as the sum of the contributions of the simple chemical groups that occur in its molecular
structure. They provide the important advantage of rapid estimates without requiring substantial
computational resources.
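The idea of a group contribution estimate can be sketched in a few lines. The example below follows the general form of a Joback-type first-order estimate (a constant plus a sum of group increments). The constant and the two group values are quoted as they are commonly given for the Joback and Reid method and should be checked against the original tables in [2] before any real use; the function and dictionary names are hypothetical.

```python
# Group increments for the ideal-gas enthalpy of formation at 298 K, kJ/mol.
# Assumption: values as commonly quoted for the Joback and Reid method;
# verify against the original tables before relying on them.
JOBACK_DHF_GROUPS = {"-CH3": -76.45, "-CH2-": -20.64}
JOBACK_DHF_CONSTANT = 68.29


def group_contribution_dhf(group_counts):
    """Estimate dHf as a constant plus the sum of n_i * contribution_i."""
    increments = sum(n * JOBACK_DHF_GROUPS[g] for g, n in group_counts.items())
    return JOBACK_DHF_CONSTANT + increments


# Propane, CH3-CH2-CH3: two methyl groups and one methylene group.
propane_dhf = group_contribution_dhf({"-CH3": 2, "-CH2-": 1})
print(f"{propane_dhf:.2f} kJ/mol")
```

The only input is a count of groups read from the structure, which is why such methods are fast but limited to the groups that have been tabulated.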
The application of quantitative structure-property relationship (QSPR) models to the prediction and
estimation of physical properties of materials is developing rapidly [4-5]. In QSPR, advanced
mathematical methods (genetic algorithms, neural networks, etc.) are used to find a relation
between the property of interest and basic molecular properties that are obtained solely from the
chemical structure of compounds and are called "molecular descriptors".
In this study, a new QSPR model for the prediction of $\Delta H_f^{\circ}$ of 1115 organic compounds
is presented. These 1115 compounds belong to all families of materials; as a result, the obtained
model can be applied to predict $\Delta H_f^{\circ}$ for any compound.
Many compilations of $\Delta H_f^{\circ}$ have been published in the literature; among them, we
selected the DIPPR 801 compilation [6], which is recommended by AIChE (American Institute of
Chemical Engineers). From this database, 1115 compounds were selected and their
$\Delta H_f^{\circ}$ values were extracted.
The calculation of molecular descriptors requires optimized chemical structures of the compounds.
The chemical structures of all 1115 compounds in our data set were drawn in HyperChem software [7]
and pre-optimized using the MM+ molecular mechanics force field. A more precise optimization was
then done with the PM3 semi-empirical method in HyperChem.
In the next step, molecular descriptors were calculated for all 1115 compounds with Dragon software
[8]. Dragon can calculate 1664 molecular descriptors for any chemical structure. After the
molecular descriptors had been calculated for all 1115 chemical structures, non-informative
descriptors had to be rejected from the Dragon output. First, descriptors with a standard deviation
lower than 0.0001 were rejected because they were nearly constant. Second, descriptors with only one
value different from the remaining ones were rejected. Third, the pair correlation of every two
descriptors was checked, and one of two descriptors with a correlation coefficient equal to one (the
threshold value) was excluded. For each pair of correlated descriptors, the one showing the highest
pair correlation with the other descriptors was rejected from the pool of descriptors.
Finally, the pool of molecular descriptors was reduced by deleting descriptors that could not be
calculated for every structure in our data set.
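The screening steps above can be sketched as follows. This is a minimal illustration on a plain NumPy matrix, not the procedure as implemented in Dragon. The function name, the tolerances, and the use of the mean absolute correlation to decide which member of a correlated pair to drop are assumptions. Descriptors with missing values are removed first here (the text removes them last) so that the statistics are well defined.

```python
import numpy as np


def filter_descriptors(X, std_tol=1e-4, corr_tol=1.0 - 1e-12):
    """Return the column indices of X (compounds x descriptors) that survive screening."""
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.isnan(col).any():      # not calculable for every structure
            continue
        if col.std() < std_tol:      # nearly constant descriptor
            continue
        _, counts = np.unique(col, return_counts=True)
        if len(counts) == 2 and counts.min() == 1:   # only one value differs
            continue
        keep.append(j)
    if len(keep) < 2:
        return keep

    # Pairwise correlation check among the surviving descriptors.
    corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    np.fill_diagonal(corr, 0.0)
    mean_corr = corr.mean(axis=0)    # how strongly each descriptor correlates with the rest
    dropped = set()
    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            if a in dropped or b in dropped:
                continue
            if corr[a, b] >= corr_tol:
                # Drop the member of the pair that is more correlated with the others.
                dropped.add(a if mean_corr[a] >= mean_corr[b] else b)
    return [j for i, j in enumerate(keep) if i not in dropped]


X_demo = np.column_stack([
    [1, 2, 3, 4, 5, 6],          # informative
    [5, 5, 5, 5, 5, 5],          # constant
    [0, 0, 0, 0, 0, 1],          # only one value differs
    [2, 4, 6, 8, 10, 12],        # perfectly correlated with column 0
    [3, 1, 4, 1, 5, 9],          # informative
    [1, np.nan, 2, 3, 4, 5],     # missing for one compound
])
kept = filter_descriptors(X_demo)
print(kept)
```

In the demo, columns 1, 2, and 5 are rejected outright, and exactly one of the perfectly correlated columns 0 and 3 is retained.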
As a result, of the 1664 calculated molecular descriptors, only 1477 remained in the pool of
molecular descriptors after this first stage.
At this stage, 20% of our database (223 compounds) was randomly removed and placed in a test set as
an excluded data set. This test set was used in the following steps only for testing the predictive
power of the obtained model and was not used for developing the model. The remaining 80% (892
compounds) of our data set was used as the training set.
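A random 80/20 split of this kind can be sketched as below. The function name, the fixed seed, and the use of integer compound identifiers are assumptions made for illustration; the original random selection is not specified in the text.

```python
import random


def split_train_test(ids, test_fraction=0.20, seed=0):
    """Randomly split compound identifiers into a training and a test list."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = split_train_test(range(1115))
print(len(train_ids), len(test_ids))
```

With 1115 compounds this gives 892 training compounds and 223 test compounds, matching the counts quoted above.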
The problem at this stage is to find the multivariate linear model that has the highest accuracy
with the smallest possible number of molecular descriptors. One of the best algorithms for this
type of problem was proposed by Leardi et al. [9]. To apply this algorithm, a program was written in
MATLAB (Mathworks Inc. software). This program finds the best multivariate linear model by genetic
algorithm-based multivariate linear regression (GA-MLR), which was proposed by Leardi et al. [9]
and which we have used successfully in our previous work [10-12]. The inputs of this program are
the molecular descriptors obtained in the previous section and the desired number of parameters
of the multivariate linear model. The fitness function of our program was the cross-validated
coefficient. To obtain the best model, we must consider how the cross-validated coefficient
increases as the number of molecular descriptors increases. When the cross-validated coefficient
remains nearly constant as further molecular descriptors are added, the search is stopped and the
best result has been obtained.
To obtain the best multivariate linear model, we first started with one-descriptor models and found
the best one; then two-descriptor models were tested, and the best two-descriptor model was found.
This procedure was repeated with an increasing number of descriptors until a further increase in
the number of molecular descriptors no longer affected the accuracy of the best model. The best
model obtained has six parameters (five descriptors plus an intercept) and is given as Eq. (2),
where the molecular descriptors of Eq. (2) and their meanings are presented in Table 1.
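The search described above can be illustrated with a small genetic algorithm that selects a fixed number of descriptors and uses the leave-one-out $Q^2$ as fitness. This is a simplified sketch, not the algorithm of Leardi et al. [9] and not the authors' MATLAB program. The function names, the population settings, the crossover and mutation operators, and the synthetic data are all assumptions.

```python
import numpy as np


def q2_loo(X, y):
    """Leave-one-out Q^2 of a least-squares model with intercept (closed form)."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                       # hat (projection) matrix
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))    # residual of each left-out point
    return 1.0 - np.sum(loo_resid**2) / np.sum((y - y.mean())**2)


def ga_select(X, y, k, pop_size=30, generations=40, p_mut=0.3, seed=0):
    """Search for the k-descriptor subset with the highest leave-one-out Q^2."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    cache = {}

    def fitness(subset):
        if subset not in cache:
            cache[subset] = q2_loo(X[:, sorted(subset)], y)
        return cache[subset]

    pop = [frozenset(rng.choice(n_desc, size=k, replace=False).tolist())
           for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: the better half survives unchanged and acts as parents.
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            pool = sorted(parents[i] | parents[j])  # crossover: mix both parents
            child = set(rng.choice(pool, size=k, replace=False).tolist())
            if rng.random() < p_mut:                # mutation: swap one descriptor
                child.discard(int(rng.choice(sorted(child))))
                outside = [d for d in range(n_desc) if d not in child]
                child.add(int(rng.choice(outside)))
            children.append(frozenset(child))
        pop = parents + children
    best = max(pop, key=fitness)
    return sorted(best), fitness(best)


# Synthetic data: the response depends only on descriptors 2 and 7.
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.normal(size=(60, 8))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 7] + 0.1 * rng_demo.normal(size=60)
best, best_q2 = ga_select(X_demo, y_demo, k=2)
print(best, round(best_q2, 4))
```

To mimic the stopping rule in the text, `ga_select` would be called for k = 1, 2, 3, and so on, and the search would stop once the best $Q^2$ no longer increases appreciably with k.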
The statistical parameters of the fit of Eq. (2) are the following: $R^2 = 0.9830$, $F = 10239.02$,
$s = 58.541$, $Q^2 = 0.9826$, where $R^2$ is the squared correlation coefficient, $F$ is the Fisher
factor, $s$ is the standard deviation, and $Q^2$ is the squared cross-validated correlation
coefficient. The statistical parameters of the coefficients of Eq. (2) are presented in Table 2.
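As a consistency check on these statistics, the Fisher factor of a linear regression with an intercept can be recomputed from $R^2$, the number of samples, and the number of descriptors. The sketch below assumes that the reported values refer to the 892 training compounds and the five descriptors; this is inferred from the text rather than stated with the statistics.

```python
def fisher_f(r2, n_samples, n_descriptors):
    """F statistic of a linear regression with an intercept."""
    explained = r2 / n_descriptors
    unexplained = (1.0 - r2) / (n_samples - n_descriptors - 1)
    return explained / unexplained


# Assumption: statistics computed on the 892 training compounds, 5 descriptors.
f_value = fisher_f(r2=0.9830, n_samples=892, n_descriptors=5)
print(round(f_value, 1))
```

The result is close to the reported $F = 10239.02$; the small gap is what one would expect from $R^2$ being rounded to four decimals.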
Table 2. The values of the constants of Eq. (2) and their statistical interpretations.
Many techniques are available for validating the obtained model [13].
Todeschini et al. [13] presented a quick rule for checking the validity of an obtained model. This
rule compares the multivariate correlation index $K_X$ of the X-block of predictor variables with the
multivariate correlation index $K_{XY}$ obtained from the X-block matrix augmented with the column
of the response variable. According to this rule, if $K_{XY}$ is greater than $K_X$, the model is
predictive [13]. The values of these two indexes in our problem are $K_X = 31.62$ and
$K_{XY} = 40.81$; by this quick rule, the obtained model is therefore predictive ($K_{XY} > K_X$).
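The quick rule can be sketched as below. The sketch assumes the usual eigenvalue-based definition of the multivariate correlation index, $K = \sum_j |\lambda_j / \sum_j \lambda_j - 1/p| \, / \, [2(p-1)/p]$, where $\lambda_j$ are the eigenvalues of the correlation matrix of the $p$ columns; this definition should be confirmed against [13]. It returns K on a 0 to 1 scale, whereas the values quoted above appear to be percentages. The function names are hypothetical.

```python
import numpy as np


def k_index(M):
    """Multivariate correlation index of the columns of M (0 = uncorrelated, 1 = collinear)."""
    corr = np.corrcoef(np.asarray(M, dtype=float), rowvar=False)
    eig = np.linalg.eigvalsh(corr)           # eigenvalues of the correlation matrix
    p = len(eig)
    return np.sum(np.abs(eig / eig.sum() - 1.0 / p)) / (2.0 * (p - 1) / p)


def passes_quick_rule(X, y):
    """Quick rule: the model is taken as predictive if K_XY > K_X."""
    return k_index(np.column_stack([X, y])) > k_index(X)


# Two limiting cases: perfectly collinear and perfectly uncorrelated columns.
X_collinear = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
X_uncorrelated = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
k_collinear = k_index(X_collinear)
k_uncorrelated = k_index(X_uncorrelated)
print(k_collinear, k_uncorrelated)
```

Adding a response column that is well described by the predictors raises the overall correlation of the augmented matrix, which is the intuition behind requiring $K_{XY} > K_X$.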
Cross-validation is the most common validation technique [13]. In this technique, each
member of the data set is deleted in turn, a model is built from the remaining members, and the
value of the deleted object is predicted. This is repeated for all members of the data set, and
finally a squared cross-validated correlation coefficient is obtained. In our problem, the squared
cross-validated correlation coefficient ($Q^2$) was 0.9826. The difference between $R^2$ and $Q^2$
is small (0.0004), and thus the validity of the model is supported by this technique.
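The leave-one-out procedure described above can be written out explicitly as a loop. The sketch below uses ordinary least squares on synthetic data; the function names and the data are assumptions, and the loop is written for clarity rather than speed.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients; the first entry is the intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def r_squared(X, y):
    resid = y - predict(fit_ols(X, y), X)
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)


def q2_leave_one_out(X, y):
    """Delete each object in turn, refit, and predict the deleted object."""
    n = len(y)
    press = 0.0                                   # predictive residual sum of squares
    for i in range(n):
        mask = np.arange(n) != i
        coef = fit_ols(X[mask], y[mask])
        press += (y[i] - predict(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean())**2)


rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(40, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 0.2 * rng_demo.normal(size=40)
r2_demo = r_squared(X_demo, y_demo)
q2_demo = q2_leave_one_out(X_demo, y_demo)
print(round(r2_demo, 4), round(q2_demo, 4))
```

For ordinary least squares, $Q^2$ computed this way can never exceed $R^2$, so a small gap between the two indicates that no single compound dominates the fit.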
Another validation technique is the bootstrap [13]. In this technique, validation is performed
by randomly generating training sets with sample repetitions and then evaluating the predicted
responses of the samples not included in the training set. This procedure is usually repeated
thousands of times. After 5000 repetitions, the parameter $Q^2_{Boot}$ was 0.9823. The differences
among $Q^2_{Boot}$, $Q^2$, and $R^2$ are small, and thus the predictive power of the model is
confirmed.
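A bootstrap validation of this kind can be sketched as below. The way the squared errors are pooled over all repetitions, the use of the training-set mean as reference, the function name, and the synthetic data are assumptions; the exact definition of $Q^2_{Boot}$ used in the paper follows [13] and may differ in detail.

```python
import numpy as np


def q2_bootstrap(X, y, n_rounds=200, seed=0):
    """Bootstrap Q^2: fit on resampled sets, score on the objects left out."""
    rng = np.random.default_rng(seed)
    n = len(y)
    press = 0.0      # squared prediction errors pooled over all rounds
    tss = 0.0        # reference sum of squares pooled over all rounds
    for _ in range(n_rounds):
        train = rng.integers(0, n, size=n)              # sampling with repetition
        left_out = np.setdiff1d(np.arange(n), train)    # objects not drawn this round
        if left_out.size == 0:
            continue
        A = np.column_stack([np.ones(n), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = coef[0] + X[left_out] @ coef[1:]
        press += np.sum((y[left_out] - pred) ** 2)
        tss += np.sum((y[left_out] - y[train].mean()) ** 2)
    return 1.0 - press / tss


rng_demo = np.random.default_rng(3)
X_demo = rng_demo.normal(size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + 0.1 * rng_demo.normal(size=50)
q2_boot_demo = q2_bootstrap(X_demo, y_demo)
print(round(q2_boot_demo, 4))
```

The default of 200 rounds keeps the example fast; the study reports 5000 repetitions.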
Finally, the last validation technique we used was external validation. Here, the predictive
power of Eq. (2) was checked by means of the test set that had been separated from the original
data set. The squared cross-validated coefficient for the test set ($Q^2_{ext}$) is 0.9894; the
small difference between this value and the value of $Q^2$ indicates the predictive power of Eq. (2).
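For reference, one common way to define the external coefficient is written out below. Several variants exist in the literature (they differ mainly in which mean is used in the denominator), and the text does not state which one was used, so this form is an assumption. Here $y_i$ is the DIPPR 801 value of test compound $i$, $\hat{y}_i$ is the value predicted by Eq. (2), and $\bar{y}_{train}$ is the mean of the training-set values.

```latex
Q^2_{ext} = 1 - \frac{\sum_{i \in \text{test}} \left( y_i - \hat{y}_i \right)^2}
                     {\sum_{i \in \text{test}} \left( y_i - \bar{y}_{train} \right)^2}
```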
The calculated and DIPPR 801 values of $\Delta H_f^{\circ}$ for the training set are presented in
Table 3. The predicted and DIPPR 801 values of $\Delta H_f^{\circ}$ for the test set are presented in
Table 4. The comparison between the results of Eq. (2) and the DIPPR 801 values for the training
set and the test set is shown in Figure 1.
Figure 1. Comparison between the values of $\Delta H_f^{\circ}$ calculated from Eq. (2) and the
DIPPR 801 values for the training set and the prediction (test) set. [Scatter plot:
$\Delta H_f^{\circ}$ from DIPPR 801 (kJ/mol) on the horizontal axis versus $\Delta H_f^{\circ}$
calculated from Eq. (2) (kJ/mol) on the vertical axis, both ranging from −8000 to 1000; legend:
Training set, Prediction set.]
3. Discussion
In the formation of a molecule from its constituent elements, $\Delta H_f^{\circ}$ is the difference
between the enthalpy of the molecule and that of the elements that constitute it. This enthalpy
results from breaking the bonds of the elements in their free form (breaking reaction) and forming
new bonds in the product molecule (formation reaction). The breaking reaction is endothermic,
whereas the formation reaction is exothermic.
Anything that affects the bond properties and the strength of the bonds in a molecule can affect
the value of $\Delta H_f^{\circ}$ of that molecule. In particular, the number of atoms, the number
of bonds, the order of the bonds, and the number of non-organic elements (heavy atoms) in a molecule
directly affect the value of $\Delta H_f^{\circ}$.
An increase in the number of atoms in the H-depleted chemical structure of a molecule decreases
$\Delta H_f^{\circ}$ of the molecule. An increase in the order of the bonds in a molecule increases
$\Delta H_f^{\circ}$. The number of atoms that commonly occur in molecules, such as oxygen and
fluorine atoms, and even heavy atoms, also affects $\Delta H_f^{\circ}$ of a molecule. Increase in
the number of these atoms in a molecule,
Table 3. The obtained results from Eq. (2) for training set.
4. Conclusions
In the present study, a simple five-descriptor linear model was presented. This model is the
result of a QSPR study on the standard enthalpy of formation of 1115 compounds. Because these
compounds were selected from all families of compounds, there is no specific limit on the
application of the model. Its simplicity of use is another advantage of the model.
All molecular descriptors of the model can be easily calculated from the chemical structure of a
molecule.
Table 4. The $\Delta H_f^{\circ}$ values predicted by Eq. (2) for the test set as an excluded data set.
ID | Name | $\Delta H_f^{\circ}$ (kJ/mol), DIPPR 801 | $\Delta H_f^{\circ}$ (kJ/mol), calculated from Eq. (2) | Res
Acknowledgment
The authors gratefully acknowledge Mr. Reza Barzin of the University of California (San Diego) for
his help with this project.
References