MSc Thesis
Dajka Bence
Consultant:
Acknowledgement
I would like to thank Gábor Ribárik for helping me to prepare and write
the thesis.
Contents

1 Theory of X-ray Diffraction
1.1 Geometry of Crystalline Materials
1.2 Elements of X-ray Diffraction
1.3 X-ray Line Broadening
1.3.1 Size broadening
1.3.2 Strain broadening
4.4 Neural Networks
4.4.1 Structure of Neural Networks
4.4.2 Working Principles of Neural Networks
4.4.3 Backpropagation and Stochastic Gradient Descent
5 Results
5.1 Out-of-core learning
5.2 Dimensionality reduction
5.3 Out of the box methods
5.4 Ensemble methods
5.5 Neural networks
5.6 Serial evaluation
6 Conclusion
List of Figures

1.1 Profile function for large number of atoms
1.2 Size function with fixed σ
1.3 Size function with fixed m
1.4 Dislocation configurations
1.5 Strain profile
2.1 Al6Mg6 XRD peak pattern
2.2 CMWP fit example

List of Tables

2.1 Parameters for the theoretical XRD pattern generation.
1 Theory of X-ray Diffraction
This chapter lays the groundwork for understanding the fundamental principles
and concepts that underlie the research presented in this thesis. Its purpose is to
provide a comprehensive theoretical framework that enables a deeper understanding
of the phenomena, models, and methodologies employed throughout the study.
1.1 Geometry of Crystalline Materials

Crystalline materials are built up by the periodic repetition of a structural unit: every point r of the crystal has an equivalent point at

r′ = r + n1 a1 + n2 a2 + n3 a3 = r + R,    (1.1)

where r and r′ represent identical positions. Here, a1, a2, and a3 are the primitive lattice vectors, and the R vectors are the lattice vectors formed by combining the primitive lattice vectors as R = Σ_{i=1}^{3} ni ai, where ni are integers.
The volume of the elementary cell spanned by the primitive lattice vectors is

V0 = a1 · (a2 × a3).    (1.2)
Within the elementary cell, the positions, types, and numbers of atoms (ions,
molecules) are determined. These can range from one to several thousand, depend-
ing on the complexity of the crystal. The combination of the elementary cells and
the atoms within them is referred to as the basis. The basis repeats periodically
in crystals, and the endpoints of the lattice vectors define a lattice structure. By
knowing the lattice structure and the basis, we can define the crystal structure.
The repeated spatial patterns present in the crystal structure are referred to
as the symmetries of the crystal. Symmetry elements, such as rotations and reflec-
tions, are used to characterize these patterns. In two dimensions, there are ten such
elements, which give rise to plane point groups. By combining these elements and
considering the symmetries arising from sliding operations, a total of 17 plane groups
can be obtained. Expanding the discussion of symmetries to three dimensions, we
have 230 space groups derived from the 17 plane groups, and 32 point groups
derived from the ten plane point groups. Additionally, there are 14 distinct lattice
structures obtained from the five different plane lattices. These are known as Bra-
vais lattices. The space groups characterize the lattice structures, while the Bravais
lattices describe the elementary cells. To distinguish these symmetry groups, they
are denoted by letter or number symbols. For example, m originates from the word
mirror and denotes reflection for symmetry elements, while p refers to primitive
lattices for lattice types.
The 14 distinct Bravais lattices can be classified into 7 crystal systems. These
systems include cubic, tetragonal, orthorhombic, rhombohedral, hexagonal, mono-
clinic, and triclinic Bravais lattices. The specific crystal system to which a crystal
belongs is determined by the lengths of the sides of its elementary cell and the angles
between them.
Materials with a crystalline structure exhibit long-range periodicity in their
atomic arrangement. To characterize the lattice structure, we introduce a concept
called the reciprocal lattice. Instead of using the lattice vectors a1 , a2 , and a3 , we
define reciprocal lattice vectors b1, b2, and b3 with a unique correspondence such
that
ai · bj = 2πδij , (1.3)
where δij is the Kronecker delta. The reciprocal lattice vectors can be obtained
as follows:
b1 = 2π (a2 × a3) / V0,    (1.4)

b2 = 2π (a3 × a1) / V0,    (1.5)

b3 = 2π (a1 × a2) / V0,    (1.6)
where the volume of the elementary cell is

V0 = a1 · (a2 × a3).    (1.7)
The volume of the reciprocal lattice cell, denoted as VB , is given by
VB = b1 · (b2 × b3) = 8π³ / V0.    (1.8)
The reciprocal lattice vectors can be expressed as linear combinations of the
reciprocal lattice basis vectors:
K = Σ_{j=1}^{3} Kj bj,  where Kj = 0, ±1, ±2, . . .    (1.9)
The reciprocal lattice vectors span a reciprocal lattice in the reciprocal space,
and their scalar products with the corresponding lattice vectors are integer multiples
of 2π .
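As an illustrative numerical check of equations (1.3)–(1.8), the following short Python sketch (using NumPy; the triclinic lattice vectors are arbitrary example values, not data from this work) constructs the reciprocal lattice vectors and verifies the relation ai · bj = 2πδij:

import numpy as np

# Hypothetical primitive lattice vectors of a triclinic cell, in nm.
a1 = np.array([0.40, 0.00, 0.00])
a2 = np.array([0.05, 0.42, 0.00])
a3 = np.array([0.02, 0.03, 0.38])

V0 = np.dot(a1, np.cross(a2, a3))           # cell volume, eq. (1.7)
b1 = 2 * np.pi * np.cross(a2, a3) / V0      # eq. (1.4)
b2 = 2 * np.pi * np.cross(a3, a1) / V0      # eq. (1.5)
b3 = 2 * np.pi * np.cross(a1, a2) / V0      # eq. (1.6)

# Check a_i . b_j = 2*pi*delta_ij, eq. (1.3)
A = np.vstack([a1, a2, a3])
B = np.vstack([b1, b2, b3])
print(np.allclose(A @ B.T, 2 * np.pi * np.eye(3)))    # True

# Volume of the reciprocal cell, eq. (1.8)
VB = np.dot(b1, np.cross(b2, b3))
print(np.isclose(VB, 8 * np.pi**3 / V0))              # True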
Planes containing lattice points are crucial for characterizing the lattice structure and are referred to as lattice planes. Lattice planes that are equidistant and parallel to each other form a lattice plane set. These plane sets are characterized by Miller indices, denoted as (hkl). To determine the Miller indices, we consider the coordinate system formed by the three lattice vectors ai (i = 1, 2, 3) and the intercepts of a lattice plane with its axes. The reciprocals of these intercept distances, multiplied by the least common multiple of the distances, yield the integer Miller indices (hkl).
The reciprocal lattice vectors and lattice plane sets are uniquely related. Specifically, the vector (a1/h − a2/k) lies within the (hkl) lattice plane. The scalar product
of this vector with the reciprocal lattice vector Ghkl is zero, which also applies to
the vectors (a2 /k − a3 /l) and (a3 /l − a1 /h):
Khkl · (a1/h − a2/k) = Σ_{j=1}^{3} Kj bj · (a1/h − a2/k) = 0.    (1.10)
The distance dhkl between two neighboring (hkl) lattice planes can be obtained by expressing the distance of the plane intersected by a1/h, a2/k, and a3/l from the origin. This can be calculated by finding the projection of the vector a1/h onto the normal to the (hkl) plane, denoted as Khkl:

dhkl = (a1/h) · Khkl/|Khkl| = (a1/h) · (h b1 + k b2 + l b3)/|Khkl| = 2π/|Khkl|.    (1.11)
Thus, the reciprocal lattice vectors and lattice plane sets uniquely determine
each other in reciprocal space. They provide essential information about the lattice
structure in solid-state physics.
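As a small worked example of equation (1.11), the sketch below evaluates dhkl numerically for a cubic cell; the lattice constant is an arbitrary placeholder, and the result is compared with the familiar cubic formula dhkl = a/√(h² + k² + l²):

import numpy as np

a = 0.405                                   # hypothetical cubic lattice constant in nm
a1, a2, a3 = a * np.eye(3)                  # cubic primitive vectors

V0 = np.dot(a1, np.cross(a2, a3))
b1 = 2 * np.pi * np.cross(a2, a3) / V0
b2 = 2 * np.pi * np.cross(a3, a1) / V0
b3 = 2 * np.pi * np.cross(a1, a2) / V0

def d_hkl(h, k, l):
    """Interplanar spacing from eq. (1.11): d = 2*pi / |K_hkl|."""
    K = h * b1 + k * b2 + l * b3
    return 2 * np.pi / np.linalg.norm(K)

for (h, k, l) in [(1, 1, 1), (2, 0, 0), (2, 2, 0)]:
    print((h, k, l), d_hkl(h, k, l), a / np.sqrt(h**2 + k**2 + l**2))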
The phase shift during diffraction is consistent among different scattering centers, ensuring coherent scattering.
(k0 − k) · r = −G · r, (1.12)
where G is the reciprocal lattice vector. The amplitude of the radiation in the
direction of k is given by:
A(K) = A0 e^{−iK·r},  with K = k − k0,    (1.13)
where A0 depends on the scattering strength and the intensity of the incident
beam. When considering multiple scattering centers, the resulting amplitude can be
expressed as the sum of the individual amplitudes:
A(K) = Σ_n ( Σ_p fp e^{−iK·rp} ) e^{−iK·Rn},    (1.14)
where Rn is the vector from the origin O to the nth lattice cell, and rp is the
position vector of the pth atom in the primitive cell. The term within the parentheses
is known as the structure factor, which describes the scattering of a cell. It can be
written as:
F(K) = Σ_p fp e^{−iK·rp},    (1.15)
where fp represents the scattering contribution of the pth atom in the primitive
cell, depending on the atom type and the directions of the incident and scattered
beams. Substituting this into the equation for A(K), we have:
A(K) = F(K) Σ_n e^{−iK·Rn}.    (1.16)
The intensity of the radiation can be obtained by squaring the absolute value of the amplitude:

I(K) = |A(K)|² = |F(K)|² |Σ_n e^{−iK·Rn}|².    (1.17)

The lattice sum is maximal when K · Rn is an integer multiple of 2π for every lattice vector Rn, that is, when the scattering vector equals a reciprocal lattice vector:

k − k0 = Khkl,    (1.18)

where Khkl = Ghkl = h b1 + k b2 + l b3 is the scattering vector corresponding to the (hkl) Miller indices. During X-ray diffraction measurements, the detector rotates
around the sample to investigate the scattering of rays, allowing determination of
the reciprocal lattice vectors Ghkl , which characterize the lattice structure.
The equation k − k0 = Khkl can be expressed in another form. The difference between the wave vectors k and k0, calculated from a common starting point, is equal to the reciprocal lattice vector Khkl. The magnitudes of the two wave vectors are 2π/λ, while the magnitude of the reciprocal lattice vector is 2π/dhkl. In the right-angled triangle formed by these three vectors, the relationship sin θ = |Khkl|/(2|k|) = λ/(2 dhkl) holds, which is equivalent to Bragg's law, 2 dhkl sin θ = λ.
The width of a diffraction profile I(s) with maximum intensity I0 can be characterized by its full width at half maximum,

FWHM(I(s)) = s2 − s1,  where s1 < s2 and I(s1) = I(s2) = I0/2.    (1.22)
The integral breadth (the area under the intensity curve normalized by its maximum) is defined as:

β = (1/I0) ∫_{−∞}^{∞} I(s) ds.    (1.23)
I(s) ∼ sin²(Nx) / sin²(x),    (1.24)
where x = πGa, G = g + ∆g, g is the diffraction vector, ∆g is a small vector, and a is the unit cell vector perpendicular to the plane of the crystallite. The function sin²(Nx)/sin²(x) represents the shape and position of the peaks in this particular case. Figure 1.1 illustrates this function for different values of N. It reaches a maximum value at positions x = nπ, where n ∈ Z. This condition is equivalent to the Laue equations. The maximum value of this function is given by lim_{x→0} sin²(Nx)/sin²(x) = N².
For large values of N, the profile function can be approximated by the following simple expression:

sin²(Nx) / sin²(x) ≈ N² ( sin(Nx) / (Nx) )².    (1.25)
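The quality of the approximation (1.25) can be checked numerically; the sketch below (NumPy, with an arbitrary example value of N) compares the interference function (1.24) with its large-N approximation near the central maximum:

import numpy as np

N = 50                                    # hypothetical number of lattice planes
x = np.linspace(1e-6, 0.2, 1000)          # avoid x = 0, where both forms equal N**2

exact = np.sin(N * x)**2 / np.sin(x)**2           # eq. (1.24)
approx = N**2 * (np.sin(N * x) / (N * x))**2      # eq. (1.25)

# Maximum absolute deviation, measured relative to the peak height N**2.
print(np.max(np.abs(exact - approx)) / N**2)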
The theoretical description also requires the Fourier transform of the intensity profile function sin²(Nx)/x², which is proportional to π(N − π|L|) for |L| ≤ N/π and vanishes for |L| > N/π.
For an infinite plane crystallite with a thickness of N atoms, the Fourier transform of the size function, I(s), is given by:
Figure 1.1: The sin²(Nx)/sin²(x) function, close to its first maximum, for different N values.
A^S(L) = (N/LG,a) (N − (N/LG,a) |L|)   if |L| ≤ LG,a,
A^S(L) = 0                              if |L| > LG,a,
where LG,a can be determined from the initial slope of the Fourier transform. The size parameter L0 is generally defined for an arbitrary I(s) intensity profile by the initial slope of its A^S(L) Fourier transform:

dA^S(L)/dL |_{L=0} = − A^S(0)/L0.    (1.26)
The crystallite size in many materials is well described by a lognormal distribution, making it widely employed in microstructural investigations. In this distribution, the logarithm of the crystallite size follows a normal distribution. The density function of the lognormal size distribution can be expressed as:
f(x) = 1/(√(2π) σ x) exp( −(log(x/m))² / (2σ²) ),    (1.27)
where m and σ are the distribution parameters. The quantity log m is the median and σ is the standard deviation of the underlying normal distribution of log x. By convention, m and σ are referred to as the median and the variance of the lognormal size distribution, respectively.
The volume sum of the columns with a height between M and M + dM from all
crystallites can be expressed as:
g(M) dM = Σ_j dVj(M, dM).    (1.30)
Thus, the size profile can be obtained by determining g(M)dM, which depends
on the shape and size distribution of the crystallites.
For spherical crystallites and the lognormal size distribution, g(M )dM can be
approximated by the volume of the part of a sphere with a column length between
M and M + dM :
Using the distribution density function given in equation 1.27, this integral can
be expressed as:
∫_M^∞ f(x) dx = (1/2) erfc( log(M/m) / (√2 σ) ),    (1.35)
where erfc(x) is the complementary error function defined as:

erfc(x) = (2/√π) ∫_x^∞ e^{−t²} dt.    (1.36)
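The identity (1.35) between the lognormal tail integral and the complementary error function (1.36) can be verified numerically with SciPy; the values of m, σ, and M below are arbitrary examples:

import numpy as np
from scipy.integrate import quad
from scipy.special import erfc

m, sigma = 30.0, 0.4        # hypothetical median (nm) and lognormal variance parameter
M = 45.0                    # column length at which the tail is evaluated

def f(x):
    """Lognormal density of eq. (1.27)."""
    return np.exp(-np.log(x / m)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma * x)

tail_numeric, _ = quad(f, M, np.inf)
tail_erfc = 0.5 * erfc(np.log(M / m) / (np.sqrt(2) * sigma))
print(tail_numeric, tail_erfc)    # the two values agree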
Figure 1.2: The size function for spherical crystallites with lognormal distribution with fixed value of σ = 0.71, as a function of s.
Here, g represents the absolute value of the diffraction vector, and ⟨ε²_{g,L}⟩ denotes
Figure 1.3: The size function for spherical crystallites with lognormal distribution with fixed value of m = 2.72 nm, as a function of s.
the mean square strain, which depends on the atom displacements from their ideal
positions. The spatial averaging is denoted by the brackets.
Wilkens introduced the effective outer cutoff radius of dislocations, denoted as Re∗, as a length parameter instead of the crystal diameter. This modification eliminates the logarithmic singularity in the expression for the mean square strain. The crystal is divided into separate regions with a diameter of Re∗, where randomly distributed screw dislocations exist. Within each region, the distribution of dislocations is completely random with a density of ρ, and there is no interaction between dislocations outside these regions.
To characterize the dislocation arrangement, we can introduce the dimensionless
parameter M ∗ :
M∗ = Re∗ √ρ.    (1.41)
The value of M ∗ indicates the strength of correlation between dislocations: a
small M ∗ value implies a strong correlation, while a large M ∗ value indicates a
random distribution of dislocations within the crystallite. Figure 1.4 shows two dislocation configurations: in the first, the dislocations are strongly correlated and the value of M∗ is small; in the second, the correlation is weak and M∗ is large. Figure 1.5 presents the strain profile for fixed ρ and variable M∗ values.
Figure 1.5: The shape of the strain profile for fixed ρ and variable M∗ values.
For cubic crystals, the average dislocation contrast factor can be written as:

C = Ch00 (1 − qH²),    (1.42)

where

H² = (h²k² + h²l² + k²l²) / (h² + k² + l²)².    (1.43)
For hexagonal crystals, the average contrast factor is given by:

C = Chk0 ( 1 + a1H² / (1 + a2H²) ),    (1.44)

where a1 and a2 are additional parameters.
These expressions provide a means to describe and understand the strain anisotropy observed in diffraction experiments for different crystal structures.
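As a short worked example of equations (1.42)–(1.43), the sketch below evaluates H² and the average contrast factor for a few cubic reflections; the values of Ch00 and q are arbitrary placeholders rather than fitted values from this work:

import numpy as np

C_h00 = 0.3    # hypothetical average contrast factor of the h00 reflections
q = 1.7        # hypothetical strain-anisotropy parameter

def contrast_factor_cubic(h, k, l):
    """Average dislocation contrast factor for a cubic crystal, eqs. (1.42)-(1.43)."""
    H2 = (h**2 * k**2 + h**2 * l**2 + k**2 * l**2) / (h**2 + k**2 + l**2)**2
    return C_h00 * (1 - q * H2)

for hkl in [(1, 1, 1), (2, 0, 0), (2, 2, 0), (3, 1, 1)]:
    print(hkl, contrast_factor_cubic(*hkl))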
During the measurement, a single wavelength beam is scattered by the sample and reaches the detector. In the next section the CMWP method will be discussed; this was the method used to generate the theoretical Al6Mg6 XRD peak patterns.
2.1 CMWP
The Convolutional Multiple Whole Profile (CMWP) fitting procedure, developed as a computer program, enables the determination of microstructural parameters from diffraction profiles of materials with cubic, hexagonal, or orthorhombic crystal lattices. However, its principles can be applied to all crystal systems. The method operates directly on the measured intensity pattern without requiring profile separation. Additionally, there is no need to correct the measured data for instrumental effects. Instead, the instrumental effect is incorporated into the theoretical pattern through convolution, thereby avoiding numerical division of small numbers.
The fitting process involves directly modeling the entire measured powder diffraction pattern as the sum of a background function and profile functions obtained through convolution. These profile functions are derived from ab-initio theoretical functions representing size, strain, and planar faults, combined with the measured instrumental profiles. The convolutional equation for a specific reflection (hkl) can be expressed as:
I_theoretical(2θ) = BG(2θ) + Σ_hkl I_MAX^hkl I^hkl(2θ − 2θ_0^hkl).    (2.1)
Here, BG(2θ) represents the background, I_MAX^hkl is the peak intensity, and 2θ_0^hkl denotes the 2θ value at the peak center. The theoretical profile for the hkl reflection, I^hkl, is obtained by convolving the measured instrumental profile (I_instr.) with the ab-initio profile functions, including the size profile (I_size), the theoretical strain profile for dislocations (I_disl.), and the theoretical profile function for planar faults (I_pl. faults):

I^hkl = I_instr. ∗ I_size ∗ I_disl. ∗ I_pl. faults.    (2.2)
The CMWP method performs the convolution in Fourier space, utilizing the analytical form of the Fourier transforms of the profiles. This approach is advantageous in terms of computational efficiency compared to directly applying the definition of convolution. By inverse Fourier-transforming the product of the theoretical Fourier transforms and the complex Fourier transform of the corresponding measured instrumental profiles, the theoretical intensity profiles (I^hkl) are obtained. The fitting procedure can provide various microstructural parameters depending on the effects included, such as the median and variance of the size distribution, ellipticity of crystallites, density and arrangement of dislocations, strain anisotropy parameters, and the probability of planar faults. An example of a profile fit with the CMWP software can be seen in Figure 2.2.
Figure 2.2: XRD pattern fit of an Al6Mg6 sample with the CMWP software.
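The Fourier-space convolution that CMWP relies on rests on the convolution theorem, which the following minimal NumPy sketch illustrates; it is only a schematic of the idea, not the CMWP implementation, and the Gaussian and Lorentzian curves are arbitrary stand-ins for the size, strain, and instrumental sub-profiles:

import numpy as np

# Arbitrary example grid in the profile variable (e.g. s or 2-theta).
x = np.linspace(-1.0, 1.0, 2048)

size_profile = np.exp(-x**2 / (2 * 0.02**2))        # stand-in for the size profile
strain_profile = 1.0 / (1.0 + (x / 0.05)**2)        # stand-in for the strain profile

# Convolution theorem: the linear convolution equals the inverse transform of the
# product of the zero-padded Fourier transforms.
n = len(size_profile) + len(strain_profile) - 1
conv_fourier = np.real(np.fft.ifft(np.fft.fft(size_profile, n) * np.fft.fft(strain_profile, n)))
conv_direct = np.convolve(size_profile, strain_profile, mode='full')

print(np.allclose(conv_fourier, conv_direct))    # True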
For the thesis dataset, 1,591,200 theoretical Al6Mg6 XRD patterns were created. The parameter values used for the data generation are listed in Table 2.1; the dataset corresponds to all combinations of these values (18 × 20 × 20 × 17 × 13 = 1,591,200). The values themselves are arbitrary, derived from experience with similar material samples, and can be changed for other tasks.
2.2 Preprocessing
Machine learning methods require careful handling of the training data, as values
on different scales can result in inaccurate models and wrong conclusions. Therefore,
it is essential to preprocess the XRD peak intensities before training any model with
them. In addition, certain data exploration methods, such as dimension reduction,
Table 2.1: Parameters for the theoretical XRD pattern generation.
a b c d e
-3 5 0.05 0.1 0.01
-2.5 10 0.1 0.25 0.025
-2 15 0.15 0.5 0.05
-1.5 20 0.2 0.75 0.075
-1 25 0.25 1 0.1
-0.5 30 0.3 2.5 0.25
0 35 0.35 5 0.5
0.25 40 0.4 7.5 0.75
0.5 45 0.45 10 1
0.75 50 0.5 25 1.25
1 60 0.6 50 1.5
1.25 70 0.7 75 1.75
1.5 80 0.8 100 2
1.75 90 0.9 250
2 100 1 500
2.25 110 1.1 750
2.5 120 1.2 1000
2.75 130 1.3
140 1.4
150 1.5
may require the data to have zero mean and unit variance, which can also be achieved
through preprocessing. In this thesis, the two most important data handling concepts
are normalization and standardization.
2.2.1 Normalization
Normalization, also known as min-max scaling, is a data preprocessing technique
that scales the features of a dataset to a fixed range, usually between 0 and 1. This is achieved by subtracting the minimum value from each feature and then dividing it by the range (i.e., the difference between the maximum and minimum values). Mathematically, the normalization of a feature x is given by:

xnorm = (x − min(x)) / (max(x) − min(x)),    (2.3)
where xnorm represents the normalized value of x. The resulting normalized values
will lie within the range [0, 1].
Normalization is particularly useful when the absolute values of the features do
not matter, but their relative positions or proportions are important. It ensures that
all features have a similar scale, preventing any particular feature from dominating
the others during model training.
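A minimal sketch of equation (2.3) in Python; the small feature matrix is an arbitrary example, and in practice the same transformation is available as MinMaxScaler in scikit-learn:

import numpy as np

X = np.array([[120.0, 3.0],
              [450.0, 7.0],
              [900.0, 5.0]])             # hypothetical feature matrix (rows = samples)

X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)   # eq. (2.3), applied column-wise

print(X_norm.min(axis=0), X_norm.max(axis=0))   # [0. 0.] and [1. 1.]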
2.2.2 Standardization
Standardization, also known as z-score normalization or feature scaling, is a data
preprocessing technique that transforms the features of a dataset to have zero mean
and unit variance. This is achieved by subtracting the mean value from each feature
and then dividing it by the standard deviation. Mathematically, the standardization
of a feature x is given by:
xstd = (x − mean(x)) / std(x),    (2.4)

where xstd represents the standardized value of x. The resulting standardized values have zero mean and unit variance.
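A corresponding sketch of equation (2.4), using the scikit-learn StandardScaler on an arbitrary example matrix and comparing it with the manual formula:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[120.0, 3.0],
              [450.0, 7.0],
              [900.0, 5.0]])                           # hypothetical feature matrix

X_std_manual = (X - X.mean(axis=0)) / X.std(axis=0)    # eq. (2.4), column-wise
X_std_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_std_manual, X_std_sklearn))        # True
print(X_std_sklearn.mean(axis=0), X_std_sklearn.std(axis=0))   # ~[0, 0] and [1, 1]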
the concept of multi-output regression, which extends the regression framework to
handle multiple dependent variables.
Y = f (X) + ε (3.1)
where Y is the response variable, f (X) is the regression function, and ε is a
random error term representing the variability that is not explained by the predic-
tors. The regression function f (X) is typically estimated using various statistical
methods, such as ordinary least squares (OLS) or maximum likelihood estimation
(MLE), to minimize the discrepancy between the observed values of Y and the
predicted values from f (X).
Y = f (X) + ε (3.2)
where f (X) = (f1 (X), f2 (X), . . . , fp (X)) represents the vector-valued regression
function that maps the predictors to the response variables. Each component func-
tion fi (X) estimates the relationship between the predictors and the ith response
variable Yi . The error term ε captures the unexplained variability in the response
variables.
where Y represents the true values of the response variables, Ŷ represents the
predicted values, n is the number of observations, and p is the number of response
variables. The MSE measures the average squared difference between the predicted
and true values of the response variables.
The MAE measures the average absolute difference between the predicted and
true values of the response variables.
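Both loss functions can be computed directly with scikit-learn; in the sketch below, the true and predicted arrays are arbitrary examples with n = 3 observations and p = 2 response variables:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

Y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # hypothetical targets
Y_pred = np.array([[1.1, 11.0], [1.8, 19.0], [3.2, 29.5]])   # hypothetical predictions

mse = mean_squared_error(Y_true, Y_pred)   # averaged over all observations and outputs
mae = mean_absolute_error(Y_true, Y_pred)
print(mse, mae)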
3.4 Evaluation Metrics
In addition to loss functions, evaluation metrics are used to assess the perfor-
mance of multi-output regression models. These metrics provide a comprehensive
understanding of the model's predictive capability. Common evaluation metrics for
multi-output regression include:
where R2 (Yi , Ŷi ) represents the R2 value for the ith response variable.
The RMSE quantifies the average discrepancy between the predicted values and the actual values. A lower RMSE indicates a better fit of the model to the data,
with smaller residuals.
3.4.3 Relative Root Mean Squared Error (RRMSE)
The Relative Root Mean Squared Error (RRMSE) is a normalized version of the
RMSE and is often used to compare the performance of different models or evaluate
the accuracy of a model in relation to the scale of the target variable. RRMSE is
calculated by dividing the RMSE by the range of the target variable.
Let's denote the range of the target variable as R = max(yi ) − min(yi ). The
RRMSE is then given by:
RRMSE = RMSE / R.    (3.7)
By normalizing the RMSE with respect to the range of the target variable, the RRMSE allows for a standardized comparison across different datasets and target variable scales. A lower RRMSE indicates a better predictive accuracy of the model relative to the range of the target variable.
Both RMSE and RRMSE are widely used in regression analysis to evaluate the performance of models and compare different approaches. These metrics provide quantitative measures of the model's predictive accuracy and can help researchers and practitioners in selecting the most suitable model for their specific regression tasks.
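A minimal sketch of the RMSE and RRMSE computations for a single target variable, with arbitrary example values; the range R follows the definition used in equation (3.7):

import numpy as np

y_true = np.array([5.0, 30.0, 80.0, 150.0])    # hypothetical target values
y_pred = np.array([7.0, 28.0, 85.0, 140.0])    # hypothetical predictions

rmse = np.sqrt(np.mean((y_true - y_pred)**2))
R = y_true.max() - y_true.min()                # range of the target variable
rrmse = rmse / R                               # eq. (3.7)
print(rmse, rrmse)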
ple output nodes in the network's architecture, neural networks can learn the joint
representation of the predictors and generate predictions for multiple response vari-
ables.
Additionally, ensemble methods, such as stacking and bagging, can be employed
to combine the predictions from multiple models and enhance the overall predictive
performance in multi-output regression.
where Ŷ represents the predicted output and Ti (X) denotes the prediction of
the ith decision tree.
The Random Forest Regressor introduces randomness in two ways. First, during
the construction of each tree, a random subset of features is considered at each
split point. This random feature selection reduces the correlation among the trees
and improves the diversity of the ensemble. Second, the bootstrap aggregation (or
bagging) technique is used to randomly sample the training data with replacement.
This sampling strategy further enhances the model's robustness.
where Ŷ represents the predicted output and Ti (X) denotes the prediction of
the ith decision tree.
The main advantage of the Extremely Random Trees Regressor is its computational efficiency. Since the splits are generated randomly without any optimization, the training process is faster compared to other tree-based models. Additionally, the ETR model tends to be less prone to overfitting, especially when the dataset contains noisy or irrelevant features.
Both the Random Forest Regressor and the Extremely Random Trees Regressor
are powerful tree-based machine learning models for multi-output regression tasks.
They leverage the ensemble approach to combine the predictions of multiple decision
trees, thereby improving the overall accuracy and robustness of the model. The
Random Forest Regressor selects the best split based on optimal thresholds, while
the Extremely Random Trees Regressor generates splits randomly. These models
offer a flexible and effective solution for capturing complex relationships in XRD peak intensities and can be a valuable tool in various scientific and engineering
applications.
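A minimal sketch of fitting the two tree ensembles to a multi-output regression task with scikit-learn; both estimators handle multiple targets natively, and the synthetic data generated here is only a stand-in for the XRD peak intensities and microstructural parameters used in this work:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 50 features, 3 target variables.
X, Y = make_regression(n_samples=1000, n_features=50, n_targets=3, noise=0.1, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, Y_train)
etr = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_train, Y_train)

print(rf.score(X_test, Y_test), etr.score(X_test, Y_test))   # R^2 on the test set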
The distances between Xquery and all the points in the training dataset are calculated first, where m is the number of features, Xquery represents the query point, and Xi represents the ith training point. The K nearest neighbors are then selected based on the calculated distances. These K neighbors are the data points in the training dataset that have the smallest distances to Xquery.
3. Weighted Averaging: Once the K nearest neighbors are identified, the predicted output for the query point is calculated as the weighted average of the target values of these neighbors. The weights are typically assigned based on the inverse of the distances, giving more weight to the closer neighbors; with uniform weights the prediction reduces to the simple average

Ŷquery = (1/K) Σ_{i=1}^{K} Yi,    (4.4)
where Ŷquery represents the predicted output for the query point and Yi represents
the target value of the ith nearest neighbor.
One crucial aspect of KNN regression is the selection of the optimal value for
K, which determines the number of neighbors considered during prediction. A small
value of K may result in a more flexible and locally adaptive model but can be
sensitive to noise. On the other hand, a large value of K may lead to a smoother but
less responsive model. The choice of K depends on the dataset and the underlying
problem, and it is often determined through cross-validation or other model selection
techniques.
KNN regression is a versatile algorithm that can capture complex relationships
in the data without assuming a specic functional form. However, it can be compu-
tationally expensive, especially for large datasets, as it requires calculating distances
for all training points during prediction.
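The procedure described above corresponds to KNeighborsRegressor in scikit-learn, where weights='distance' selects inverse-distance weighting; a minimal sketch on synthetic stand-in data:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, Y = make_regression(n_samples=1000, n_features=20, n_targets=2, noise=0.1, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# weights='distance' implements inverse-distance weighting of the K neighbours;
# weights='uniform' gives the simple average of eq. (4.4).
knn = KNeighborsRegressor(n_neighbors=5, weights='distance').fit(X_train, Y_train)
print(knn.score(X_test, Y_test))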
4.3.1 Averaging
Averaging, also known as model averaging or democratic voting, is a simple
yet effective ensemble technique. In this approach, multiple individual models are
trained independently on the same dataset, and their predictions are averaged to
obtain the final ensemble prediction. Averaging can be applied to both regression and classification problems.
Mathematically, for a regression problem, the ensemble prediction ŷ is obtained
by averaging the predictions yi of the individual models:
ŷ = (1/N) Σ_{i=1}^{N} yi,
where N is the number of individual models.
For a classication problem, the ensemble prediction is typically determined by
majority voting. Each individual model predicts a class label, and the most common
class label among the models is selected as the ensemble prediction.
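A minimal sketch of prediction averaging with two different scikit-learn regressors; the models and the synthetic data are arbitrary placeholders:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [Ridge(alpha=1.0), RandomForestRegressor(n_estimators=100, random_state=0)]
predictions = [m.fit(X_train, y_train).predict(X_test) for m in models]

# Ensemble prediction: the plain average of the individual model predictions.
y_ensemble = np.mean(predictions, axis=0)
print(np.sqrt(np.mean((y_ensemble - y_test)**2)))   # RMSE of the averaged prediction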
4.3.3 Boosting
Boosting is an ensemble technique that aims to improve the performance of
individual models by sequentially training them in a stage-wise manner. Unlike
bagging, boosting focuses on creating a strong model by iteratively learning from
the mistakes made by the previous models.
Boosting algorithms work by assigning higher weights to misclassified instances in each iteration, forcing subsequent models to focus more on these difficult instances. The final ensemble prediction is obtained by aggregating the predictions of
all individual models, with each model's contribution weighted based on its perfor-
mance.
One popular boosting algorithm is AdaBoost (Adaptive Boosting), which assigns
weights to each instance in the training data and adjusts these weights based on the
misclassification rate of the previous model. AdaBoost places more emphasis on misclassified instances in subsequent iterations, allowing the ensemble to focus on
challenging data points.
Gradient boosting is another widely used boosting algorithm that constructs an
ensemble of decision trees, with each tree trained to minimize the residual errors of
the previous trees. Gradient boosting incorporates gradient information and employs
optimization techniques to find the best splitting points, resulting in a strong and
accurate model.
Both bagging and boosting techniques offer improvements over individual mod-
els by leveraging the diversity and collective wisdom of multiple models. Bagging
reduces variance and improves stability, while boosting focuses on reducing bias and
increasing predictive power.
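Scikit-learn's gradient boosting handles one target at a time, so for the multi-output setting it can be wrapped in MultiOutputRegressor; a minimal sketch on synthetic stand-in data:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

X, Y = make_regression(n_samples=1000, n_features=30, n_targets=3, noise=0.1, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# One boosted ensemble is trained per target variable.
gbr = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                                     random_state=0))
gbr.fit(X_train, Y_train)
print(gbr.score(X_test, Y_test))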
Here, n is the number of neurons in the previous layer, wji denotes the weight
connecting the i-th neuron in the previous layer to the j -th neuron in the current
layer, ai−1 is the output of the i-th neuron in the previous layer, and bj is the bias
term for the j -th neuron.
4.4.2 Working Principles of Neural Networks
Neural networks employ a process called forward propagation to generate pre-
dictions or outputs. During forward propagation, the input data is passed through
the network, and each neuron computes its output based on the weighted sum of
its inputs and an activation function. The output of the last layer represents the
network's prediction.
Activation functions introduce non-linearity to the network, allowing it to learn
complex patterns and relationships in the data. Commonly used activation functions
include sigmoid, tanh, and Rectified Linear Unit (ReLU).
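A minimal NumPy sketch of forward propagation through a single hidden layer; the weights are randomly initialized placeholders, and the layer sizes are arbitrary:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=50)                   # hypothetical input vector (e.g. peak intensities)

W1, b1 = rng.normal(size=(32, 50)), np.zeros(32)   # hidden layer: 50 -> 32 neurons
W2, b2 = rng.normal(size=(3, 32)), np.zeros(3)     # output layer: 32 -> 3 targets

a1 = relu(W1 @ x + b1)    # weighted sum plus bias, passed through the activation
y_hat = W2 @ a1 + b2      # linear output layer for regression
print(y_hat)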
5 Results
5.1 Out-of-core learning
5.2 Dimensionality reduction
5.3 Out of the box methods
5.4 Ensemble methods
5.5 Neural networks
5.6 Serial evaluation
6 Conclusion