Notes - Machine Learning

CONTENTS

1 Introduction To Machine Learning
1.1 History Of Machine Learning
1.2 Machine Learning Categories
1.3 KDD - Knowledge Discovery In Databases
1.4 SEMMA Model
2 Machine Learning Perspective Of Data
2.1 Scales Of Measurements
2.2 Dealing With Missing Data In Machine Learning
2.3 Handling Categorical Data
2.4 Normalizing Data
2.5 Feature Construction Or Generation In Machine Learning
2.6 Correlation And Causation
2.7 ML Polynomial Regression
2.8 Logistic Regression
2.9 ROC Curve
3 Introduction To Machine Learning Algorithms
3.1 Decision Tree Classification Algorithm
3.2 Support Vector Machine Algorithm
3.3 What Is K-Nearest Neighbors Algorithm?
3.4 Time Series Forecasting In Machine Learning
3.5 Clustering In Machine Learning
3.6 Principal Component Analysis
4 Model Diagnostics And Tuning In Machine Learning
4.1 Bias And Variance
4.2 K-Fold Cross-Validation
4.3 Bagging In Machine Learning
4.4 Random Forest Algorithm
4.5 Gradient Boosting In Machine Learning
4.6 Stacking
5 Artificial Neural Network Tutorial
5.1 Perceptron - Single Artificial Neuron
5.2 Multi-Layer Perceptron (Feed Forward Neural Network)
5.3 Restricted Boltzmann Machine
Progress And Profile Of Learner
Basis For Further Pedagogic Decisions
Reporting A Consolidated Learner Profile

1 INTRODUCTION TO MACHINE LEARNING

What is Machine Learning?


In the real world, we are surrounded by humans who can learn from their experiences, and we have computers or machines that simply follow our instructions. But can a machine also learn from experience or past data the way a human does? This is where machine learning comes in.

Introduction to Machine Learning


A subset of artificial intelligence known as machine learning focuses primarily on the
creation of algorithms that enable a computer to independently learn from data and previous
experiences. Arthur Samuel first used the term "machine learning" in 1959. It could be
summarized as follows:
Without being explicitly programmed, machine learning enables a machine to automatically
learn from data, improve performance from experiences, and predict things.
Machine learning algorithms build a mathematical model that, without being explicitly programmed, helps in making predictions or decisions with the assistance of sample historical data, known as training data. For the purpose of developing predictive models, machine learning brings together statistics and computer science. Machine learning constructs or uses algorithms that learn from historical data, and performance improves in proportion to the amount of data we provide.
A machine can learn if it can improve its performance by gaining more data.

How does Machine Learning work


A machine learning system builds prediction models, learns from previous data, and predicts the output for new data whenever it receives it. The more data available, the better the model that can be built, and hence the more accurate the predicted output.
Suppose we have a complex problem in which we need to make predictions. Instead of writing code for it directly, we just need to feed the data to generic algorithms, which build the logic based on the data and predict the output. Machine learning has changed our perspective on such problems.

Features of Machine Learning:


• Machine learning uses data to detect various patterns in a given dataset.
• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is similar to data mining, as both deal with huge amounts of data.

Need for Machine Learning


The demand for machine learning is steadily rising. Machine learning is required because it can perform tasks that are too complex for a person to implement directly. Humans are constrained by our inability to manually process vast amounts of data; as a result, we need computer systems, and this is where machine learning comes in to simplify our lives.
We can train machine learning algorithms by providing them with large amounts of data and allowing them to automatically explore the data, build models, and predict the required output. A cost function can be used to measure how well a machine learning algorithm performs on the given data. Machine learning can save us both time and money.
The significance of machine learning can easily be seen from its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, and so on. Top companies such as Netflix and Amazon have built machine learning models that use a huge amount of data to analyse user interest and recommend products accordingly.

Following are some key points which show the importance of Machine Learning:
• Rapid increase in the production of data
• Solving complex problems that are difficult for a human
• Decision making in various sectors, including finance
• Finding hidden patterns and extracting useful information from data.

Classification of Machine Learning


At a broad level, machine learning can be classified into three types:
1. Supervised learning


2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning
In supervised learning, sample labeled data are provided to the machine learning system for
training, and the system then predicts the output based on the training data.
The system uses labeled data to build a model that understands the datasets and learns about
each one. After the training and processing are done, we test the model with sample data to
see if it can accurately predict the output.
The objective of supervised learning is to map the input data to the output data. Supervised learning is based on supervision, just as a student learns under the supervision of a teacher. Spam filtering is an example of supervised learning.
Supervised learning can be grouped further in two categories of algorithms:
• Classification
• Regression

2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision. The goal of unsupervised learning is to restructure the input data into new
features or a group of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:
• Clustering
• Association

3) Reinforcement Learning


Reinforcement learning is a feedback-based learning method, in which a learning agent gets a


reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to collect the maximum reward points, and in doing so it improves its performance.
A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.

1.1 HISTORY OF MACHINE LEARNING


Some 40-50 years ago, machine learning was science fiction, but today it is part of our daily life. Machine learning is making our day-to-day life easier, from self-driving cars to Amazon's virtual assistant "Alexa". However, the idea behind machine learning is quite old and has a long history. Below are some milestones in the history of machine learning:

The early history of Machine Learning (Pre-1940):


• 1834: Charles Babbage, the father of the computer, conceived a device that could be programmed with punch cards. Although the machine was never built, all modern computers rely on its logical structure.
• 1936: Alan Turing described how a machine could determine and execute a set of instructions.

The era of stored program computers:


• 1943: A human neural network was modelled with an electrical circuit. In 1950, scientists began applying this idea and analysing how human neurons might work.
• 1945: ENIAC, the first electronic general-purpose computer, was completed. It was followed by stored-program computers such as EDSAC in 1949 and EDVAC in 1951.

Computing machinery and intelligence:


• 1950: Alan Turing published a seminal paper, "Computing Machinery and Intelligence," on the topic of artificial intelligence. In it, he asked, "Can machines think?"

Machine intelligence in Games:


• 1952: Arthur Samuel, a pioneer of machine learning, created a program that enabled an IBM computer to play checkers. The program improved the more it played.
• 1959: The term "machine learning" was first coined by Arthur Samuel.

The first "AI" winter:


• The period from 1974 to 1980 was a tough time for AI and ML researchers; this period is known as the first AI winter.
• During this period, machine translation efforts failed and interest in AI declined, which led to reduced government funding for research.

Machine Learning from theory to reality


• 1959: The first neural network was applied to a real-world problem, using an adaptive filter to remove echoes over phone lines.
• 1985: Terry Sejnowski and Charles Rosenberg created NETtalk, a neural network that taught itself to pronounce 20,000 words in one week.
• 1997: IBM's Deep Blue beat chess expert Garry Kasparov, becoming the first computer to defeat a reigning world chess champion.

Machine Learning in the 21st century


2006:
• Geoffrey Hinton and his group introduced the idea of deep learning using deep belief networks.
• Amazon launched the Elastic Compute Cloud (EC2), providing scalable computing resources that made it easier to build and deploy machine learning models.
2007:


• The Netflix Prize competition began, tasking participants with improving the accuracy of Netflix's recommendation algorithm.
• Reinforcement learning made notable progress when a group of researchers used it to train a computer to play backgammon at a high level.
2008:
• Google released the Google Prediction API, a cloud-based service that allowed developers to integrate machine learning into their applications.
• Restricted Boltzmann Machines (RBMs), a kind of generative neural network, gained attention for their ability to model complex data distributions.
2009:
• Deep learning gained ground as researchers demonstrated its effectiveness in various tasks, including speech recognition and image classification.
• The term "Big Data" gained popularity, highlighting the challenges and opportunities associated with handling huge datasets.
2010:
• The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was introduced, driving advances in computer vision and leading to the development of deep convolutional neural networks (CNNs).
2011:
• IBM's Watson defeated human champions on Jeopardy!, demonstrating the potential of question-answering systems and natural language processing.
2012:
• AlexNet, a deep CNN created by Alex Krizhevsky, won the ILSVRC, dramatically improving image classification accuracy and establishing deep learning as a dominant approach in computer vision.
• Google's Brain project, led by Andrew Ng and Jeff Dean, used deep learning to train a neural network that learned to recognize cats from unlabeled YouTube videos.
2013:
• Ian Goodfellow introduced generative adversarial networks (GANs), which made it possible to create realistic synthetic data.
• Google later acquired the startup DeepMind Technologies, which focused on deep learning and artificial intelligence.
2014:
• Facebook presented the DeepFace system, which achieved near-human accuracy in facial recognition.
• AlphaGo, a program created by Google DeepMind, defeated a professional Go player, demonstrating the potential of reinforcement learning in challenging games.
2015:


• Microsoft released the Cognitive Toolkit (formerly known as CNTK), an open-source deep learning library.
• The introduction of attention mechanisms enhanced the performance of sequence-to-sequence models in tasks such as machine translation.
2016:
• Explainable AI, which focuses on making machine learning models easier to understand, began to receive attention.
2017:
• Google DeepMind created AlphaGo Zero, which achieved superhuman Go play without human knowledge, using only reinforcement learning.
• Transfer learning gained prominence, allowing pretrained models to be reused for different tasks with limited data.
• Generative models such as variational autoencoders (VAEs) and Wasserstein GANs enabled better synthesis and generation of complex data.
• These are only some of the notable advancements and milestones in machine learning during this period. The field has continued to advance rapidly beyond 2017, with new breakthroughs, techniques, and applications emerging.

Machine Learning at present:


The field of machine learning has made significant strides in recent years, and its applications are numerous, including self-driving cars, Amazon Alexa, chatbots, and recommender systems. It incorporates supervised and unsupervised learning, clustering, classification, decision trees, SVM algorithms, and reinforcement learning.
Modern machine learning models can be used to make many kinds of predictions, including weather prediction, disease prediction, stock market analysis, and so on.

1.2 MACHINE LEARNING CATEGORIES


Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions.
Machine learning contains a set of algorithms that work on a huge amount of data. Data is
fed to these algorithms to train them, and on the basis of training, they build the model &
perform a specific task.
These ML algorithms help to solve different business problems like Regression,
Classification, Forecasting, Clustering, and Associations, etc.
Based on the methods and way of learning, machine learning is divided into mainly four
types, which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning


In this topic, we will provide a detailed description of the types of Machine Learning along
with their respective algorithms:
1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machines using a "labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the output. More precisely, we first train the machine with the input and corresponding output, and then we ask the machine to predict the output for a test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images, using features such as the shape and size of the tail, the shape of the eyes, colour, and height (dogs are taller, cats are smaller). After training, we input a picture of a cat and ask the machine to identify the object and predict the output. Since the machine is well trained, it will check all the features of the object, such as height, shape, colour, eyes, ears, and tail, and find that it is a cat, so it will put it in the Cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable(x)
with the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.

Categories of Supervised Machine Learning


Supervised machine learning can be classified into two types of problems, which are given
below:
• Classification
• Regression
a) Classification
Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The
classification algorithms predict the categories present in the dataset. Some real-world
examples of classification algorithms are Spam Detection, Email filtering, etc.


Some popular classification algorithms are given below:


• Random Forest Algorithm
• Decision Tree Algorithm
• Logistic Regression Algorithm
• Support Vector Machine Algorithm
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is a continuous value and there is a relationship (often assumed to be linear) between the input and output variables. They are used to predict continuous output variables, such as market trends, weather, and so on.
Some popular Regression algorithms are given below:
• Simple Linear Regression Algorithm
• Multivariate Regression Algorithm
• Decision Tree Algorithm
• Lasso Regression
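
As a concrete illustration of the two problem types (a minimal sketch using scikit-learn; the built-in datasets and the particular models chosen here are only examples, not prescribed by these notes):

# Minimal sketch: one classification model and one regression model with scikit-learn.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a categorical label (iris species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Regression: predict a continuous value (a disease-progression score).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print("regression R^2:", reg.score(X_test, y_test))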

Advantages and Disadvantages of Supervised Learning


Advantages:
• Since supervised learning works with labelled datasets, we can have an exact idea about the classes of objects.
• These algorithms are helpful in predicting the output on the basis of prior
experience.
Disadvantages:
• These algorithms are not able to solve complex tasks.
• It may predict the wrong output if the test data is different from the training
data.
• It requires lots of computational time to train the algorithm.

Applications of Supervised Learning


Some common applications of Supervised Learning are given below:
• Image Segmentation:
o Supervised Learning algorithms are used in image segmentation. In this
process, image classification is performed on different image data with pre-
defined labels.
• Medical Diagnosis:
o Supervised algorithms are also used in the medical field for diagnosis purposes. This is done using medical images and historical data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.
• Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data
to identify the patterns that can lead to possible fraud.


• Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
• Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.

2. Unsupervised Machine Learning


Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning, the
machine is trained using the unlabeled dataset, and the machine predicts the output without
any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.
The main aim of an unsupervised learning algorithm is to group or categorise the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely: suppose there is a basket of fruit images, and we feed it into the machine learning model. The images are totally unknown to the model, and its task is to find patterns and categories among the objects. The machine will discover patterns and differences on its own, such as differences in colour and shape, and predict the output when it is tested with the test dataset.

Categories of Unsupervised Machine Learning


Unsupervised Learning can be further classified into two types, which are given below:
• Clustering
• Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It
is a way to group the objects into a cluster such that the objects with the most similarities
remain in one group and have fewer or no similarities with the objects of other groups. An
example of the clustering algorithm is grouping the customers by their purchasing behaviour.
Some of the popular clustering algorithms are given below:
• K-Means Clustering algorithm
• Mean-shift algorithm
• DBSCAN Algorithm
• Principal Component Analysis
• Independent Component Analysis
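
As a small illustration (a minimal sketch; K-Means and the synthetic data below are assumptions chosen for the example, not something prescribed by these notes):

# Minimal sketch: group unlabeled points into clusters with K-Means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic, unlabeled data standing in for e.g. customer purchasing features.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster index assigned to each point
print(labels[:10])
print(kmeans.cluster_centers_)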
2) Association
Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset. The main aim of this learning algorithm is to
find the dependency of one data item on another data item and map those variables

accordingly so that it can generate maximum profit. This algorithm is mainly applied
in Market Basket analysis, Web usage mining, continuous production, etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-
growth algorithm.

Advantages and Disadvantages of Unsupervised Learning Algorithm


Advantages:
• These algorithms can be used for more complicated tasks than supervised ones, because they work on unlabeled data.
• Unsupervised algorithms are preferable for many tasks, as unlabeled data is much easier to obtain than labelled data.
Disadvantages:
• The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithm is not trained on the exact output in advance.
• Working with unsupervised learning is more difficult, as it works with unlabelled data that does not map to a known output.

Applications of Unsupervised Learning


• Network Analysis: Unsupervised learning is used in document network analysis of text data, for example to identify plagiarism and copyright issues in scholarly articles.
• Recommendation Systems: Recommendation systems widely use
unsupervised learning techniques for building recommendation applications for
different web applications and e-commerce websites.
• Anomaly Detection: Anomaly detection is a popular application of
unsupervised learning, which can identify unusual data points within the dataset.
It is used to discover fraudulent transactions.
• Singular Value Decomposition: Singular Value Decomposition, or SVD, is used to extract particular information from a database, for example extracting information about users located in a particular region.

3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning. It represents the intermediate ground
between Supervised (With Labelled training data) and Unsupervised learning (with no
labelled training data) algorithms and uses the combination of labelled and unlabeled datasets
during the training period.
Semi-supervised learning is the middle ground between supervised and unsupervised learning: it operates on data that contains a few labels but mostly consists of unlabeled examples. Labels are costly to obtain, so in practice an organisation may have only a few of them. Semi-supervised learning thus differs from both supervised and unsupervised learning, which assume the presence or absence of labels, respectively.


The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning. Its main aim is to make effective use of all the available data, rather than only the labelled data used in supervised learning. Typically, similar data points are first clustered with an unsupervised learning algorithm, and the clusters are then used to propagate labels to the unlabeled data. This is worthwhile because labelled data is considerably more expensive to acquire than unlabeled data.
We can picture these approaches with an analogy. Supervised learning is like a student working under the supervision of a teacher at home and at college. If the student analyses the same concept on their own without any help from the teacher, that is unsupervised learning. Under semi-supervised learning, the student first studies the concept under the guidance of a teacher at college and then revises it on their own.
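
As a small illustration (a minimal sketch; scikit-learn's LabelPropagation is only one example of a semi-supervised method, and the synthetic data and 10% labelling rate are assumptions made for the example):

# Minimal sketch: semi-supervised learning with mostly-unlabeled data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Keep only ~10% of the labels; unlabeled points are marked with -1.
rng = np.random.RandomState(0)
y_partial = np.where(rng.rand(len(y)) < 0.1, y, -1)

model = LabelPropagation()
model.fit(X, y_partial)
print("accuracy on all points:", model.score(X, y))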

Advantages and disadvantages of Semi-supervised Learning


Advantages:
• The algorithms are simple and easy to understand.
• It is highly efficient.
• It addresses the drawbacks of both supervised and unsupervised learning algorithms.
Disadvantages:
• Iteration results may not be stable.
• These algorithms cannot easily be applied to network-level data.
• Accuracy is comparatively low.

4. Reinforcement Learning
Reinforcement learning works on a feedback-based process in which an AI agent (a software component) automatically explores its surroundings by trial and error, taking actions, learning from experience, and improving its performance. The agent is rewarded for each good action and penalised for each bad action; hence the goal of a reinforcement learning agent is to maximise the rewards.
In reinforcement learning, there is no labelled data as in supervised learning; agents learn from their experience only.
The reinforcement learning process is similar to the way a human being learns; for example, a child learns various things from experience in day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the agent's moves at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in the form of rewards and penalties.
Owing to the way it works, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
A reinforcement learning problem can be formalised as a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; after each action, the environment responds and generates a new state.
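
As a concrete illustration of this reward-and-penalty loop (a minimal sketch, not taken from these notes: the tiny chain environment and all parameter values below are assumptions made for the example), tabular Q-learning updates an action-value table as follows:

# Minimal sketch: tabular Q-learning on a made-up 5-state chain environment.
import random

n_states, n_actions = 5, 2             # states 0..4; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    # Move left or right; reaching the last state gives reward 1 and ends the episode.
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: explore occasionally, otherwise take the best-known action.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = Q[state].index(max(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)   # learned action values; "right" should dominate in every state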


Categories of Reinforcement Learning


Reinforcement learning is categorized mainly into two types of methods/algorithms:
• Positive Reinforcement Learning: Positive reinforcement learning specifies
increasing the tendency that the required behaviour would occur again by
adding something. It enhances the strength of the behaviour of the agent and
positively impacts it.
• Negative Reinforcement Learning: Negative reinforcement learning works
exactly opposite to the positive RL. It increases the tendency that the specific
behaviour would occur again by avoiding the negative condition.

Real-world Use cases of Reinforcement Learning


• Video Games:
o RL algorithms are very popular in gaming applications, where they are used to achieve super-human performance. Well-known systems that use RL are AlphaGo and AlphaGo Zero.
• Resource Management:
o The "Resource Management with Deep Reinforcement Learning" paper showed how RL can be used to automatically learn to schedule computing resources across waiting jobs in order to minimize average job slowdown.
• Robotics:
o RL is widely being used in Robotics applications. Robots are used in the
industrial and manufacturing area, and these robots are made more
powerful with reinforcement learning. There are different industries that
have their vision of building intelligent robots using AI and Machine
learning technology.
• Text Mining
o Text-mining, one of the great applications of NLP, is now being
implemented with the help of Reinforcement Learning by Salesforce
company.

Advantages and Disadvantages of Reinforcement Learning


Advantages
• It helps in solving complex real-world problems which are difficult to be solved
by general techniques.
• The learning model of RL is similar to how human beings learn; hence the results can be very accurate.
• It helps in achieving long-term results.
Disadvantage
• RL algorithms are not preferred for simple problems.
• RL algorithms require huge data and computations.


• Too much reinforcement learning can lead to an overload of states which can
weaken the results.

1.3 KDD- KNOWLEDGE DISCOVERY IN DATABASES


The term KDD stands for Knowledge Discovery in Databases. It refers to the broad
procedure of discovering knowledge in data and emphasizes the high-level applications of
specific Data Mining techniques. It is a field of interest to researchers in various fields,
including artificial intelligence, machine learning, pattern recognition, databases, statistics,
knowledge acquisition for expert systems, and data visualization.
The main objective of the KDD process is to extract information from data in the context
of large databases. It does this by using Data Mining algorithms to identify what is deemed
knowledge.
Knowledge Discovery in Databases can be seen as an automated, exploratory analysis and modeling of vast data repositories. KDD is the organized procedure of recognizing valid, useful, and understandable patterns in huge and complex data sets. Data Mining is the core of the KDD process, involving algorithms that investigate the data, develop the model, and find previously unknown patterns. The model is used for extracting knowledge from the data, analyzing the data, and making predictions.
The availability and abundance of data today make knowledge discovery and Data Mining a
matter of impressive significance and need. In the recent development of the field, it isn't
surprising that a wide variety of techniques is presently accessible to specialists and experts.

The KDD Process


The knowledge discovery process is iterative and interactive and comprises nine steps. The process is iterative at each stage, meaning that moving back to previous steps may be required. The process has many imaginative aspects, in the sense that no single formula or complete scientific categorization exists for making the correct decisions at each step and for each application type. Thus, it is necessary to understand the process and the different requirements and possibilities at each stage.
The process begins with determining the KDD objectives and ends with the implementation of the discovered knowledge. At that point, the loop is closed and active Data Mining starts. Subsequently, changes may need to be made in the application domain, for example offering different features to cell phone users in order to reduce churn. This closes the loop, the impacts are measured on the new data repositories, and the KDD process is run again. The following is a concise description of the nine-step KDD process, beginning with a managerial step:


1. Building up an understanding of the application domain


This is the initial preliminary step. It sets the scene for understanding what should be done with the various decisions, such as transformation, algorithms, and representation. The individuals in charge of a KDD project need to understand and characterize the objectives of the end-user and the environment in which the knowledge discovery process will occur (including relevant prior knowledge).
2. Choosing and creating a data set on which discovery will be performed
Once the objectives are defined, the data to be used for the knowledge discovery process should be determined. This involves finding out what data is available, obtaining the important data, and then integrating all of it into one data set, including the attributes that will be considered for the process. This step matters because Data Mining learns and discovers from the accessible data; it is the evidence base for building the models. If some significant attributes are missing, the entire study may fail, so in this respect the more attributes that are considered, the better. On the other hand, organizing, collecting, and operating advanced data repositories is expensive, so there is a trade-off to balance against the opportunity to best understand the phenomena. This trade-off is one place where the interactive and iterative nature of KDD shows itself: one begins with the best available data sets and later expands them, observing the effect on knowledge discovery and modeling.
3. Preprocessing and cleansing
In this step, data reliability is improved. It includes data cleansing, for example handling missing values and removing noise or outliers. It might involve complex statistical techniques or the use of a Data Mining algorithm. For example, when one suspects that a specific attribute is unreliable or has many missing values, this attribute can become the target of a supervised Data Mining algorithm: a prediction model for the attribute is created, and the missing data can then be predicted. How much attention one pays to this step depends on many factors. Regardless, studying these aspects is important and is often revealing in itself about enterprise data systems.


4. Data Transformation
In this stage, data suitable for Data Mining is prepared and developed. Techniques here include dimension reduction (for example, feature selection and extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformations). This step can be essential for the success of the entire KDD project, and it is typically very project-specific. For example, in medical assessments, the ratio between attributes may often be the most significant factor rather than each attribute by itself. In business, we may need to consider effects beyond our control, as well as efforts and transient issues, for example the cumulative impact of advertising. Even if we do not use the right transformation at the start, we may obtain a surprising result that hints at the transformation required in the next iteration. Thus, the KDD process feeds back on itself and leads to an understanding of the transformation required.
5. Prediction and description
We are now ready to decide which kind of Data Mining to use, for example classification, regression, or clustering. This mainly depends on the KDD objectives and on the previous steps. There are two major goals in Data Mining: the first is prediction and the second is description. Prediction is usually referred to as supervised Data Mining, while descriptive Data Mining covers the unsupervised and visualization aspects of Data Mining. Most Data Mining techniques rely on inductive learning, where a model is built explicitly or implicitly by generalizing from a sufficient number of training examples. The fundamental assumption of the inductive approach is that the trained model applies to future cases. The technique also takes into account the level of meta-learning for the specific set of available data.
6. Selecting the Data Mining algorithm
Having chosen the technique, we now decide on the strategy. This stage involves choosing a particular method to be used for searching patterns, possibly with multiple inducers. For example, considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees. For each strategy of meta-learning, there are several possibilities for how it can be applied. Meta-learning focuses on explaining what makes a Data Mining algorithm successful or not on a specific problem. Thus, this methodology tries to understand the conditions under which a Data Mining algorithm is most suitable. Each algorithm has parameters and strategies of learning, such as ten-fold cross-validation or another split for training and testing.
7. Utilizing the Data Mining algorithm
Finally, the Data Mining algorithm is applied. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained, for example by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.
8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the objectives defined in the first step. Here we also consider the effect of the preprocessing steps on the Data Mining results, for example adding a feature in step 4 and repeating from there. This step focuses on the comprehensibility and utility of the induced model. The discovered knowledge is also documented for further use. The last step is the use of, and overall feedback on, the discovery results obtained by Data Mining.
9. Using the discovered knowledge
We are now ready to incorporate the knowledge into another system for further action. The knowledge becomes effective in the sense that we may make changes to the system and measure the impact. The success of this step determines the effectiveness of the whole KDD process. There are numerous challenges in this step, such as losing the "laboratory conditions" under which we worked. For example, the knowledge was discovered from a certain static snapshot, usually a fixed set of data, but now the data becomes dynamic. Data structures may change, certain quantities may become unavailable, and the data domain might be modified, for example an attribute may take a value that was not expected previously.

1.4 SEMMA MODEL


SEMMA is a sequence of steps for building machine learning models, incorporated in 'SAS Enterprise Miner', a product by SAS Institute Inc., one of the largest producers of commercial statistical and business intelligence software. The sequential steps guide the development of a machine learning system. Let's look at the five sequential steps to understand them better.

SEMMA model in Machine Learning


Sample: This step is all about selecting a subset of the right size from the large dataset provided for building the model, which helps us build the model efficiently. In this step we identify the dependent variable (the outcome) and the independent variables (the factors). The selected subset of data should be representative of the entire dataset originally collected, which means it should contain sufficient information to be useful. The data is also divided into training and validation sets.
Explore: In this phase, activities are carried out to understand gaps in the data and the relationships between variables. Two key activities are univariate and multivariate analysis. In univariate analysis each variable is examined individually to understand its distribution, whereas in multivariate analysis the relationships between variables are explored. Data visualization is used heavily to help understand the data better. In this step, we analyse all the factors that influence our outcome.
Modify: In this phase, variables are cleaned where required. New derived features are created by applying business logic to existing features based on the requirements. Variables are transformed if necessary. The outcome of this phase is a clean dataset that can be passed to the machine learning algorithm to build the model. In this step, we check whether the data has been fully transformed; if further transformation is needed, we use tools such as a label encoder or label binarizer.
Model: In this phase, various modelling or data mining techniques are applied to the pre-processed data to benchmark their performance against the desired outcomes. In this step, we apply the mathematical techniques that make our outcome more precise and accurate.
Assess: This is the last phase. Here model performance is evaluated against the test data (not used in model training) to ensure reliability and business usefulness. Finally, in this step, we perform the evaluation and interpretation of the results: we compare the model's output with the actual outcomes, analyse the model's limitations, and try to overcome them.
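
As a brief illustration of the Sample and Modify steps (a minimal sketch; SEMMA itself is tool-agnostic and implemented in SAS Enterprise Miner, so the use of pandas/scikit-learn and the file and column names below are assumptions made for the example):

# Minimal sketch of the Sample and Modify steps with pandas and scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("customers.csv")     # hypothetical input file

# Sample: split into training and validation sets.
train_df, valid_df = train_test_split(df, test_size=0.3, random_state=42)
train_df, valid_df = train_df.copy(), valid_df.copy()

# Modify: encode a categorical column ("plan_type" is a hypothetical column name).
# Assumes the validation set contains no categories unseen during fitting.
encoder = LabelEncoder()
train_df["plan_type_encoded"] = encoder.fit_transform(train_df["plan_type"])
valid_df["plan_type_encoded"] = encoder.transform(valid_df["plan_type"])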

===000===


2 MACHINE LEARNING PERSPECTIVE OF DATA

From a machine learning perspective, data is the lifeblood of the entire process. Machine
learning is all about developing algorithms and models that can learn patterns, make
predictions, and automate decision-making tasks based on data. Here's how data fits into the
machine learning pipeline:
1. Data Collection: This is the starting point of any machine learning project. You
gather data from various sources, which could include sensors, databases, web
scraping, user inputs, and more. The quality and quantity of data play a crucial role
in the success of a machine learning model.
2. Data Preprocessing: Raw data often needs to be cleaned and preprocessed. This
includes handling missing values, normalizing data, encoding categorical variables,
and removing outliers. Proper preprocessing is essential to ensure that the data is in
a format that can be used by machine learning algorithms.
3. Feature Engineering: This is the process of selecting or creating relevant features
from the data. Feature engineering can significantly impact the model's
performance. It involves domain knowledge, creativity, and data analysis to decide
which features are most informative for the task at hand.
4. Data Splitting: The data is typically split into training, validation, and testing sets.
The training set is used to train the model, the validation set is used to fine-tune
hyperparameters, and the testing set is used to evaluate the model's performance.
5. Model Training: Machine learning algorithms learn from the training data to build
models. These models can be classifiers, regressors, clustering models, or more
advanced deep learning networks. During training, the model optimizes its
parameters to minimize the difference between its predictions and the actual target
values.
6. Model Evaluation: After training, the model is evaluated using the validation and
testing datasets. Evaluation metrics such as accuracy, precision, recall, F1 score, and
mean squared error are used to assess how well the model performs.
7. Model Fine-Tuning: Based on the evaluation results, hyperparameters may be
adjusted to optimize the model's performance. This process may involve iterations
of training and evaluation until a satisfactory model is obtained.
8. Model Deployment: Once a model is developed and validated, it can be deployed
in a real-world application. This can involve integrating the model into a software
system, a website, or any other relevant platform.
9. Monitoring and Maintenance: Machine learning models require ongoing
monitoring and maintenance. As new data becomes available, the model may need
to be retrained or updated to ensure it continues to make accurate predictions.
10. Feedback Loop: In some cases, machine learning models can benefit from a
feedback loop. This involves collecting data on the model's predictions in real-world
scenarios and using this feedback to further improve the model.
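
To make the splitting, training, and evaluation steps above concrete, here is a minimal end-to-end sketch (the dataset and model are only examples, not prescribed by these notes):

# Minimal end-to-end sketch: split the data, train a model, and evaluate it.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = load_breast_cancer(return_X_y=True)

# Data splitting: hold out a test set; a validation split could be carved
# out of the training portion for hyperparameter tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model training.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Model evaluation.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))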


In summary, data is the foundation of machine learning. The quality of data, along with the
effectiveness of data preprocessing and feature engineering, greatly influences the success of
a machine learning project. The ultimate goal is to develop a model that can learn from data
and make accurate predictions or automate decision-making based on that data.

2.1 SCALES OF MEASUREMENTS


In statistics, variables and numbers are defined and categorised using different scales of measurement. Each level of measurement has specific properties that determine which statistical analyses can be used. In this section, we will look at four types of scales: nominal, ordinal, interval, and ratio.

What is the Scale?


A scale is a device or an object used to measure or quantify any event or another object.
Levels of Measurements
There are four different scales of measurement. The data can be defined as being one of the
four scales. The four types of scales are:
• Nominal Scale
• Ordinal Scale
• Interval Scale
• Ratio Scale

Nominal Scale
A nominal scale is the 1st level of measurement scale in which the numbers serve as “tags” or
“labels” to classify or identify the objects. A nominal scale usually deals with the non-
numeric variables or the numbers that do not have any value.
Characteristics of Nominal Scale
• A nominal scale variable is classified into two or more categories. In this
measurement mechanism, the answer should fall into either of the classes.
• It is qualitative. The numbers are used here to identify the objects.
• The numbers don’t define the object characteristics. The only permissible aspect of
numbers in the nominal scale is “counting.”


Example:
An example of a nominal scale measurement is given below:
What is your gender?
M- Male
F- Female
Here, the variables are used as tags, and the answer to this question should be either M or F.

Ordinal Scale
The ordinal scale is the 2nd level of measurement that reports the ordering and ranking of
data without establishing the degree of variation between them. Ordinal represents the
“order.” Ordinal data is known as qualitative data or categorical data. It can be grouped,
named and also ranked.
Characteristics of the Ordinal Scale
• The ordinal scale shows the relative ranking of the variables
• It identifies and describes the magnitude of a variable
• Along with the information provided by the nominal scale, ordinal scales give the
rankings of those variables
• The interval properties are not known
• The surveyors can quickly analyse the degree of agreement concerning the identified
order of variables
Example:
• Ranking of school students – 1st, 2nd, 3rd, etc.
• Ratings in restaurants
• Evaluating the frequency of occurrences
o Very often
o Often
o Not often
o Not at all
• Assessing the degree of agreement
o Totally agree
o Agree
o Neutral
o Disagree
o Totally disagree

Interval Scale
The interval scale is the 3rd level of measurement scale. It is defined as a quantitative measurement scale in which the difference between two values is meaningful. The values are measured on an exact, equally spaced scale, but the zero point is arbitrary.
Characteristics of Interval Scale:
• The interval scale is quantitative as it can quantify the difference between the values


• It allows calculating the mean and median of the variables


• To understand the difference between the variables, you can subtract the values
between the variables
• The interval scale is the preferred scale in Statistics as it helps to assign any
numerical values to arbitrary assessment such as feelings, calendar types, etc.
Example:
• Likert Scale
• Net Promoter Score (NPS)
• Bipolar Matrix Table

Ratio Scale
The ratio scale is the 4th level of measurement scale, which is quantitative. It is a type of
variable measurement scale. It allows researchers to compare the differences or intervals.
The ratio scale has a unique feature. It possesses the character of the origin or zero points.
Characteristics of Ratio Scale:
• Ratio scale has a feature of absolute zero
• It doesn’t have negative numbers, because of its zero-point feature
• It affords unique opportunities for statistical analysis. The variables can be orderly
added, subtracted, multiplied, divided. Mean, median, and mode can be calculated
using the ratio scale.
• Ratio scale has unique and useful properties. One such feature is that it allows unit
conversions like kilogram – calories, gram – calories, etc.
Example:
An example of a ratio scale is:
What is your weight in Kgs?
• Less than 55 kgs
• 55 – 75 kgs
• 76 – 85 kgs
• 86 – 95 kgs
• More than 95 kgs
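
As a brief illustration of how these scales can be represented in practice (a minimal sketch; pandas is used only as an example and the column values are made up):

# Minimal sketch: representing nominal, ordinal, and ratio data in pandas.
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],                       # nominal: labels only
    "satisfaction": ["Agree", "Neutral", "Totally agree"],   # ordinal: ranked categories
    "weight_kg": [62.5, 81.0, 70.2],                         # ratio: true zero exists
})

# Ordinal: declare an explicit order so comparisons and sorting are meaningful.
order = ["Totally disagree", "Disagree", "Neutral", "Agree", "Totally agree"]
df["satisfaction"] = pd.Categorical(df["satisfaction"], categories=order, ordered=True)

# Nominal: categories with no order.
df["city"] = pd.Categorical(df["city"])

print(df.dtypes)
print(df.sort_values("satisfaction"))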

2.2 DEALING WITH MISSING DATA IN MACHINE LEARNING


Dealing with missing data is a crucial step in the data preprocessing phase of a machine
learning project. Missing data can lead to biased or inaccurate models, so it's important to
handle it effectively. Here are some common strategies for dealing with missing data in
machine learning:
1. Identify Missing Data: First, you need to identify the missing values in your
dataset. Most programming libraries represent missing data as NaN (Not-a-
Number) or null values. You can use functions like isnull() or isna() to detect
missing values in your dataset.
2. Remove Missing Data: The simplest approach is to remove rows or columns with
missing data. This is appropriate when the amount of missing data is small, and you

can afford to discard those instances or features without significantly impacting the
quality of your dataset. However, this approach may result in loss of valuable
information.
3. Imputation: Imputation is the process of filling in missing values with estimated or
predicted values. There are several methods for imputing missing data:
a. Mean, Median, or Mode Imputation: Replace missing values with the
mean, median, or mode of the observed values in that column. This is a
simple and quick method, but it can introduce bias if the missing data is not
missing at random.
b. Constant Value Imputation: Replace missing values with a predefined
constant value. For example, you might replace missing values with zero or
a specific value that makes sense in the context of your data.
c. Regression Imputation: Use regression models to predict missing values
based on the relationships between the missing feature and other features.
This is a more sophisticated approach, and it can capture complex
relationships in the data.
d. K-Nearest Neighbors (KNN) Imputation: For each missing value, find
the K-nearest data points (based on other features) and impute the missing
value as a weighted average of the K-nearest neighbors' values.
e. Multiple Imputation: This method involves generating multiple imputed
datasets with different imputed values and then aggregating the results.
Multiple imputation accounts for the uncertainty associated with imputing
missing data.
4. Missing Data Indicators: Create binary indicator variables to represent
missingness in the dataset. This allows the model to learn from the fact that data
was missing, which can be informative. However, it increases the dimensionality of
the data.
5. Advanced Techniques: There are advanced imputation methods, such as using
machine learning models (e.g., decision trees, random forests, or deep learning) to
predict missing values based on the available data. These methods may be more
accurate when there is a complex relationship between features.
6. Domain Knowledge: In some cases, domain knowledge can guide you in
determining the best approach for handling missing data. For example, you may
know that certain missing values are meaningful and should be treated differently.
7. Time-Series Interpolation: When dealing with time-series data, you can use time-
based interpolation techniques to estimate missing values based on the values before
and after the missing point.
The choice of method for dealing with missing data depends on the nature and amount of
missing data, the specific problem you're working on, and the goals of your machine learning
project. It's important to carefully consider the implications of your chosen method on the
quality and fairness of your model. Additionally, cross-validation and model evaluation
should be performed to ensure that the chosen approach does not introduce bias or
adversely affect the model's performance.
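
To make strategies 1-4 above concrete, here is a minimal sketch using pandas and scikit-learn (the small DataFrame is made up; SimpleImputer with add_indicator=True covers mean imputation together with missing-data indicator columns):

# Minimal sketch: detecting, removing, and imputing missing values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [22, np.nan, 35, 58],
                   "fare": [7.25, 71.83, np.nan, 26.55]})

# 1. Identify missing data.
print(df.isnull().sum())

# 2. Remove rows with missing data.
dropped = df.dropna()

# 3. Mean imputation, with 4. indicator columns marking what was missing.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed = imputer.fit_transform(df)
print(imputed)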


5 Ways To Handle Missing Values In Machine Learning Datasets


In real-world data, there are instances where a particular element is absent for various reasons, such as corrupt data, failure to load the information, or incomplete extraction. Handling missing values is one of the greatest challenges faced by analysts, because making the right decision about how to handle them produces robust data models. Let us look at different ways of imputing the missing values.

Note: We will be using Python libraries such as NumPy, Pandas and scikit-learn to handle these values.
Let us get started. To understand the various methods, we will work with the Titanic dataset.

1. Deleting Rows
This method is commonly used to handle null values. Here, we either delete a particular row if it has a null value for a particular feature, or delete a particular column if more than 70-75% of its values are missing. This method is advised only when there are enough samples in the data set. One has to make sure that deleting the data does not add bias. Removing data leads to loss of information, which may not give the expected results when predicting the output.
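
A minimal pandas sketch of this idea (the code listing from the original article is not reproduced in these notes; the 70-75% threshold and the file name below are assumptions):

# Minimal sketch: drop rows containing nulls, and columns that are mostly null.
import pandas as pd

df = pd.read_csv("titanic.csv")     # hypothetical file path

rows_dropped = df.dropna()          # drop every row that contains a null value
# Keep only columns having at least 25% non-null values
# (i.e. drop columns that are more than ~75% missing).
cols_dropped = df.dropna(axis=1, thresh=int(0.25 * len(df)))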


Pros:
• Complete removal of data with missing values can result in a robust and highly accurate model
• Deleting a particular row or column with no specific information is reasonable, since it does not carry much weight
Cons:
• Loss of information and data
• Works poorly if the percentage of missing values is high (say 30%), compared to the
whole dataset

2. Replacing With Mean/Median/Mode


This strategy can be applied to a feature that has numeric data, such as the age of a person or the ticket fare. We can calculate the mean, median, or mode of the feature and use it to replace the missing values. This is an approximation that can add variance to the data set, but the loss of data is negated, which yields better results than removing rows and columns. Replacing with these three approximations is a statistical approach to handling missing values (note that if the statistic is computed on the whole dataset before the train/test split, information can leak from the test data into training). Another way is to approximate the missing value using the deviation of neighbouring values; this works better if the data is linear.


To replace the missing values with the mean, median, or mode, we can compute the statistic for the feature and fill the nulls with it.
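
A minimal pandas sketch (the original code listing is not reproduced in these notes; 'Age' is used as the example column):

# Minimal sketch: fill the 'Age' column with its mean, median, or mode.
import pandas as pd

df = pd.read_csv("titanic.csv")     # hypothetical file path

df["Age_mean"] = df["Age"].fillna(df["Age"].mean())
df["Age_median"] = df["Age"].fillna(df["Age"].median())
df["Age_mode"] = df["Age"].fillna(df["Age"].mode()[0])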

Pros:
• This is a better approach when the data size is small
• It can prevent data loss which results in removal of the rows and columns
Cons:
• Imputing the approximations adds variance and bias
• Works poorly compared to multiple-imputation methods

3. Assigning An Unique Category


A categorical feature will have a definite number of possibilities, such as gender, for example. Since they have a definite number of classes, we can assign another class for the missing values. Here, the features Cabin and Embarked have missing values which can be replaced with a new category, say, U for 'unknown'. This strategy will add more information to the dataset, which will result in a change of variance. Since they are categorical, we need to apply one-hot encoding to convert them to a numeric form for the algorithm to understand. Let us look at how it can be done in Python:
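A minimal sketch, assuming the Titanic columns 'Cabin' and 'Embarked' and the file name train.csv; one-hot encoding 'Cabin' is shown only for illustration, since it creates many columns:
import pandas as pd

df = pd.read_csv('train.csv')  # file name is an assumption

# Replace missing values with a new category 'U' for 'unknown'
df['Cabin'] = df['Cabin'].fillna('U')
df['Embarked'] = df['Embarked'].fillna('U')

# One-hot encode the categorical features so the algorithm can use them
df_encoded = pd.get_dummies(df, columns=['Cabin', 'Embarked'])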

Pros:
• Fewer possibilities with one extra category, resulting in low variance after one-hot encoding, since it is categorical
• Negates the loss of data by adding a unique category
Cons:
• Adds less variance
• Adds another feature to the model while encoding, which may result in poor performance


4. Predicting The Missing Values


Using the features which do not have missing values, we can predict the nulls with the help
of a machine learning algorithm. This method may result in better accuracy, unless a missing
value is expected to have a very high variance. We will be using linear regression to replace
the nulls in the feature ‘age’, using other available features. One can experiment with
different algorithms and check which gives the best accuracy instead of sticking to a single
algorithm.
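A minimal sketch of this idea with scikit-learn's LinearRegression; the feature list is an assumption (columns that are numeric and complete in the Titanic training data):
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('train.csv')  # file name is an assumption
features = ['Pclass', 'SibSp', 'Parch', 'Fare']  # assumed complete numeric columns

# Train on rows where 'Age' is known, then predict it where it is missing
known = df[df['Age'].notnull()]
unknown = df[df['Age'].isnull()]

model = LinearRegression()
model.fit(known[features], known['Age'])
df.loc[df['Age'].isnull(), 'Age'] = model.predict(unknown[features])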

Pros:
• Imputing the missing variable is an improvement as long as the bias from the same
is smaller than the omitted variable bias
• Yields unbiased estimates of the model parameters
Cons:
• Bias also arises when an incomplete conditioning set is used for a categorical
variable
• Considered only as a proxy for the true values

5. Using Algorithms Which Support Missing Values


KNN is a machine learning algorithm which works on the principle of distance measures. This algorithm can be used when there are nulls present in the dataset. When the algorithm is applied, KNN imputes the missing values by taking the majority (or average) of the K nearest values. In this particular dataset, taking into account a person's age, sex, class, etc., we assume that people having the same data for the above-mentioned features will pay the same kind of fare.


Unfortunately, the standard scikit-learn K-Nearest Neighbour estimators in Python do not accept missing values directly (recent scikit-learn versions do, however, provide a dedicated KNNImputer for imputation).
Another algorithm which can be used here is Random Forest. This model produces a robust result because it works well on non-linear and categorical data. It adapts to the data structure, taking into consideration high variance or bias, and produces better results on large datasets.
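As a side note, recent scikit-learn versions (0.22 and later) do provide a KNN-based imputer; a minimal sketch, assuming such a version is installed and that the listed numeric columns exist:
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('train.csv')  # file name is an assumption
numeric_cols = ['Age', 'Fare', 'Pclass', 'SibSp', 'Parch']

# Impute each missing value from the average of the 5 nearest neighbours
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])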
Pros:
• Does not require creation of a predictive model for each attribute with missing data in the dataset
• Takes the correlation structure of the data into account
Cons:
• Is a very time-consuming process, which can be critical in data mining where large databases are being extracted
• The choice of distance function (Euclidean, Manhattan, etc.) affects the result, and a poor choice may not yield a robust result

Conclusion
Almost every dataset we come across will have some missing values which need to be dealt with. Handling them intelligently so that they give rise to robust models is a challenging task. We have gone through a number of ways in which nulls can be replaced. It is not necessary to handle a particular dataset in one single manner; one can use different methods on different features depending on how and what the data is about. Having some domain knowledge about the data is important, as it can give you an insight into how to approach the problem.

2.3 HANDLING CATEGORICAL DATA


Handling categorical data in machine learning is a crucial aspect of data preprocessing, as
many machine learning algorithms require numerical input. Categorical data represents
discrete categories or labels, such as color, city names, or product types, rather than
numerical values. Here are several common techniques for handling categorical data in
machine learning:
1. Label Encoding:
a. In label encoding, you map each category to a unique integer. For example,
if you have a "Color" column with categories "Red," "Blue," and "Green,"
you could map them to 0, 1, and 2.
b. Label encoding is suitable for ordinal categorical data, where there is a
natural order among categories. However, it may not be suitable for
nominal categorical data, as it implies an order that doesn't exist.
2. One-Hot Encoding (Dummy Variables):
a. One-hot encoding is a popular technique for nominal categorical data. It
creates binary columns for each category and marks the presence of a
category with a 1 and the absence with a 0.

b. For example, "Color" would become three binary columns: "Is_Red," "Is_Blue," and "Is_Green."
c. One-hot encoding is effective but can lead to high dimensionality if you
have many categories. In such cases, you may want to use dimensionality
reduction techniques.
3. Binary Encoding:
a. Binary encoding combines the benefits of label encoding and one-hot
encoding. It first converts categories to integers and then encodes these
integers in binary code.
b. This method is useful when you have many categories and want to reduce
dimensionality.
4. Count Encoding:
a. Count encoding replaces each category with the count of occurrences in the
dataset. It can help capture the importance of each category based on its
frequency in the data.
b. This is especially useful when dealing with high-cardinality categorical
features.
5. Target Encoding (Mean Encoding):
a. Target encoding involves replacing each category with the mean of the
target variable for that category. It's often used for binary classification
problems and can help the model learn relationships between categories
and the target variable.
b. Be cautious when using target encoding to avoid data leakage, and consider
using techniques like k-fold cross-validation to calculate the means.
6. Feature Hashing (Hashing Trick):
a. Feature hashing maps categorical values to a fixed number of columns (a
feature vector) using a hash function. The number of columns is usually
much smaller than the number of categories.
b. This technique is useful when you need to reduce dimensionality and have a
large number of categories.
7. Embedding Layers (for Neural Networks):
a. In deep learning, you can use embedding layers to transform categorical
data into dense numerical vectors. These embeddings are learned by the
neural network during training.
b. This approach is especially useful when working with deep neural networks
and sequential data like text.
8. Leave-One-Out Encoding:
a. Leave-one-out encoding is a variation of target encoding where each
category is replaced with the mean of the target variable for all instances
except the current one. This method can be effective for small datasets.
9. Frequency Encoding:

a. Frequency encoding replaces each category with its frequency or percentage of occurrences in the dataset. This can help capture the popularity of each category.
10. Ordinal Encoding:
For ordinal categorical data, you can manually assign numeric values based on the
order of the categories. For example, "low," "medium," and "high" could be
mapped to 1, 2, and 3, respectively.
The choice of encoding method depends on the nature of the categorical data and the
machine learning algorithm you intend to use. It's important to select the method that best
fits your specific problem and dataset. Additionally, you should consider how different
encoding methods may affect the performance of your machine learning model, including
the risk of introducing bias or increasing dimensionality.
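As a quick illustration of two of the simpler techniques above, label encoding and frequency encoding, here is a minimal sketch using pandas and scikit-learn on a hypothetical 'City' column:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Mumbai']})

# Label encoding: map each category to an integer
df['City_label'] = LabelEncoder().fit_transform(df['City'])

# Frequency encoding: replace each category with its relative frequency
freq = df['City'].value_counts(normalize=True)
df['City_freq'] = df['City'].map(freq)

print(df)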
Data that can take only a limited number of values is referred to as categorical data; these values are often known as categories or levels, and such data is described in two ways: nominal or ordinal. Data that lacks any intrinsic order, such as colors, genders, or animal species, is nominal categorical data, while ordinal categorical data refers to information that is naturally ranked or ordered, such as customer satisfaction levels or educational attainment. We will go through how to handle categorical data in Python in this tutorial.

Setup
pip install pandas
pip install scikit-learn
pip install category_encoders
Categorical data is often represented as text labels, and many machine learning algorithms
require numerical input data. Customer demographics, product classifications, and
geographic areas are just a few examples of real-world datasets that include categorical data
which must be converted into numerical representation before being used in machine
learning algorithms. Therefore, it is important to convert categorical data into a numerical
format before feeding it to a machine learning algorithm. This process is known as encoding.
There are various techniques for encoding categorical data, including one-hot encoding,
ordinal encoding, and target encoding.

Ways to Handle Categorical Data


Example 1 - One Hot Encoding
One-Hot Encoding is a technique used to convert categorical data into numerical format. It creates a binary vector for each category in the dataset; the vector contains a 1 for the category it represents and 0s for all other categories.
The pandas and scikit-learn libraries provide functions to perform One-Hot Encoding. The following code snippet shows how to perform One-Hot Encoding using pandas and scikit-learn.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a pandas DataFrame with categorical data
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the DataFrame using the encoder (returns a sparse matrix)
encoded_data = encoder.fit_transform(df)

# Convert the encoded data into a pandas DataFrame
# (newer scikit-learn versions use get_feature_names_out() instead)
encoded_df = pd.DataFrame(encoded_data.toarray(),
                          columns=encoder.get_feature_names())
print(encoded_df)
Output
x0_blue x0_green x0_red
0 0.0 0.0 1.0
1 1.0 0.0 0.0
2 0.0 1.0 0.0
3 0.0 1.0 0.0
4 0.0 0.0 1.0
Example 2 - Ordinal Encoding
Ordinal encoding is a popular technique for encoding categorical data where each category is given a different numerical value based on its rank or order. The categories with the lowest values receive the smallest integers, while those with the highest values receive the largest integers. This strategy is extremely useful when the categories are naturally ordered, such as ratings (poor, fair, good, outstanding) or educational attainment (high school, college, graduate school). Let us do ordinal encoding using pandas and the category_encoders package:
import pandas as pd
import category_encoders as ce

# create a sample dataset
data = {'category': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# initialize the encoder
encoder = ce.OrdinalEncoder()

# encode the categorical feature
df['category_encoded'] = encoder.fit_transform(df['category'])

# print the encoded dataframe
print(df)
Output
category category_encoded
0 red 1
1 green 2
2 blue 3
3 red 1
4 green 2
As you can see, the red category has been given the value 1, green has been given the value
2, and blue has been given the value 3. The sequence in which the categories occurred in the
original dataset served as the basis for this encoding.
Example 3: Target Encoding using Category Encoders
Target Encoding is another technique used for encoding categorical data, particularly when
dealing with high cardinality features. It replaces each category with the average target value
for that category. Target Encoding is useful when there is a strong relationship between the
categorical feature and the target variable.
import pandas as pd
import category_encoders as ce

# create a sample dataset
data = {'category': ['red', 'green', 'blue', 'red', 'green'], 'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# initialize the encoder
encoder = ce.TargetEncoder()

# encode the categorical feature
df['category_encoded'] = encoder.fit_transform(df['category'], df['target'])

# print the encoded dataframe
print(df)
In this example, we create a sample dataset with a single categorical feature called "category"
and a corresponding target variable called "target". We import the category_encoders library
and initialize a TargetEncoder object. We use the fit_transform() method to encode the
categorical feature based on the target variable and add the encoded feature to the original
dataframe.
Output
category target category_encoded
0 red 1 0.585815
1 green 0 0.585815
2 blue 1 0.652043
3 red 0 0.585815
4 green 1 0.585815


As the output shows, the category column was successfully encoded using target encoding with category_encoders. The encoding is done by TargetEncoder, and its fit_transform method takes two arguments: the column to be encoded and the target variable. (A cols option can also be passed to specify which columns to encode.)
Conclusion
The significance of managing categorical data properly in machine learning applications was
covered in this article. It investigated one-hot encoding, ordinal encoding, and target
encoding as three distinct methods for encoding categorical data in Python. One-hot
encoding is a quick and efficient method, but it can result in a lot more features. When the
order of the categories is known, ordinal encoding is a reasonable option, but it misses the
connection between the categories and the target variable.
Hence, managing categorical data is a crucial component of machine learning systems, and
selecting the proper encoding method is key for producing accurate and trustworthy results.

2.4 NORMALIZING DATA


Normalization is a pre-processing stage for any type of problem statement. In particular, normalization plays an important role in fields such as soft computing and cloud computing for manipulating data and scaling its range down or up before it is used in further stages. There are several normalization techniques, namely Min-Max normalization, Z-score normalization, and decimal scaling normalization.
Normalization is scaling the data to be analyzed to a specific range, such as [0.0, 1.0], to provide better results.

What is Data Normalization?


Data transformation operations, such as normalization and aggregation, are additional data
preprocessing procedures that would contribute toward the success of the data extract
process.
Data normalization consists of remodeling numeric columns to a standard scale. Data
normalization is generally considered the development of clean data. Diving deeper,
however, the meaning or goal of data normalization is twofold:
• Data normalization is the organization of data to appear similar across all records
and fields.
• It increases the cohesion of entry types, leading to cleansing, lead generation,
segmentation, and higher quality data.
Normalization is a scaling technique in Machine Learning applied during data preparation to
change the values of numeric columns in the dataset to use a common scale. It is not
necessary for all datasets in a model. It is required only when features of machine learning
models have different ranges.
Mathematically, we can calculate normalization with the below formula:
Xn = (X - Xminimum) / ( Xmaximum - Xminimum)
• Xn = Value of Normalization
• Xmaximum = Maximum value of a feature


• Xminimum = Minimum value of a feature


Example: Let's assume we have a model dataset having maximum and minimum values of
feature as mentioned above. To normalize the machine learning model, values are shifted
and rescaled so their range can vary between 0 and 1. This technique is also known as Min-
Max scaling. In this scaling technique, we will change the feature values as follows:
Case 1 - If the value of X is the minimum, the numerator will be 0; hence Normalization will also be 0.
Xn = (X - Xminimum) / (Xmaximum - Xminimum)
Put X = Xminimum in the above formula; we get:
Xn = (Xminimum - Xminimum) / (Xmaximum - Xminimum)
Xn = 0
Case 2 - If the value of X is the maximum, then the numerator is equal to the denominator; hence Normalization will be 1.
Xn = (X - Xminimum) / (Xmaximum - Xminimum)
Put X = Xmaximum in the above formula; we get:
Xn = (Xmaximum - Xminimum) / (Xmaximum - Xminimum)
Xn = 1
Case3- On the other hand, if the value of X is neither maximum nor minimum, then values
of normalization will also be between 0 and 1.
Hence, Normalization can be defined as a scaling method where values are shifted and
rescaled to maintain their ranges between 0 and 1, or in other words; it can be referred to
as Min-Max scaling technique.

Normalization techniques in Machine Learning


Although there are so many feature normalization techniques in Machine Learning, few of
them are most frequently used. These are as follows:
• Min-Max Scaling: This technique is also referred to as scaling. As we have
already discussed above, the Min-Max scaling method helps the dataset to shift
and rescale the values of their attributes, so they end up ranging between 0 and
1.
• Standardization scaling:
Standardization scaling is also known as Z-score normalization, in which values are centered
around the mean with a unit standard deviation, which means the attribute becomes zero
and the resultant distribution has a unit standard deviation. Mathematically, we can calculate
the standardization by subtracting the feature value from the mean and dividing it by
standard deviation.
Hence, standardization can be expressed as follows:
Xn = (X - µ) / σ
Here, µ represents the mean of the feature values, and σ represents the standard deviation of the feature values.


However, unlike Min-Max scaling technique, feature values are not restricted to a specific
range in the standardization technique.
This technique is helpful for various machine learning algorithms that use distance measures
such as KNN, K-means clustering, and Principal component analysis, etc. Further, it is
also important that the model is built on assumptions and data is normally distributed.

Importance of Data Normalization


Data normalization removes various anomalies that can make analysis of the data more complicated. Some of those irregularities can arise from deleting data, inserting more data, or updating existing data. Once those problems are worked out and eliminated from the system, further benefits can be gained through other uses of the data and data analysis.
It is largely through data normalization that the data within a data set can be formatted so that it can be visualized and examined.

Advantages of Data Normalization


• We can have more clustered indexes.
• Index searching is often faster.
• Data modification commands are faster.
• Fewer null values and less redundant data, making your data more compact.
• Data modification anomalies are reduced.
• Normalization is conceptually cleaner and easier to maintain and change as your
needs change.
• Searching, sorting, and creating indexes is faster, since tables are narrower, and more
rows fit on a data page.

Disadvantages of Normalization
There are various drawbacks to normalizing a database. A few disadvantages are as follows:
• When information is dispersed over many tables, it becomes necessary to join them together, which extends the work and makes the database harder to navigate.
• Tables will include codes rather than actual data, since repeated data is stored as references to lookup tables, so those tables must constantly be consulted when querying.
• Being designed for programs rather than ad hoc querying, the information model can be exceedingly difficult to query; it is hard to work with without first understanding the client's needs.
• Query performance gradually slows down compared with a denormalized structure.
• To successfully complete the normalization process, it is vital to have a thorough understanding of the various normal forms. Careless use can result in a bad design with substantial irregularities and data inconsistencies.

Need of Normalization
Normalization is generally required when we are dealing with attributes on different scales; otherwise, it may dilute the effectiveness of an equally important attribute (on a lower scale) because other attributes have values on a larger scale. In simple words, when multiple attributes are present but have values on different scales, this may lead to poor data models while performing data mining operations, so they are normalized to bring all the attributes onto the same scale.

Data Normalization Methods


Normalization is a scaling technique, a mapping technique, or a pre-processing stage in which we find a new range from an existing one. It can be very helpful for prediction or forecasting purposes. As we know, there are many ways to predict or forecast, and they can vary considerably from one another, so normalization techniques are required to bring their inputs onto comparable scales. Some existing normalization techniques are mentioned below:
Min-Max normalization: In this technique of data normalization, a linear transformation is performed on the original data. The minimum and maximum values of the data are fetched and each value is replaced according to the following formula:
v' = ((v - min(A)) / (max(A) - min(A))) * (new_max(A) - new_min(A)) + new_min(A)
where A is the attribute data,
min(A), max(A) are the minimum and maximum absolute values of A respectively,
v' is the new value of each entry in the data,
v is the old value of each entry in the data, and
new_max(A), new_min(A) are the maximum and minimum values of the required range (i.e., the boundary values of the range) respectively.
Normalization by decimal scaling: It normalizes by moving the decimal point of the values of the data. To normalize the data by this technique, we divide each value of the data by the maximum absolute value of the data. A data value vi is normalized to vi' by using the formula below:
vi' = vi / 10^j
where j is the smallest integer such that max(|vi'|) < 1.


Z-score normalization or zero-mean normalization: In this technique, values are normalized based on the mean and standard deviation of the data A. The formula used is:
v' = (v - mean(A)) / σA
where v' and v are the new and old values of each entry in the data respectively, and σA and mean(A) are the standard deviation and mean of A respectively.
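A minimal sketch of these three techniques, using scikit-learn for Min-Max and Z-score scaling and plain NumPy for decimal scaling, on a small made-up numeric column:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[20.0], [35.0], [50.0], [80.0], [110.0]])

# Min-Max normalization to the default [0, 1] range
x_minmax = MinMaxScaler().fit_transform(x)

# Z-score (zero-mean) normalization
x_zscore = StandardScaler().fit_transform(x)

# Decimal scaling: divide by 10^j, the smallest power of 10 with max(|x'|) < 1
j = int(np.floor(np.log10(np.max(np.abs(x))))) + 1
x_decimal = x / (10 ** j)

print(x_minmax.ravel(), x_zscore.ravel(), x_decimal.ravel())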

What is the Purpose of Normalization?


As data becomes more useful to all kinds of businesses, the purpose of normalization in a DBMS, that is, the way data is organized when it is present in huge quantities, becomes even more critical. It is evident that good data normalization of the database is used to get better results such as:
• Overall business performance increases.
• Group analysis improves without concerns about redundancy.
• Critical information is not missed because of data that is poorly arranged or unavailable when it is needed.
Deciding to standardize data is therefore one of the most useful things an organization can do. Data normalization is more than simply restructuring the data in a database, as data has increasing value for all businesses. Here are a few of its main advantages:
• cuts down on superfluous data
• ensures consistency of data throughout the database
• improved database design
• more robust database security
• improved and expedited performance
• improved database organization in general

2.5 FEATURE CONSTRUCTION OR GENERATION IN MACHINE LEARNING


Feature construction (or feature engineering) is the process of creating new features from
existing data to improve the performance of a machine learning model. Feature engineering
plays a critical role in enhancing a model's ability to capture meaningful patterns and
relationships in the data. Here are some techniques and considerations for feature
construction in machine learning:
1. Polynomial Features:
a. Create new features by raising existing features to a higher power, such as
squaring, cubing, or using other polynomial transformations. This can help
capture non-linear relationships in the data.
2. Interaction Features:

a. Generate new features by taking the product or interaction of two or more existing features. For example, if you have "length" and "width" features, you can create a new feature "area" by multiplying them.
3. Binning/Discretization:
a. Convert continuous features into categorical features by dividing them into
bins or ranges. This can help the model capture non-linear patterns and
make the relationship between features and the target variable more
apparent.
4. Encoding Cyclical Features:
a. When dealing with cyclical features like time or angles, convert them into
two features representing sine and cosine values. This can help the model
understand the cyclical nature of the data.
5. Logarithmic or Exponential Transformation:
a. Apply logarithmic or exponential transformations to features to reduce the
impact of extreme values or to make the data more normally distributed.
6. Feature Scaling:
a. Normalize or standardize features to ensure they have similar scales. This
can be important for algorithms sensitive to feature scaling, like many
distance-based methods.
7. Feature Extraction:
a. Use dimensionality reduction techniques like Principal Component Analysis
(PCA) or Linear Discriminant Analysis (LDA) to create a smaller set of
features that capture the most important information in the data.
8. Text Data Transformation:
a. For text data, you can apply techniques like TF-IDF (Term Frequency-
Inverse Document Frequency) to convert text into numerical vectors.
Additionally, you can use word embeddings like Word2Vec or GloVe to
represent words as continuous vectors.
9. Time-Series Features:
a. When working with time-series data, you can create features such as lag
values, moving averages, seasonality, and trends to capture temporal
patterns.
10. Domain-Specific Features:
a. Leverage domain knowledge to engineer features that are relevant to the
specific problem. These features can provide valuable insights that the
model may not otherwise capture.
11. Dummy Variables (One-Hot Encoding):
a. Convert categorical variables into binary dummy variables using one-hot
encoding. This allows the model to handle categorical data more effectively.
12. Feature Crosses:
a. Combine two or more categorical features to create new categorical
features. This is especially useful when there might be interactions or
dependencies between categorical features.


13. Feature Aggregation:


a. Summarize data over different groupings, aggregating information using
functions like mean, sum, or count. Aggregated features can be valuable in
many machine learning applications.
14. Derived Features from Dates:
a. Extract information from date-time features, such as day of the week, day
of the month, quarter, or holiday indicators.
15. Geospatial Features:
a. When working with geospatial data, create features based on distances,
spatial relationships, or areas of interest to capture the geography's impact
on the problem.
16. Feature Selection:
a. In some cases, it may be beneficial to reduce the number of features by
selecting the most relevant ones using techniques like feature importance
scores or recursive feature elimination.
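A minimal sketch illustrating a few of the techniques above (polynomial/interaction features, cyclical encoding and binning) with pandas and scikit-learn on a hypothetical toy DataFrame:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({'length': [2.0, 3.0, 5.0], 'width': [1.0, 4.0, 2.0], 'hour': [0, 6, 18]})

# Polynomial and interaction features (length^2, length*width, width^2, ...)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_feats = poly.fit_transform(df[['length', 'width']])

# Manual interaction feature
df['area'] = df['length'] * df['width']

# Cyclical feature (hour of day) encoded as sine and cosine
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Binning a continuous feature into categories
df['length_bin'] = pd.cut(df['length'], bins=[0, 2.5, 4, np.inf], labels=['short', 'medium', 'long'])

print(poly_feats.shape)
print(df)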
The goal of feature construction is to make the data more informative and suitable for the
chosen machine learning algorithm. It often requires domain knowledge, creativity, and an
understanding of the problem context. Additionally, the process should be guided by
rigorous testing and validation to ensure that the engineered features improve the model's
performance without overfitting.

2.6 CORRELATION AND CAUSATION


The difference between correlation and causation
Two or more variables are considered to be related, in a statistical context, if their values change so that as the value of one variable increases or decreases, so does the value of the other variable (although it may be in the opposite direction).
For example, for the two variables "hours worked" and "income earned" there is a
relationship between the two if the increase in hours worked is associated with an increase in
income earned. If we consider the two variables "price" and "purchasing power", as the price
of goods increases a person's ability to buy these goods decreases (assuming a constant
income).
Correlation is a statistical measure (expressed as a number) that describes the size and
direction of a relationship between two or more variables. A correlation between variables,
however, does not automatically mean that the change in one variable is the cause of the
change in the values of the other variable.
Causation indicates that one event is the result of the occurrence of the other event; i.e. there
is a causal relationship between the two events. This is also referred to as cause and effect.
Theoretically, the difference between the two types of relationships is easy to identify: an action or occurrence can cause another (e.g. smoking causes an increase in the risk of developing lung cancer), or it can correlate with another (e.g. smoking is correlated with alcoholism, but it does not cause alcoholism). In practice, however, it remains difficult to clearly establish cause and effect, compared with establishing correlation.


Importance of correlation and causation


The objective of much research or scientific analysis is to identify the extent to which one
variable relates to another variable. For example:
• Is there a relationship between a person's education level and their health?
• Is pet ownership associated with living longer?
• Did a company's marketing campaign increase their product sales?
These and other questions are exploring whether a correlation exists between the two
variables, and if there is a correlation then this may guide further research into investigating
whether one action causes the other. By understanding correlation and causality, it allows for
policies and programs that aim to bring about a desired outcome to be better targeted.

Measuring correlation
For two variables, a statistical correlation is measured by the use of a Correlation Coefficient,
represented by the symbol (r), which is a single number that describes the degree of
relationship between two variables.
The coefficient's numerical value ranges from +1.0 to –1.0, which provides an indication of
the strength and direction of the relationship.
If the correlation coefficient has a negative value (below 0) it indicates a negative relationship
between the variables. This means that the variables move in opposite directions (ie when
one increases the other decreases, or when one decreases the other increases).
If the correlation coefficient has a positive value (above 0) it indicates a positive relationship
between the variables meaning that both variables move in tandem, i.e. as one variable
decreases the other also decreases, or when one variable increases the other also increases.
Where the correlation coefficient is 0 this indicates there is no relationship between the
variables (one variable can remain constant while the other increases or decreases).
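As a small illustration, the correlation coefficient can be computed directly with pandas; the numbers below are made up for the "hours worked" and "income earned" example:
import pandas as pd

df = pd.DataFrame({'hours_worked': [10, 20, 30, 40, 50],
                   'income_earned': [250, 510, 740, 1010, 1260]})

# Pearson correlation coefficient r between the two variables
r = df['hours_worked'].corr(df['income_earned'])
print(r)  # close to +1.0, i.e. a strong positive linear relationship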
While the correlation coefficient is a useful measure, it has its limitations: Correlation
coefficients are usually associated with measuring a linear relationship.
For example, if you compare hours worked and income earned for a tradesperson who
charges an hourly rate for their work, there is a linear (or straight line) relationship since with
each additional hour worked the income will increase by a consistent amount.
If, however, the tradesperson charges based on an initial call out fee and an hourly fee which
progressively decreases the longer the job goes for, the relationship between hours worked
and income would be non-linear, where the correlation coefficient may be closer to 0.
Care is needed when interpreting the value of 'r'. It is possible to find correlations between
many variables, however the relationships can be due to other factors and have nothing to
do with the two variables being considered.
For example, sales of ice creams and the sales of sunscreen can increase and decrease across
a year in a systematic manner, but it would be a relationship that would be due to the effects
of the season (ie hotter weather sees an increase in people wearing sunscreen as well as
eating ice cream) rather than due to any direct relationship between sales of sunscreen and
ice cream.


The correlation coefficient should not be used to say anything about cause and effect
relationship. By examining the value of 'r', we may conclude that two variables are related,
but that 'r' value does not tell us if one variable was the cause of the change in the other.

Establishing causation
Causality is an area of statistics that is commonly misunderstood and misused, in the mistaken belief that because the data shows a correlation there is necessarily an underlying causal relationship.
The use of a controlled study is the most effective way of establishing causality between
variables. In a controlled study, the sample or population is split in two, with both groups
being comparable in almost every way. The two groups then receive different treatments,
and the outcomes of each group are assessed.
For example, in medical research, one group may receive a placebo while the other group is
given a new type of medication. If the two groups have noticeably different outcomes, the
different experiences may have caused the different outcomes.
Due to ethical reasons, there are limits to the use of controlled studies; it would not be
appropriate to use two comparable groups and have one of them undergo a harmful activity
while the other does not. To overcome this situation, observational studies are often used to
investigate correlation and causation for the population of interest. The studies can look at
the groups' behaviours and outcomes and observe any changes over time.
The objective of these studies is to provide statistical information to add to the other sources
of information that would be required for the process of establishing whether or not
causality exists between two variables.

2.7 ML POLYNOMIAL REGRESSION


• Polynomial Regression is a regression algorithm that models the relationship
between a dependent(y) and independent variable(x) as nth degree polynomial. The
Polynomial Regression equation is given below:
• y = b0 + b1x1 + b2x1^2 + b3x1^3 + ...... + bnx1^n
• It is also called the special case of Multiple Linear Regression in ML. Because we
add some polynomial terms to the Multiple Linear regression equation to convert it
into Polynomial Regression.
• It is a linear model with some modification in order to increase the accuracy.
• The dataset used in Polynomial regression for training is of non-linear nature.
• It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
• Hence, " In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a
linear model."

Need for Polynomial Regression:


The need of Polynomial Regression in ML can be understood in the below points:


• If we apply a linear model to a linear dataset, it provides a good result, as we have seen in Simple Linear Regression. But if we apply the same model without any modification to a non-linear dataset, it will produce drastically poor output: the loss function will increase, the error rate will be high, and accuracy will decrease.
• So for such cases, where data points are arranged in a non-linear fashion, we
need the Polynomial Regression model. We can understand it in a better way
using the below comparison diagram of the linear dataset and non-linear dataset.


• In the above image, we have taken a dataset which is arranged non-linearly. So if we
try to cover it with a linear model, then we can clearly see that it hardly covers any
data point. On the other hand, a curve is suitable to cover most of the data points,
which is of the Polynomial model.
• Hence, if the datasets are arranged in a non-linear fashion, then we should use the Polynomial
Regression model instead of Simple Linear Regression.

Equation of the Polynomial Regression Model:


Simple Linear Regression equation: y = b0 + b1x .........(a)
Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn .........(b)
Polynomial Regression equation: y = b0 + b1x + b2x^2 + b3x^3 + .... + bnx^n ..........(c)
When we compare the above three equations, we can clearly see that all three are polynomial equations but differ in the degree of their variables. The Simple and Multiple Linear equations are polynomial equations of degree one, and the Polynomial Regression equation is a linear equation with terms up to the nth degree. So if we add higher-degree terms to our linear equations, they are converted into Polynomial Linear equations.

Implementation of Polynomial Regression using Python:


Here we will implement the Polynomial Regression using Python. We will understand it by
comparing Polynomial Regression model with the Simple Linear Regression model. So first,
let's understand the problem for which we are going to build the model.
Problem Description: There is a Human Resources company which is going to hire a new candidate. The candidate has stated that his previous salary was 160K per annum, and HR has to check whether he is telling the truth or bluffing. To identify this, they only have a dataset from his previous company in which the salaries of the top 10 positions are mentioned with their levels. By checking the available dataset, we have found that there is a non-linear relationship between the Position levels and the salaries. Our goal is to build a Bluffing detector regression model, so HR can hire an honest candidate. Below are the steps to build such a model.

Steps for Polynomial Regression:


The main steps involved in Polynomial Regression are given below:
o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the output.
Data Pre-processing Step:
The data pre-processing step will remain the same as in previous regression models, except for some changes. In the Polynomial Regression model, we will not use feature scaling, and we will also not split our dataset into training and test sets. There are two reasons for this:
o The dataset contains very little information, which makes it unsuitable to divide into a test and training set; otherwise our model will not be able to find the correlations between the salaries and levels.
o In this model, we want very accurate predictions for salary, so the model should have enough information.
The code for the pre-processing step is given below:


# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('Position_Salaries.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, 1:2].values
y= data_set.iloc[:, 2].values
Explanation:
• In the above lines of code, we have imported the important Python libraries to
import dataset and operate on it.
• Next, we have imported the dataset 'Position_Salaries.csv', which contains
three columns (Position, Levels, and Salary), but we will consider only two
columns (Salary and Levels).
• After that, we have extracted the dependent variable (y) and independent variable (x) from the dataset. For the x variable, we have taken the parameters as [:, 1:2], because we want index 1 (Levels) and have included :2 to keep it as a matrix.
Output:
By executing the above code, we can read our dataset as:

As we can see in the above output, there are three columns present (Positions, Levels, and
Salaries). But we are only considering two columns because Positions are equivalent to the
levels or may be seen as the encoded form of Positions.


Here we will predict the output for level 6.5 because the candidate has 4+ years' experience
as a regional manager, so he must be somewhere between levels 7 and 6.
Building the Linear regression model:
Now, we will build and fit the Linear regression model to the dataset. In building polynomial
regression, we will take the Linear regression model as reference and compare both the
results. The code is given below:
#Fitting the Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_regs= LinearRegression()
lin_regs.fit(x,y)
In the above code, we have created the Simple Linear model using lin_regs object
of LinearRegression class and fitted it to the dataset variables (x and y).
Output:
Out[5]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
Building the Polynomial regression model:
Now we will build the Polynomial Regression model, but it will be a little different from the
Simple Linear model. Because here we will use PolynomialFeatures class
of preprocessing library. We are using this class to add some extra features to our dataset.
#Fitting the Polynomial regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_regs= PolynomialFeatures(degree= 2)
x_poly= poly_regs.fit_transform(x)
lin_reg_2 =LinearRegression()
lin_reg_2.fit(x_poly, y)
In the above lines of code, we have used poly_regs.fit_transform(x), because first we are
converting our feature matrix into polynomial feature matrix, and then fitting it to the
Polynomial regression model. The parameter value(degree= 2) depends on our choice. We
can choose it according to our Polynomial features.
After executing the code, we will get another matrix x_poly, which can be seen under the
variable explorer option:


Next, we have used another LinearRegression object, namely lin_reg_2, to fit our x_poly vector to the linear model.
Output:
Out[11]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
Visualizing the result for Linear regression:
Now we will visualize the result for Linear regression model as we did in Simple Linear
Regression. Below is the code for it:
#Visualizing the result for Linear Regression model
mtp.scatter(x,y,color="blue")
mtp.plot(x,lin_regs.predict(x), color="red")
mtp.title("Bluff detection model(Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()
Output:

In the above output image, we can clearly see that the regression line is so far from the
datasets. Predictions are in a red straight line, and blue points are actual values. If we
consider this output to predict the value of CEO, it will give a salary of approx. 600000$,
which is far away from the real value.
So we need a curved model to fit the dataset other than a straight line.
Visualizing the result for Polynomial Regression
Here we will visualize the result of Polynomial regression model, code for which is little
different from the above model.
Code for this is given below:
#Visualizing the result for Polynomial Regression
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
mtp.title("Bluff detection model(Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()
In the above code, we have used lin_reg_2.predict(poly_regs.fit_transform(x)) instead of x_poly, because we want the linear regressor object to predict on the polynomial feature matrix.
Output:

As we can see in the above output image, the predictions are close to the real values. The above plot will vary as we change the degree.
For degree = 3:
If we change the degree to 3, then we will get an even more accurate plot, as shown in the below image.
As we can see in that output image, the predicted salary for level 6.5 is near 170K$-190K$, which suggests that the future employee is telling the truth about his salary.
Degree = 4: Let's again change the degree to 4; now we will get the most accurate plot. Hence we can get more accurate results by increasing the degree of the Polynomial.

Predicting the final result with the Linear Regression model:


Now, we will predict the final output using the Linear regression model to see whether an
employee is saying truth or bluff. So, for this, we will use the predict() method and will pass
the value 6.5. Below is the code for it:
lin_pred = lin_regs.predict([[6.5]])
print(lin_pred)
Output:
[330378.78787879]
Predicting the final result with the Polynomial Regression model:
Now, we will predict the final output using the Polynomial Regression model to compare
with Linear model. Below is the code for it:
poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
print(poly_pred)
Output:
[158862.45265153]
As we can see, the predicted output from Polynomial Regression is [158862.45265153], which is much closer to the real value; hence, we can say that the future employee is telling the truth.

2.8 LOGISTIC REGRESSION


• Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.


• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc.; but instead of giving the exact values 0 and 1, it gives the probabilistic values which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
• Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
• Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The below image is showing the logistic function:

Logistic Function (Sigmoid Function):


• The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, which cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is
called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of the threshold value, which defines
the probability of either 0 or 1. Such as values above the threshold value tends
to 1, and a value below the threshold values tends to 0.
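A minimal sketch of the sigmoid function and a 0.5 threshold, using plain NumPy:
import numpy as np

def sigmoid(z):
    # Maps any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)  # values above the threshold tend to class 1
print(probs, labels)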


Assumptions for Logistic Regression:


• The dependent variable must be categorical in nature.
• The independent variable should not have multi-collinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
• We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn
• In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1 - y):
y / (1 - y); 0 for y = 0, and infinity for y = 1
• But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation, it becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn
The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".

Python Implementation of Logistic Regression (Binomial)


To understand the implementation of Logistic Regression in Python, we will use the below
example:
Example: There is a dataset given which contains the information of various users obtained
from the social networking sites. There is a car making company that has recently launched a
new SUV car. So the company wanted to check how many users from the dataset, wants to
purchase the car.
For this problem, we will build a Machine Learning model using the Logistic regression
algorithm. The dataset is shown in the below image. In this problem, we will predict
the purchased variable (Dependent Variable) by using age and salary (Independent
variables).


Steps in Logistic Regression: To implement the Logistic Regression using Python, we will
use the same steps as we have done in previous topics of Regression. Below are the steps:
o Data Pre-processing step
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we
can use it in our code efficiently. It will be the same as we have done in Data pre-processing
topic. The code for this is given below:
#Data Pre-procesing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
By executing the above lines of code, we will get the dataset as the output. Consider the
given image:


Now, we will extract the dependent and independent variables from the given dataset. Below
is the code for it:
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
In the above code, we have taken [2, 3] for x because our independent variables are age and
salary, which are at index 2, 3. And we have taken 4 for y variable because our dependent
variable is at index 4. The output will be:


Now we will split the dataset into a training set and test set. Below is the code for it:
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
The output for this is given below:

For test set:


For training set:

In logistic regression, we will do feature scaling because we want accurate prediction results. Here we will only scale the independent variables, because the dependent variable has only 0 and 1 values. Below is the code for it:
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)


x_test= st_x.transform(x_test)
The scaled output is given below:

2. Fitting Logistic Regression to the Training set:


We have well prepared our dataset, and now we will train the dataset using the training set.
For providing training or fitting the model to the training set, we will import
the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the model to the
logistic regression. Below is the code for it:
#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the below output:
Out[5]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=0, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
Hence our model is well fitted to the training set.

3. Predicting the Test Result


Our model is well trained on the training set, so we will now predict the result by using test
set data. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the
variable explorer option. It can be seen as:

The above output image shows the corresponding predicted users who want to purchase or
not purchase the car.

4. Test Accuracy of the result


Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function from the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two parameters, y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:
By executing the above code, a new confusion matrix will be created. Consider the below
image:


We can find the accuracy of the predicted result by interpreting the confusion matrix. By
above output, we can interpret that 65+24= 89 (Correct Output) and 8+3= 11(Incorrect
Output).

5. Visualizing the training set result


Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap for visualizing the result. We have created two new variables x_set and y_set to replace x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular grid which extends from the minimum minus 1 to the maximum plus 1 of each feature. The pixel points we have taken are of 0.01 resolution.
To create a filled contour, we have used mtp.contourf command, it will create regions of
provided colors (purple and green). In this function, we have passed
the classifier.predict to show the predicted data points predicted by the classifier.
Output: By executing the above code, we will get the below output:


The graph can be explained in the below points:


• In the above graph, we can see that there are some Green points within the green
region and Purple points within the purple region.
• All these data points are the observation points from the training set, which shows
the result for purchased variables.
• This graph is made by using two independent variables i.e., Age on the x-
axis and Estimated salary on the y-axis.
• The purple point observations are for which purchased (dependent variable) is
probably 0, i.e., users who did not purchase the SUV car.
• The green point observations are for which purchased (dependent variable) is
probably 1 means user who purchased the SUV car.
• We can also estimate from the graph that the users who are younger with low salary,
did not purchase the car, whereas older users with high estimated salary purchased
the car.
• But there are some purple points in the green region (Buying the car) and some
green points in the purple region(Not buying the car). So we can say that younger
users with a high estimated salary purchased the car, whereas an older user with a
low estimated salary did not purchase the car.

The goal of the classifier:


We have successfully visualized the training set result for the logistic regression, and our goal
for this classification is to separate the users who purchased the SUV car from those who did
not. So from the output graph, we can clearly see the two regions (Purple and
Green) with the observation points. The Purple region is for those users who didn't buy the
car, and the Green region is for those users who purchased the car.

Linear Classifier:
As we can see from the graph, the classifier is a straight line, i.e., linear in nature, as we have
used the linear model for Logistic Regression. In further topics, we will learn about non-linear
classifiers.

Visualizing the test set result:


Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain the same as above, except that here
we will use x_test and y_test instead of x_train and y_train. Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
# Build the same grid as before, this time over the test observations
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Colour the decision regions using the already trained classifier
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Scatter the test observations, coloured by their true class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:

The above graph shows the test set result. As we can see, the graph is divided into two
regions (Purple and Green), with the Green observations mostly in the green region and the
Purple observations mostly in the purple region, so we can say it is a good prediction and a good
model. Some of the green and purple data points fall in the wrong regions, which can be ignored,
as we have already quantified this error using the confusion matrix (11 incorrect outputs).
Hence our model is pretty good and ready to make new predictions for this classification
problem.

2.9 ROC CURVE


A ROC (which stands for “receiver operating characteristic”) curve is a graph that shows a
classification model's performance at all classification thresholds. It is a probability curve that
plots two parameters, the True Positive Rate (TPR) against the False Positive Rate (FPR), at
different threshold values, and it separates the so-called ‘signal’ from the ‘noise.’
If the user lowers the classification threshold, more items get classified as positive, which
increases both the False Positives and the True Positives.
What Is a ROC Curve: AUC — Area Under the ROC Curve


AUC is short for "Area Under the ROC Curve," which measures the whole two-
dimensional area located underneath the entire ROC curve from (0,0) to (1,1). The
AUC measures the classifier's ability to distinguish between classes. It is used as a
summary of the ROC curve. The higher the AUC, the better the model can
differentiate between positive and negative classes. AUC supplies an aggregate
measure of the model's performance across all possible classification thresholds.
Model creators want AUC for two chief reasons:
• AUC is scale-invariant. The AUC measures how well the predictions were ranked
instead of measuring their absolute values.
• AUC is classification-threshold-invariant, meaning it measures the quality of the
model's predictions regardless of the classification threshold.
However, AUC has its downsides, which manifest in certain situations:
• Scale invariance is not always wanted. For instance, sometimes, the situation calls for
well-calibrated probability outputs, and AUC doesn’t deliver that.
• Classification-threshold invariance isn't always wanted, especially in cases that show
wide disparities in the cost of false negatives compared to false positives. Instead, it
may be essential to minimize only one type of classification error. For instance, when
designing a model that performs email spam detection, you probably want to
prioritize minimizing false positives, despite resulting in a notable increase of false
negatives. Unfortunately, AUC isn't a good metric for this kind of optimization.

What Is a ROC Curve: How Do You Speculate Model Performance?


AUC is a valuable tool for speculating model performance. An excellent model has its AUC
close to 1, indicating a good separability measure. Consequently, a poor model's AUC leans
closer to 0, showing the worst separability measure. In fact, the proximity to 0 means it
reciprocates the result, predicting the negative class as positive and vice versa, showing 0s as
1s and 1s as 0s. Finally, if the AUC is 0.5, it shows that the model has no class separation
capacity at all.
So, when we have 0.5 < AUC < 1, there’s a high likelihood that the classifier can
distinguish between the positive class values and the negative class values. That’s because the
classifier detects more True Positives and True Negatives than False Positives and False
Negatives.
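To make the idea concrete, here is a minimal sketch of computing and plotting a ROC curve with its AUC using scikit-learn. The toy dataset, the logistic regression classifier, and the variable names are illustrative assumptions, not part of the book's running example.

# A minimal sketch: plot a ROC curve and compute AUC for a binary classifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as mtp

# Toy binary classification data (illustrative only)
x, y = make_classification(n_samples=1000, n_features=4, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Fit a classifier and obtain probability scores for the positive class
clf = LogisticRegression().fit(x_train, y_train)
y_score = clf.predict_proba(x_test)[:, 1]

# FPR and TPR at every threshold, plus the area under the curve
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc_value = roc_auc_score(y_test, y_score)

mtp.plot(fpr, tpr, label='ROC curve (AUC = %.2f)' % auc_value)
mtp.plot([0, 1], [0, 1], linestyle='--', label='Random classifier')
mtp.xlabel('False Positive Rate')
mtp.ylabel('True Positive Rate')
mtp.title('ROC Curve')
mtp.legend()
mtp.show()

The dashed diagonal corresponds to a classifier with no separation capacity (AUC = 0.5); the closer the curve hugs the top-left corner, the higher the AUC.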

The Relation Between Sensitivity, Specificity, FPR, and Threshold


Before we examine the relation between Specificity, FPR, Sensitivity, and Threshold, we
should first cover their definitions in the context of machine learning models. For that, we'll
need a confusion matrix to help us understand the terms better. A confusion matrix summarizes
the four possible outcomes of a binary classifier:

TP stands for True Positive, and TN means True Negative. FP stands for False Positive, and

FN means False Negative.


• Sensitivity: Sensitivity, also termed "recall," is the metric that shows a model's ability
to predict the true positives of all available categories. It shows what proportion of
the positive class was classified correctly. For example, when trying to figure out how
many people have the flu, sensitivity, or True Positive Rate, measures the proportion
of people who have the flu and were correctly predicted as having it.

Here’s how to mathematically calculate sensitivity:


Sensitivity = (True Positive)/(True Positive + False Negative)
• Specificity: The specificity metric evaluates a model's ability to predict true
negatives of all available categories. It shows what proportion of the negative class
was classified correctly. For example, specificity measures the proportion of people
who don't have the flu and were correctly predicted as not suffering from it in our
flu scenario.

Here’s how to calculate specificity:


Specificity = (True Negative)/(True Negative + False Positive)
• FPR: FPR stands for False Positive Rate and shows what proportion of the negative
class was incorrectly classified. This formula shows how we calculate FPR:
• FPR= 1 – Specificity
• Threshold: The threshold is the specified cut-off point for an observation to be
classified as either 0 or 1. Typically, 0.5 is used as the default threshold, although
this is not always the case.
Sensitivity and specificity are inversely proportional, so if we boost sensitivity, specificity
drops, and vice versa. Furthermore, we net more positive values when we decrease the
threshold, thereby raising the sensitivity and lowering the specificity.
On the other hand, if we boost the threshold, we will get more negative values, which results
in higher specificity and lower sensitivity.
And since the FPR is 1 – specificity, when we increase TPR, the FPR also increases and vice
versa.
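As a small illustrative sketch (the labels and probabilities below are made-up values, not from the book's dataset), the following code computes sensitivity, specificity, and FPR from a confusion matrix at a few different thresholds, showing the trade-off described above.

import numpy as nm
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted probabilities for the positive class
y_true = nm.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = nm.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.55, 0.90, 0.45, 0.05])

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)          # apply the cut-off
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)                        # True Positive Rate (recall)
    specificity = tn / (tn + fp)                        # True Negative Rate
    fpr = 1 - specificity                               # False Positive Rate
    print("threshold=%.1f  sensitivity=%.2f  specificity=%.2f  FPR=%.2f"
          % (threshold, sensitivity, specificity, fpr))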


How to Use the AUC - ROC Curve for the Multi-Class Model
We can use the One vs. ALL methodology to plot the N number of AUC ROC Curves for
N number classes when using a multi-class model. One vs. ALL gives us a way to leverage
binary classification. If you have a classification problem with N possible solutions, One vs.
ALL provides us with one binary classifier for each possible outcome.
So, for example, you have three classes named 0, 1, and 2. You will have one ROC for 0
that’s classified against 1 and 2, another ROC for 1, which is classified against 0 and 2, and
finally, the third one of 2 classified against 0 and 1.
We should take a moment and explain the One vs. ALL methodology to better answer the
question “what is a ROC curve?”. This methodology is made up of N separate binary
classifiers. The model runs through the binary classifier sequence during training, training
each to answer a classification question. For instance, if you have a cat picture, you can train
four different recognizers, one seeing the image as a positive example (the cat) and the other
three seeing a negative example (not the cat). It would look like this:
• Is this image a rutabaga? No
• Is this image a cat? Yes
• Is this image a dog? No
• Is this image a hammer? No
This methodology works well with a small number of total classes. However, as the number
of classes rises, the model becomes increasingly inefficient.
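A hedged sketch of the One vs. ALL idea in code, assuming the three-class Iris dataset and scikit-learn (both are illustrative choices, not part of the original text): each class is binarized against the rest and gets its own ROC curve and AUC.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as mtp

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# One model that outputs a probability score per class
clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
y_score = clf.predict_proba(x_test)

# Binarize the test labels: column i means "class i vs. all the rest"
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

for i in range(3):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    mtp.plot(fpr, tpr, label='Class %d vs. rest (AUC = %.2f)' % (i, auc(fpr, tpr)))

mtp.plot([0, 1], [0, 1], linestyle='--')
mtp.xlabel('False Positive Rate')
mtp.ylabel('True Positive Rate')
mtp.legend()
mtp.show()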

===000===


3. INTRODUCTION TO MACHINE LEARNING ALGORITHMS

Machine learning algorithms are computational techniques that enable computers to learn
and make predictions or decisions based on data. They are a fundamental part of the field of
artificial intelligence and are used in a wide range of applications, from image and speech
recognition to recommendation systems and autonomous vehicles. Machine learning
algorithms can be categorized into several main types, including supervised learning,
unsupervised learning, and reinforcement learning. Here's a brief introduction to these
categories:
1. Supervised Learning:
a. Supervised learning is one of the most common types of machine learning.
It involves training a model on a labeled dataset, where the input data is
paired with corresponding output labels. The goal is to learn a mapping
from inputs to outputs.
b. Common algorithms in supervised learning include:
i. Linear Regression: Used for predicting continuous numeric
values (e.g., predicting house prices).
ii. Logistic Regression: Used for binary classification tasks (e.g.,
spam detection).
iii. Decision Trees and Random Forests: Effective for both
classification and regression tasks.
iv. Support Vector Machines (SVM): Used for classification and
regression, with a focus on maximizing the margin between classes.
v. Neural Networks: Deep learning models with multiple layers of
neurons, suitable for a wide range of tasks, from image recognition
to natural language processing.
vi. Naive Bayes: A probabilistic algorithm often used for text
classification.
vii. K-Nearest Neighbors (K-NN): Used for classification and
regression based on the nearest data points in the training set.
2. Unsupervised Learning:
a. Unsupervised learning involves working with unlabeled data. The goal is to
discover hidden patterns or structures within the data, such as clustering
similar data points or reducing the dimensionality of the data.
b. Common algorithms in unsupervised learning include:
i. K-Means Clustering: Used for grouping similar data points into
clusters.
ii. Hierarchical Clustering: Builds a hierarchy of clusters.
iii. Principal Component Analysis (PCA): Reduces the
dimensionality of data while preserving as much variance as
possible.


iv. Autoencoders: Neural networks used for feature learning and


dimensionality reduction.
v. Generative Adversarial Networks (GANs): Used to generate
new data samples that are similar to the training data.
vi. Topic Modeling (e.g., Latent Dirichlet Allocation): Identifies
topics in text data.
3. Reinforcement Learning:
a. Reinforcement learning is concerned with making sequences of decisions to
maximize a cumulative reward. It's often used in settings where an agent
interacts with an environment, taking actions to achieve a goal.
b. Common reinforcement learning algorithms include:
i. Q-Learning: A model-free reinforcement learning algorithm used
for discrete action spaces.
ii. Deep Q-Networks (DQNs): Deep reinforcement learning
models that use neural networks to approximate Q-values.
iii. Policy Gradients: Used for learning policies directly, particularly
in continuous action spaces.
iv. Actor-Critic Methods: Combines the advantages of policy and
value-based approaches.
Machine learning algorithms are employed in various domains and applications, from
finance and healthcare to robotics and natural language processing. The choice of algorithm
depends on the nature of the problem, the available data, and the specific goals of the
project. It's important to experiment with different algorithms, preprocess the data
effectively, and fine-tune the model to achieve the best results. Additionally, machine
learning continues to evolve, with ongoing research and the development of new algorithms
and techniques to address increasingly complex and diverse tasks.

3.1 DECISION TREE CLASSIFICATION ALGORITHM


• Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
• In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
• The decisions or the test are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.


• In order to build a tree, we use the CART algorithm, which stands


for Classification and Regression Tree algorithm.
• A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits into subtrees.
• The below diagram explains the general structure of a decision tree:

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:
• Decision Trees usually mimic human thinking ability while making a decision,
so it is easy to understand.
• The logic behind the decision tree can be easily understood because it shows a
tree-like structure.

Decision Tree Terminologies


• Root Node: Root node is from where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.

How does the Decision Tree algorithm Work?


In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record


(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues this process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where you cannot further
classify the nodes; the final node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the
root node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:


Attribute Selection Measures


While implementing a decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure, or ASM. With this measurement,
we can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:
• Information Gain
• Gini Index
1. Information Gain:
• Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the
decision tree.
• A decision tree algorithm always tries to maximize the value of information
gain, and a node/attribute having the highest information gain is split first. It
can be calculated using the below formula:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
• S= Total number of samples
• P(yes)= probability of yes
• P(no)= probability of no
2. Gini Index:
• Gini index is a measure of impurity or purity used while creating a decision tree
in the CART(Classification and Regression Tree) algorithm.
• An attribute with the low Gini index should be preferred as compared to the
high Gini index.
• It only creates binary splits, and the CART algorithm uses the Gini index to
create binary splits.
• Gini index can be calculated using the below formula:
Gini Index = 1 - ∑j (Pj)²
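To make the two measures concrete, here is a minimal sketch (the labels and the split below are a made-up example, not from the book) that computes entropy, the Gini index, and the information gain of one candidate split.

import math

def entropy(labels):
    # Entropy(S) = - sum over classes of P(class) * log2 P(class)
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def gini(labels):
    # Gini Index = 1 - sum over classes of P(class)^2
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def information_gain(parent, children):
    # IG = Entropy(parent) - weighted average entropy of the child subsets
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split of 10 samples into two subsets by some attribute
parent = ['yes'] * 6 + ['no'] * 4
left = ['yes'] * 5 + ['no'] * 1
right = ['yes'] * 1 + ['no'] * 3
print("Entropy(parent):", round(entropy(parent), 3))
print("Gini(parent):", round(gini(parent), 3))
print("Information gain of the split:", round(information_gain(parent, [left, right]), 3))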

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning. There are mainly two types of
tree pruning techniques used:


• Cost Complexity Pruning


• Reduced Error Pruning.
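In practice, the CART algorithm described above is available in scikit-learn. The short sketch below is illustrative (the Iris dataset and the chosen hyperparameters are assumptions, not the book's job-offer example): the criterion switches between the Gini index and entropy, and max_depth acts as a simple form of pre-pruning.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# criterion='gini' (default) or 'entropy' (information gain); max_depth limits tree growth
classifier = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
classifier.fit(x_train, y_train)
print("Test accuracy:", classifier.score(x_test, y_test))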

Advantages of the Decision Tree


• It is simple to understand, as it follows the same process which a human follows
while making any decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


• The decision tree contains lots of layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
• For more class labels, the computational complexity of the decision tree may
increase.

3.2 SUPPORT VECTOR MACHINE ALGORITHM


Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called as support vectors, and hence algorithm is termed as Support
Vector Machine. Consider the below diagram in which there are two different categories that
are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs, so if we want a
model that can accurately identify whether it is a cat or dog, so such a model can be created


by using the SVM algorithm. We will first train our model with lots of images of cats and
dogs so that it can learn about different features of cats and dogs, and then we test it with
this strange creature. The SVM creates a decision boundary between these two classes (cat and
dog) and chooses the extreme cases (support vectors), so it looks at the extreme cases of
cat and dog. On the basis of these support vectors, it will classify the new creature as a cat.
Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM
SVM can be of two types:
• Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such
data is termed linearly separable data, and the classifier used is called a Linear
SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such
data is termed non-linear data, and the classifier used is called a Non-linear SVM
classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and
if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum
distance to the nearest data points of either class.


Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the
position of the hyperplane are termed as Support Vector. Since these vectors support the
hyperplane, hence called a Support vector.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2.
We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:

So as it is 2-d space so by just using a straight line, we can easily separate these two classes.
But there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane. SVM algorithm finds the closest point of the
lines from both the classes. These points are called support vectors. The distance between
the vectors and the hyperplane is called as margin. And the goal of SVM is to maximize this
margin. The hyperplane with maximum margin is called the optimal hyperplane.


Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:


Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert
it in 2d space with z=1, then it will become as:

Hence we get a circumference of radius 1 in case of non-linear data.
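The following sketch contrasts a linear and a non-linear SVM with scikit-learn. The moons dataset is an illustrative assumption for non-linearly separable data; the RBF kernel plays the same role as the manual z = x² + y² transformation above, implicitly adding an extra dimension via the kernel trick.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-circles: not separable by a single straight line
x, y = make_moons(n_samples=300, noise=0.2, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

linear_svm = SVC(kernel='linear').fit(x_train, y_train)   # Linear SVM
rbf_svm = SVC(kernel='rbf').fit(x_train, y_train)         # Non-linear SVM (kernel trick)

print("Linear SVM accuracy:", linear_svm.score(x_test, y_test))
print("RBF SVM accuracy:", rbf_svm.score(x_test, y_test))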

3.3 WHAT IS K-NEAREST NEIGHBORS ALGORITHM?


K-Nearest Neighbours is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not
make any underlying assumptions about the distribution of data (as opposed to other
algorithms such as GMM, which assume a Gaussian distribution of the given data). We are
given some prior data (also called training data), which classifies coordinates into groups
identified by an attribute.
As an example, consider the following table of data points containing two features:

KNN Algorithm working visualization


Now, given another set of data points (also called testing data), allocate these points to a
group by analyzing the training set. Note that the unclassified points are marked as
‘White’.


Intuition Behind KNN Algorithm


If we plot these points on a graph, we may be able to locate some clusters or groups. Now,
given an unclassified point, we can assign it to a group by observing what group its nearest
neighbors belong to. This means a point close to a cluster of points classified as ‘Red’ has
a higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’ and the
second point (5.5, 4.5) should be classified as ‘Red’.

Distance Metrics Used in KNN Algorithm


As we know that the KNN algorithm helps us identify the nearest points or the groups for
a query point. But to determine the closest groups or the nearest points for a query point
we need some metric. For this purpose, we use below distance metrics:
• Euclidean Distance
• Manhattan Distance
• Minkowski Distance

Euclidean Distance
This is nothing but the cartesian distance between the two points which are in the
plane/hyperplane. Euclidean distance can also be visualized as the length of the straight
line that joins the two points which are into consideration. This metric helps us calculate
the net displacement done between the two states of an object.

Manhattan Distance
This distance metric is generally used when we are interested in the total distance traveled
by the object instead of the displacement. This metric is calculated by summing the
absolute difference between the coordinates of the points in n-dimensions.

Minkowski Distance
We can say that the Euclidean as well as the Manhattan distance are special cases of the
Minkowski distance, which for two n-dimensional points is defined as
D(x, y) = ( Σ |xi - yi|^p )^(1/p)
From this formula we can see that when p = 2 it is the same as the formula for
the Euclidean distance, and when p = 1 we obtain the formula for the Manhattan
distance.
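A tiny illustrative sketch of the three metrics (the two points are arbitrary examples): a single Minkowski function reproduces the Manhattan distance with p = 1 and the Euclidean distance with p = 2.

def minkowski(a, b, p):
    # Minkowski distance: (sum of |x_i - y_i|^p) raised to the power 1/p
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (2.5, 7.0), (5.5, 4.5)   # two hypothetical data points
print("Manhattan (p=1):", minkowski(a, b, 1))
print("Euclidean (p=2):", minkowski(a, b, 2))
print("Minkowski (p=3):", minkowski(a, b, 3))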


The above-discussed metrics are the most common ones while dealing with a Machine
Learning problem, but there are other distance metrics as well, such as Hamming
Distance, which comes in handy for problems that require comparing two vectors element
by element, where the contents can be boolean as well as string values.

How to choose the value of k for KNN Algorithm?


The value of k is very crucial in the KNN algorithm to define the number of neighbors in
the algorithm. The value of k in the k-nearest neighbors (k-NN) algorithm should be
chosen based on the input data. If the input data has more outliers or noise, a higher value
of k would be better. It is recommended to choose an odd value for k to avoid ties in
classification. Cross-validation methods can help in selecting the best k value for the given
dataset.
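As a minimal sketch of the cross-validation approach mentioned above (the Iris dataset and the range of candidate values are illustrative assumptions), the code below scans odd values of k and keeps the one with the best 5-fold cross-validated accuracy.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

best_k, best_score = None, 0.0
for k in range(1, 22, 2):                       # odd values of k to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, x, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print("Best k:", best_k, "with cross-validated accuracy", round(best_score, 3))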

Applications of the KNN Algorithm


• Data Preprocessing – While dealing with any Machine Learning problem, we first
perform the EDA step, in which we may find that the data contains missing values;
multiple imputation methods are available to handle them. One such method is the
KNN Imputer, which is quite effective and generally used for sophisticated
imputation methodologies.
• Pattern Recognition – KNN algorithms work very well for pattern recognition; for
example, if you train a KNN classifier on the MNIST dataset and then evaluate it,
you will find that the accuracy is very high.
• Recommendation Engines – The main task performed by a KNN algorithm is to
assign a new query point to a pre-existing group that has been created using a huge
corpus of data. This is exactly what is required in recommender systems: assign each
user to a particular group and then provide them recommendations based on that
group’s preferences.

Advantages of the KNN Algorithm


• Easy to implement as the complexity of the algorithm is not that high.
• Adapts Easily – Because the KNN algorithm stores all the data in memory,
whenever a new example or data point is added, the algorithm adjusts itself to that
new example and lets it contribute to future predictions as well.
• Few Hyperparameters – The only parameters required when training a KNN
algorithm are the value of k and the choice of the distance metric that we would
like to use.

Disadvantages of the KNN Algorithm


• Does not scale – The KNN algorithm is often called a lazy algorithm: it does no
real work at training time and instead stores the entire dataset, so prediction requires
lots of computing power as well as data storage. This makes the algorithm both
time-consuming and resource-intensive.
• Curse of Dimensionality – Because of the peaking phenomenon, the KNN
algorithm is affected by the curse of dimensionality, which means it has a hard time
classifying the data points properly when the dimensionality is too high.
• Prone to Overfitting – As the algorithm is affected by the curse of
dimensionality, it is prone to the problem of overfitting as well. Hence,
feature selection as well as dimensionality reduction techniques are generally
applied to deal with this problem.

Example Program:
Assume 0 and 1 as the two classifiers (groups).

# Python3 program to find the group of an unknown
# point using the K nearest neighbour algorithm.

import math

def classifyAPoint(points, p, k=3):
    '''
    This function finds the classification of p using the
    k nearest neighbour algorithm. It assumes only two
    groups and returns 0 if p belongs to group 0, else
    1 (belongs to group 1).

    Parameters -
    points: dictionary of training points having two keys - 0 and 1;
            each key has a list of training data points belonging to that group
    p: a tuple, the test data point of the form (x, y)
    k: number of nearest neighbours to consider, default is 3
    '''
    distance = []
    for group in points:
        for feature in points[group]:
            # Calculate the Euclidean distance of p from the training point
            euclidean_distance = math.sqrt((feature[0] - p[0])**2 + (feature[1] - p[1])**2)
            # Add a tuple of the form (distance, group) to the distance list
            distance.append((euclidean_distance, group))

    # Sort the distance list in ascending order and select the first k distances
    distance = sorted(distance)[:k]

    freq1 = 0  # frequency of group 0
    freq2 = 0  # frequency of group 1
    for d in distance:
        if d[1] == 0:
            freq1 += 1
        elif d[1] == 1:
            freq2 += 1

    return 0 if freq1 > freq2 else 1

# driver function
def main():
    # Dictionary of training points having two keys - 0 and 1;
    # key 0 has points belonging to class 0,
    # key 1 has points belonging to class 1
    points = {0: [(1, 12), (2, 5), (3, 6), (3, 10), (3.5, 8), (2, 11), (2, 9), (1, 7)],
              1: [(5, 3), (3, 2), (1.5, 9), (7, 2), (6, 1), (3.8, 1), (5.6, 4), (4, 2), (2, 5)]}

    # testing point p(x, y)
    p = (2.5, 7)

    # number of neighbours
    k = 3

    print("The value classified to unknown point is: {}".format(
        classifyAPoint(points, p, k)))

if __name__ == '__main__':
    main()

# This code is contributed by Atul Kumar (www.fb.com/atul.kr.007)


Output:
The value classified to the unknown point is: 0
Time Complexity: O(N log N), dominated by sorting the N computed distances
Auxiliary Space: O(N), for the list of distances

3.4 TIME SERIES FORECASTING IN MACHINE LEARNING


Time series forecasting is a subfield of machine learning and statistics that focuses on
predicting future data points in a time-ordered sequence. Time series data is prevalent in
various domains, including finance (stock prices), weather (temperature and precipitation),
sales (retail demand), and many others. Forecasting in time series involves understanding
past patterns, trends, and seasonal fluctuations to make predictions about future values.
Here's an overview of time series forecasting in machine learning:

Key Concepts in Time Series Forecasting:


1. Time Series Data: Time series data consists of data points collected or recorded at
discrete time intervals. The time dimension is a critical aspect of these data, as the
order of observations matters.
2. Components of Time Series:
a. Trend: The long-term movement or pattern in the data. Trends can be
upward (increasing), downward (decreasing), or stable.
b. Seasonality: Regular, repeating patterns or cycles in the data, often
associated with specific time intervals (e.g., daily, weekly, yearly).
c. Noise/Irregularity: Random fluctuations or noise that is not explained by
trends or seasonality.

Approaches for Time Series Forecasting:


1. Traditional Statistical Methods:
a. Techniques like ARIMA (AutoRegressive Integrated Moving Average) and
Exponential Smoothing models are commonly used for time series
forecasting.
b. ARIMA models capture both autoregressive (past values) and moving
average (past errors) components in the data.
2. Machine Learning Approaches:
a. Machine learning models, particularly regression models, can be used for
time series forecasting. Common algorithms include linear regression,
decision trees, and random forests.
b. For deep learning enthusiasts, recurrent neural networks (RNNs) and Long
Short-Term Memory networks (LSTMs) have been successful in modeling
sequential data for forecasting.
3. Hybrid Methods:
a. Some advanced approaches combine statistical and machine learning
techniques. For instance, Facebook Prophet combines a seasonal


decomposition of time series with an additive model and uses Bayesian


methods for trend forecasting.
4. Feature Engineering:
a. Time series data often benefits from appropriate feature engineering. This
includes creating lag features (using past values), rolling statistics (e.g.,
moving averages), and encoding cyclical features for capturing seasonality.
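As a hedged sketch of the machine-learning approach with lag features (the synthetic series, the 12 lags, and the use of scikit-learn are illustrative assumptions), the code below predicts each value from the previous observations and evaluates on the time-ordered tail of the series.

import numpy as nm
from sklearn.linear_model import LinearRegression

# Synthetic series: upward trend + seasonal cycle + noise (illustrative only)
t = nm.arange(200)
series = 0.05 * t + nm.sin(2 * nm.pi * t / 12) + nm.random.default_rng(0).normal(0, 0.1, 200)

# Lag features: predict y[t] from the previous n_lags observations
n_lags = 12
X = nm.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
y = series[n_lags:]

# Time-ordered split: never shuffle time series data
split = int(0.8 * len(y))
model = LinearRegression().fit(X[:split], y[:split])
print("One-step-ahead R^2 on the held-out tail:", round(model.score(X[split:], y[split:]), 3))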

Steps in Time Series Forecasting:


1. Data Preprocessing:
a. Clean and preprocess the data, handling missing values and outliers, and
converting it into a suitable format for modeling.
2. Exploratory Data Analysis (EDA):
a. Understand the time series data by visualizing it, analyzing trends and
seasonality, and identifying any patterns.
3. Splitting Data:
a. Divide the time series data into training, validation, and test sets to train and
evaluate the forecasting model.
4. Model Selection:
a. Choose an appropriate forecasting model based on the characteristics of the
time series data.
5. Model Training:
a. Train the selected model on the training data to learn patterns and
relationships in the time series.
6. Model Evaluation:
a. Assess the model's performance using the validation set, using metrics such
as Mean Absolute Error (MAE), Mean Squared Error (MSE), or Root
Mean Squared Error (RMSE).
7. Hyperparameter Tuning:
a. Fine-tune the model and hyperparameters to optimize forecasting accuracy.
8. Model Deployment:
a. Deploy the forecasting model to make predictions for future time periods.
Time series forecasting can be challenging due to the temporal dependencies, seasonality,
and trends that need to be considered. It often requires domain expertise to interpret and
validate the results effectively. Accurate forecasting is crucial for businesses and
organizations to make informed decisions, allocate resources, and plan for the future.

Advanced Techniques in Time Series Forecasting:


1. ARIMA and Seasonal Decomposition of Time Series (STL):
a. ARIMA models (AutoRegressive Integrated Moving Average) are powerful
for capturing linear trends and seasonality. Seasonal Decomposition of
Time Series (STL) is a technique that decomposes time series data into
trend, seasonality, and remainder components, making it easier to model.
2. Exponential Smoothing State Space Models (ETS):


a. ETS models capture the error, trend, and seasonality components of time
series data. These models are particularly useful when the data exhibits
exponential growth or decay.
3. Prophet by Facebook:
a. Prophet is an open-source forecasting tool developed by Facebook. It is
designed to handle time series data with daily observations that display
patterns on different time scales. Prophet allows users to incorporate
holidays and special events.
4. Long Short-Term Memory (LSTM) Networks:
a. LSTM networks, a type of recurrent neural network (RNN), are effective in
capturing long-term dependencies in sequential data. LSTMs are well-suited
for time series forecasting tasks, especially when dealing with complex and
non-linear patterns.
5. Attention Mechanisms:
a. Attention mechanisms, often used in sequence-to-sequence models, allow
the model to focus on different parts of the input sequence when making
predictions. This can be beneficial in capturing relevant temporal patterns.
6. Ensemble Methods:
a. Ensemble methods, such as combining multiple models or predictions, can
enhance forecasting accuracy. Techniques like bagging (Bootstrap
Aggregating) or stacking can be applied to time series forecasting models.
7. Hyperparameter Optimization:
a. Grid search or randomized search can be employed for hyperparameter
tuning to find the optimal configuration for the forecasting model.
8. Probabilistic Forecasting:
a. Instead of providing a single point estimate, probabilistic forecasting
models offer a distribution of possible outcomes. This approach is valuable
in capturing uncertainty and providing more informative predictions.
9. Backtesting:
a. Backtesting involves assessing the performance of a forecasting model on
historical data. This helps validate the model's effectiveness and
generalization to unseen data.
10. Online Learning:
a. For scenarios where data arrives sequentially, online learning techniques
allow the model to continuously update and adapt to new information.

Challenges and Considerations:


1. Data Stationarity:
a. Many time series models assume stationarity (constant statistical properties
over time). If the data is non-stationary, transformations may be needed,
such as differencing or logarithmic scaling.
2. Overfitting and Underfitting:


a. Balancing the complexity of the model is crucial to avoid overfitting or


underfitting. Regularization techniques and careful model selection are
essential.
3. Handling Seasonality and Trends:
a. Seasonal and trend components often exist in time series data. Properly
identifying and modeling these components is crucial for accurate
forecasting.
4. Feature Lagging and Rolling Windows:
a. Incorporating lagged features (past observations) and using rolling windows
for moving averages can help the model capture temporal patterns.
5. Handling Outliers and Anomalies:
a. Outliers and anomalies in the data can significantly impact forecasting
accuracy. Robust models and preprocessing techniques are necessary to
handle such situations.
Time series forecasting is a dynamic and evolving field, with continuous advancements and
improvements in techniques. The choice of the most suitable method depends on the
specific characteristics of the time series data and the requirements of the forecasting task.
Regular model evaluation and updating based on new data are essential to maintain
forecasting accuracy over time.

3.5 CLUSTERING IN MACHINE LEARNING


Clustering or cluster analysis is a machine learning technique, which groups the unlabelled
dataset. It can be defined as " A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in
a group that has less or no similarities with another group."
It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color,
behavior, etc., and divides them as per the presence and absence of those similar patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm,
and it deals with the unlabeled dataset.
After applying this clustering technique, each cluster or group is provided with a cluster-ID.
ML system can use this id to simplify the processing of large and complex datasets.
The clustering technique is commonly used for statistical data analysis.
Example: Let's understand the clustering technique with the real-world example of Mall:
When we visit any shopping mall, we can observe that the things with similar usage are
grouped together. Such as the t-shirts are grouped in one section, and trousers are at other
sections, similarly, at vegetable sections, apples, bananas, Mangoes, etc., are grouped in
separate sections, so that we can easily find out the things. The clustering technique also
works in the same way. Other examples of clustering are grouping documents according to
the topic.
The clustering technique can be widely used in various tasks. Some most common uses of
this technique are:
• Market Segmentation


• Statistical data analysis


• Social network analysis
• Image segmentation
• Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to
provide recommendations based on a user's past product searches. Netflix also uses this
technique to recommend movies and web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see the
different fruits are divided into several groups with similar properties.

Types of Clustering Methods


The clustering methods are broadly divided into Hard clustering (datapoint belongs to only
one group) and Soft Clustering (data points can belong to another group also). But there
are also other various approaches of Clustering exist. Below are the main clustering methods
used in Machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known
as the centroid-based method. The most common example of partitioning clustering is
the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define the
number of pre-defined groups. The cluster center is created in such a way that the distance
between the data points of one cluster is minimum as compared to another cluster centroid.


Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected.
This algorithm does it by identifying different clusters in the dataset and connects the areas
of high densities into clusters. The dense areas in data space are divided from each other by
sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.

Distribution Model-Based Clustering


In the distribution model-based clustering method, the data is divided based on the
probability of how a dataset belongs to a particular distribution. The grouping is done by
assuming some distributions commonly Gaussian Distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that
uses Gaussian Mixture Models (GMM).


Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is
no requirement of pre-specifying the number of clusters to be created. In this technique, the
dataset is divided into clusters to create a tree-like structure, which is also called
a dendrogram. The observations or any number of clusters can be selected by cutting the
tree at the correct level. The most common example of this method is the Agglomerative
Hierarchical algorithm.

Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than
one group or cluster. Each dataset has a set of membership coefficients, which depend on
the degree of membership to be in a cluster. Fuzzy C-means algorithm is the example of
this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms
The Clustering algorithms can be divided based on their models that are explained above.
There are different types of clustering algorithms published, but only a few are commonly
used. The clustering algorithm is based on the kind of data that we are using. Such as, some


algorithms need to guess the number of clusters in the given dataset, whereas some are
required to find the minimum distance between the observation of the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely used in machine
learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of
equal variances. The number of clusters must be specified in this algorithm. It is fast
with fewer computations required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the
smooth density of data points. It is an example of a centroid-based model, that
works on updating the candidates for centroid to be the center of the points within
a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar to the
mean-shift, but with some remarkable advantages. In this algorithm, the areas of
high density are separated by the areas of low density. Because of this, the clusters
can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used
as an alternative for the k-means algorithm or for those cases where K-means can
be failed. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs the bottom-up hierarchical clustering. In this, each data point is treated as
a single cluster at the outset and then successively merged. The cluster hierarchy can
be represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not
require to specify the number of clusters. In this, each data point sends a message
between the pair of data points until convergence. It has O(N²T) time complexity,
which is the main drawback of this algorithm.
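A brief illustrative sketch of two of the algorithms above, using a toy blob dataset (an assumption for demonstration): K-Means, which must be told the number of clusters, and DBSCAN, which discovers clusters of arbitrary shape and labels noise points with -1.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Toy data: 300 points around 3 centres
x, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(x)
dbscan_labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(x)

print("K-Means cluster IDs found:", sorted(set(kmeans_labels)))
print("DBSCAN cluster IDs found:", sorted(set(dbscan_labels)))   # -1 marks noise points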

Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
• In Identification of Cancer Cells: The clustering algorithms are widely used
for the identification of cancerous cells. It divides the cancerous and non-
cancerous data sets into different groups.
• In Search Engines: Search engines also work on the clustering technique. The
search result appears based on the closest object to the search query. It does it
by grouping similar data objects in one group that is far from the other
dissimilar objects. The accurate result of a query depends on the quality of the
clustering algorithm used.
• Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.


• In Biology: It is used in the biology stream to classify different species of


plants and animals using the image recognition technique.
• In Land Use: The clustering technique is used for identifying areas of similar land
use in a GIS database. This can be very useful for deciding for what purpose a
particular piece of land should be used, i.e., for which purpose it is most suitable.

3.6 PRINCIPAL COMPONENT ANALYSIS


Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help
of orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique to draw strong patterns from the given dataset by
reducing the variances.
PCA generally tries to find the lower-dimensional surface to project the high-dimensional
data.
PCA works by considering the variance of each attribute, because an attribute with high variance
shows a good split between the classes, and hence it reduces the dimensionality. Some real-world
applications of PCA are image processing, movie recommendation system, optimizing
the power allocation in various communication channels. It is a feature extraction
technique, so it contains the important variables and drops the least important variable.

The PCA algorithm is based on some mathematical concepts such as:


• Variance and Covariance
• Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:


• Dimensionality: It is the number of features or variables present in the given
dataset. More easily, it is the number of columns present in the dataset.
• Correlation: It signifies that how strongly two variables are related to each
other. Such as if one changes, the other variable also gets changed. The
correlation value ranges from -1 to +1. Here, -1 occurs if variables are inversely
proportional to each other, and +1 indicates that variables are directly
proportional to each other.
• Orthogonal: It defines that variables are not correlated to each other, and
hence the correlation between the pair of variables is zero.
• Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an
eigenvector of M if Mv is a scalar multiple of v.
• Covariance Matrix: A matrix containing the covariance between the pair of
variables is called the Covariance Matrix.


Principal Components in PCA


As described above, the transformed new features or the output of PCA are the Principal
Components. The number of these PCs are either equal to or less than the original features
present in the dataset. Some properties of these principal components are given below:
• The principal component must be the linear combination of the original
features.
• These components are orthogonal, i.e., the correlation between a pair of
variables is zero.
• The importance of each component decreases when going from 1 to n; the 1st PC
has the most importance, and the nth PC has the least importance.

Steps for PCA algorithm


1. Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y,
where X is the training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset into a structure. Such as we will represent the
two-dimensional matrix of independent variable X. Here each row corresponds to
the data items, and the column corresponds to the Features. The number of
columns is the dimensions of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. Such as in a particular column, the
features with high variance are more important compared to the features with lower
variance.
If the importance of features is independent of the variance of the feature, then we
will divide each data item in a column with the standard deviation of the column.
Here we will name the matrix as Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will transpose it.
After transpose, we will multiply it by Z. The output matrix will be the Covariance
matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors for the resultant
covariance matrix of Z. The eigenvectors of the covariance matrix are the directions of
the axes with the most information (variance), and the eigenvalue associated with each
eigenvector measures how much variance lies along that direction.
6. Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and sort them in decreasing order,
which means from largest to smallest, and simultaneously sort the eigenvectors
accordingly into a matrix P. The resultant sorted matrix will be named P*.
7. Calculating the new features Or Principal Components
Here we will calculate the new features. To do this, we will multiply the P* matrix to


the Z. In the resultant matrix Z*, each observation is the linear combination of
original features. Each column of the Z* matrix is independent of each other.
8. Remove less or unimportant features from the new dataset.
The new feature set has occurred, so we will decide here what to keep and what to
remove. It means, we will only keep the relevant or important features in the new
dataset, and unimportant features will be removed out.
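The steps above can be traced almost line by line with NumPy. The random data matrix X below is an illustrative assumption (rows are data items, columns are features); in practice, scikit-learn's PCA class performs the same reduction in a couple of lines.

import numpy as nm

rng = nm.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # Steps 1-2: dataset as a matrix

Z = (X - X.mean(axis=0)) / X.std(axis=0)            # Step 3: standardize each column
cov = nm.cov(Z, rowvar=False)                       # Step 4: covariance matrix of Z

eig_values, eig_vectors = nm.linalg.eigh(cov)       # Step 5: eigenvalues and eigenvectors
order = nm.argsort(eig_values)[::-1]                # Step 6: sort from largest to smallest
P_star = eig_vectors[:, order]

Z_star = Z @ P_star                                 # Step 7: new features (principal components)
k = 2                                               # Step 8: keep only the top k components
Z_reduced = Z_star[:, :k]
print("Reduced shape:", Z_reduced.shape)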

Applications of Principal Component Analysis


• PCA is mainly used as the dimensionality reduction technique in various AI
applications such as computer vision, image compression, etc.
• It can also be used for finding hidden patterns if data has high dimensions.
Some fields where PCA is used are Finance, data mining, Psychology, etc.

Advantages of Principal Component Analysis


1. Dimensionality Reduction: Principal Component Analysis is a popular
technique used for dimensionality reduction, which is the process of reducing the
number of variables in a dataset. By reducing the number of variables, PCA
simplifies data analysis, improves performance, and makes it easier to visualize
data.
2. Feature Selection: Principal Component Analysis can be used for feature
selection, which is the process of selecting the most important variables in a
dataset. This is useful in machine learning, where the number of variables can be
very large, and it is difficult to identify the most important variables.
3. Data Visualization: Principal Component Analysis can be used for data
visualization. By reducing the number of variables, PCA can plot high-dimensional
data in two or three dimensions, making it easier to interpret.
4. Multicollinearity: Principal Component Analysis can be used to deal
with multicollinearity, which is a common problem in a regression analysis where
two or more independent variables are highly correlated. PCA can help identify
the underlying structure in the data and create new, uncorrelated variables that can
be used in the regression model.
5. Noise Reduction: Principal Component Analysis can be used to reduce the noise
in data. By removing the principal components with low variance, which are
assumed to represent noise, Principal Component Analysis can improve the
signal-to-noise ratio and make it easier to identify the underlying structure in the
data.
6. Data Compression: Principal Component Analysis can be used for data
compression. By representing the data using a smaller number of principal
components, which capture most of the variation in the data, PCA can reduce the
storage requirements and speed up processing.
7. Outlier Detection: Principal Component Analysis can be used for outlier
detection. Outliers are data points that are significantly different from the other
data points in the dataset. Principal Component Analysis can identify these
outliers by looking for data points that are far from the other points in the
principal component space.

Disadvantages of Principal Component Analysis


1. Interpretation of Principal Components: The principal components created by
Principal Component Analysis are linear combinations of the original variables,
and it is often difficult to interpret them in terms of the original variables. This
can make it difficult to explain the results of PCA to others.
2. Data Scaling: Principal Component Analysis is sensitive to the scale of the data.
If the data is not properly scaled, then PCA may not work well. Therefore, it is
important to scale the data before applying Principal Component Analysis.
3. Information Loss: Principal Component Analysis can result in information loss.
While Principal Component Analysis reduces the number of variables, it can also
lead to loss of information. The degree of information loss depends on the
number of principal components selected. Therefore, it is important to carefully
select the number of principal components to retain.
4. Non-linear Relationships: Principal Component Analysis assumes that the
relationships between variables are linear. However, if there are non-linear
relationships between variables, Principal Component Analysis may not work well.
5. Computational Complexity: Computing Principal Component Analysis can be
computationally expensive for large datasets. This is especially true if the number
of variables in the dataset is large.
6. Overfitting: Principal Component Analysis can sometimes result in overfitting,
which is when the model fits the training data too well and performs poorly on
new data. This can happen if too many principal components are used or if the
model is trained on a small dataset.

===000===

4. MODEL DIAGNOSTICS AND TUNING IN MACHINE LEARNING

Model diagnostics and tuning are crucial steps in the machine learning pipeline to ensure that
your model is performing at its best. These steps involve evaluating the model's
performance, identifying issues, and optimizing its hyperparameters. Here's a breakdown of
the processes involved:
Model Diagnostics:
1. Model Evaluation Metrics:
a. Choose appropriate evaluation metrics based on the problem type. For
classification tasks, metrics like accuracy, precision, recall, F1 score, and
ROC AUC are commonly used. For regression tasks, metrics like mean
squared error (MSE), mean absolute error (MAE), and R-squared are
common.
2. Cross-Validation:
a. Implement k-fold cross-validation to assess the model's performance.
Cross-validation helps estimate the model's generalization performance and
detect issues like overfitting.
3. Confusion Matrix and ROC Curve:
a. For classification tasks, create a confusion matrix and ROC curve to
understand the model's performance in more detail. This can help identify
issues like class imbalance or misclassification errors.
4. Bias-Variance Trade-off:
a. Analyze the bias-variance trade-off to find the right balance. High bias
(underfitting) occurs when the model is too simple, and high variance
(overfitting) occurs when the model is too complex. Adjust the model's
complexity accordingly.
5. Learning Curve:
a. Plot learning curves to visualize how the model's performance changes with
increasing training data. Learning curves help identify issues related to data
size and model convergence.
6. Residual Analysis:
a. In regression tasks, analyze the residuals (the differences between predicted
and actual values) to check for patterns, heteroscedasticity, or nonlinearity.
7. Feature Importance:
a. Evaluate the importance of features to determine which variables have the
most impact on the model's predictions. This can help you understand the
model's decision-making process.
8. Visualization:
a. Use visualization techniques to inspect the model's performance, feature
relationships, and data distributions. Visualization can help detect anomalies
and potential issues.
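As a brief illustration of points 1 to 3 above, the following sketch, assuming scikit-learn and one of its built-in toy datasets (the choice of model is arbitrary), computes cross-validated accuracy, a confusion matrix, and a per-class report for a simple classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation to estimate generalization performance
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Fit once and inspect the confusion matrix and per-class metrics
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))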

Hyperparameter Tuning:
1. Grid Search and Random Search:
a. Grid search and random search are techniques to find the best
hyperparameters for your model. Grid search exhaustively explores
predefined hyperparameter combinations, while random search randomly
samples from a predefined range of hyperparameters. These methods help
optimize the model's performance.
2. Cross-Validation for Hyperparameter Tuning:
a. Apply cross-validation during hyperparameter tuning to ensure that the
selected hyperparameters generalize well. Use k-fold cross-validation to
estimate the performance of different hyperparameter combinations.
3. Hyperparameter Optimization Libraries:
a. Employ specialized libraries like scikit-learn's GridSearchCV and
RandomizedSearchCV, or more advanced libraries like Optuna or
Hyperopt, to automate the hyperparameter search process.
4. Learning Rate Schedules (for Neural Networks):
a. When working with neural networks, learning rate schedules can be used to
adapt the learning rate during training. Techniques like learning rate
annealing or cyclic learning rates can improve convergence.
5. Regularization Techniques:
a. Utilize regularization techniques such as L1 (Lasso) or L2 (Ridge)
regularization to control overfitting. The choice of regularization strength
should be part of the tuning process.
6. Ensemble Models:
a. Experiment with ensemble techniques like bagging (e.g., Random Forests)
and boosting (e.g., Gradient Boosting) to combine multiple models for
improved performance.
7. Feature Engineering:
a. Consider modifying, engineering, or transforming features to improve
model performance. Feature engineering can involve creating interactions,
encoding categorical data, and dimensionality reduction.
8. Feature Scaling:
a. Ensure that feature scaling is appropriate for the model. Some algorithms,
like k-nearest neighbors and support vector machines, are sensitive to
feature scales.
9. Early Stopping (for Neural Networks):
a. Implement early stopping to halt training when the model starts to overfit.
Early stopping can prevent unnecessary training epochs and save time.
10. Validation Set for Hyperparameter Tuning:
a. Reserve a separate validation set for hyperparameter tuning to avoid data
leakage from the test set and obtain an unbiased evaluation of the model.
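A compact sketch of grid search with cross-validation using scikit-learn's GridSearchCV is shown below; the estimator (an SVM) and the parameter grid are arbitrary choices made only for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate hyperparameter combinations to search over
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}

# Exhaustive grid search with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV score: %.3f" % search.best_score_)
print("Test score: %.3f" % search.score(X_test, y_test))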
Model diagnostics and hyperparameter tuning are iterative processes that require careful
consideration of the problem, data, and model characteristics. By diagnosing model issues
and optimizing hyperparameters, you can fine-tune your machine learning model to achieve
the best possible performance.

4.1 BIAS AND VARIANCE


Machine learning is a branch of Artificial Intelligence that allows machines to perform data analysis and make predictions. However, if the machine learning model is not accurate, it can make prediction errors, and these prediction errors are usually known as bias and variance. In machine learning, these errors will always be present, as there is always a slight difference between the model's predictions and the actual values. The main aim of ML/data science analysts is to reduce these errors in order to get more accurate results. In this topic, we are going to discuss bias and variance, the bias-variance trade-off, underfitting and overfitting. But before starting, let's first understand what errors in machine learning are.

Errors in Machine Learning?


In machine learning, an error is a measure of how accurately an algorithm can make
predictions for the previously unknown dataset. On the basis of these errors, the machine
learning model is selected that can perform best on the particular dataset. There are mainly
two types of errors in machine learning, which are:
o Reducible errors: These errors can be reduced to improve the model accuracy.
Such errors can further be classified into bias and Variance.

o Irreducible errors: These errors will always be present in the model regardless of which algorithm has been used. They are caused by unknown variables whose influence on the output cannot be reduced.

What is Bias?
In general, a machine learning model analyses the data, finds patterns in it and makes predictions. While training, the model learns these patterns in the dataset and applies them to test data for prediction. While making predictions, a difference occurs between the prediction values made by the model and the actual/expected values, and this difference is known as bias error or error due to bias. It can be defined as the inability of machine learning algorithms such as Linear Regression to capture the true relationship between the data points. Each algorithm begins with some amount of bias, because bias arises from assumptions in the model that make the target function simpler to learn. A model has either:
• Low Bias: A low bias model will make fewer assumptions about the form of
the target function.
• High Bias: A model with a high bias makes more assumptions, and the model
becomes unable to capture the important features of our dataset. A high bias
model also cannot perform well on new data.
Generally, a linear algorithm has a high bias, which is what lets it learn fast. The simpler the algorithm, the more bias it is likely to introduce, whereas a nonlinear algorithm often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support Vector Machines. At the same time, algorithms with high bias are Linear Regression, Linear Discriminant Analysis and Logistic Regression.
Ways to reduce High Bias:
High bias mainly occurs due to a much simple model. Below are some ways to reduce the
high bias:
• Increase the input features as the model is underfitted.
• Decrease the regularization term.
• Use more complex models, such as including some polynomial features.

What is a Variance Error?


Variance specifies the amount by which the prediction would change if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between input and output variables. Variance errors are classified as either low variance or high variance.
Low variance means there is a small variation in the prediction of the target function with changes in the training dataset, while high variance shows a large variation in the prediction of the target function with changes in the training dataset.
A model that shows high variance learns a lot and performs well with the training dataset but does not generalize well to unseen data. As a result, such a model gives good results with the training dataset but shows high error rates on the test dataset.

Since, with high variance, the model learns too much from the dataset, it leads to overfitting of the model. A model with high variance has the following problems:
o A high variance model leads to overfitting.
o It increases model complexity.
Usually, nonlinear algorithms, which have a lot of flexibility to fit the model, have high variance.

Some examples of machine learning algorithms with low variance are, Linear Regression,
Logistic Regression, and Linear discriminant analysis. At the same time, algorithms
with high variance are decision tree, Support Vector Machine, and K-nearest
neighbours.

Ways to Reduce High Variance:


• Reduce the input features or number of parameters as a model is overfitted.
• Do not use a much complex model.
• Increase the training data.
• Increase the Regularization term.

Different Combinations of Bias-Variance


There are four possible combinations of bias and variance, which are described below:

1. Low-Bias, Low-Variance:
The combination of low bias and low variance shows an ideal machine learning model. However, it is not practically possible.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters and hence leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses few parameters. It leads to underfitting problems in the model.
4. High-Bias, High-Variance: With high bias and high variance, predictions are inconsistent and also inaccurate on average.

How to identify High variance or High Bias?


High variance can be identified if the model has:

o Low training error and high test error.


High Bias can be identified if the model has:
o High training error and the test error is almost similar to training error.
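These two signatures can be checked directly by comparing training and test error. The sketch below, assuming scikit-learn and made-up regression data, contrasts a very shallow decision tree (typically high bias) with a fully grown one (typically high variance).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=300, n_features=10, noise=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for depth in (1, None):  # depth=1 -> very simple tree, None -> fully grown tree
    model = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # High training and test error       -> high bias (underfitting)
    # Low training error, high test error -> high variance (overfitting)
    print(f"max_depth={depth}: train MSE={train_err:.1f}, test MSE={test_err:.1f}")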

Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the model
has a large number of parameters, it will have high variance and low bias. So, it is required to
make a balance between bias and variance errors, and this balance between the bias error and
variance error is known as the Bias-Variance trade-off.

For an accurate prediction of the model, algorithms need a low variance and low bias. But
this is not possible because bias and variance are related to each other:
• If we decrease the variance, it will increase the bias.
• If we decrease the bias, it will increase the variance.
The bias-variance trade-off is a central issue in supervised learning. Ideally, we need a model that accurately captures the regularities in the training data and simultaneously generalizes well to unseen data. Unfortunately, doing both at once is not possible: a high variance algorithm may perform well on training data but may overfit to noisy data, whereas a high bias algorithm produces a much simpler model that may not even capture the important regularities in the data. So, we need to find a sweet spot between bias and variance to make an optimal model.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance
between bias and variance errors.

4.2 K-FOLD CROSS-VALIDATION


K-fold cross-validation approach divides the input dataset into K groups of samples of equal
sizes. These samples are called folds. For each learning set, the prediction function uses k-1
folds, and the rest of the folds are used for the test set. This approach is a very popular CV
approach because it is easy to understand, and the output is less biased than other methods.
The steps for k-fold cross-validation are:
• Split the input dataset into K groups
• For each group:
o Take one group as the reserve or test data set.
o Use remaining groups as the training dataset
o Fit the model on the training set and evaluate the performance of the
model using the test set.
Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. On the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train the model. On the 2nd iteration, the second fold is used to test the model, and the rest are used to train the model. This process continues until each fold has been used exactly once as the test fold.
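The 5-fold procedure described above can be reproduced with scikit-learn's KFold and cross_val_score; the dataset and the estimator in the sketch below are arbitrary choices for illustration.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Split the data into 5 folds; each fold serves exactly once as the test set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print("Scores per fold:", scores)
print("Mean accuracy: %.3f" % scores.mean())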

Stratified k-fold cross-validation

This technique is similar to k-fold cross-validation, with a few small changes. It is based on the concept of stratification, which is the process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset. It is one of the best approaches for dealing with bias and variance.
It can be understood with the example of housing prices, where the price of some houses can be much higher than that of others. To handle such situations, a stratified k-fold cross-validation technique is useful.
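A small sketch of stratification, assuming scikit-learn and a made-up imbalanced label vector, is given below; each test fold keeps roughly the same class proportions as the full dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 90 samples of class 0 and 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Every test fold keeps roughly the same 90/10 class ratio as the whole dataset
    print(f"Fold {fold}: class counts in test set =", np.bincount(y[test_idx]))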

Holdout Method
This method is the simplest cross-validation technique of all. In this method, we hold out a subset of the training data and use it to obtain prediction results after training the model on the rest of the dataset.
The error that occurs in this process tells how well our model will perform with an unknown dataset. Although this approach is simple to perform, it still faces the issue of high variance, and it sometimes produces misleading results.
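The holdout method corresponds to a single train_test_split call in scikit-learn, as in the brief sketch below; the dataset and model are arbitrary illustrations.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data; train on the remaining 70%
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout accuracy: %.3f" % model.score(X_holdout, y_holdout))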

Comparison of Cross-validation to train/test split in Machine Learning


• Train/test split: The input data is divided into two parts, the training set and the test set, at a ratio such as 70:30 or 80:20. This approach can have high variance, which is one of its biggest disadvantages.
o Training Data: The training data is used to train the model, and the
dependent variable is known.
o Test Data: The test data is used to make predictions from the model that has already been trained on the training data. It has the same features as the training data but is not part of it.
• Cross-Validation dataset: It is used to overcome the disadvantage of
train/test split by splitting the dataset into groups of train/test splits, and
averaging the result. It can be used if we want to optimize our model that has
been trained on the training dataset for the best performance. It is more
efficient as compared to train/test split as every observation is used for the
training and testing both.

Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
• Under ideal conditions, it provides optimal output, but for inconsistent data it may produce drastically different results. This is one of the big disadvantages of cross-validation, as there is no certainty about the type of data in machine learning.
• In predictive modeling, the data evolves over time, due to which differences may appear between the training set and the validation sets. For example, if we create a model for the prediction of stock market values and the data is trained on the previous 5 years' stock values, the realistic future values for the next 5 years may be drastically different, so it is difficult to expect the correct output in such situations.

Applications of Cross-Validation
• This technique can be used to compare the performance of different predictive
modeling methods.
• It has great scope in the medical research field.
• It can also be used for the meta-analysis, as it is already being used by the data
scientists in the field of medical statistics.

4.3 BAGGING MACHINE LEARNING


In this tutorial, we discuss bagging in machine learning. Bagging, or bootstrap aggregation, is an ensemble learning method commonly used to reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is selected with replacement, meaning that the individual data points may be chosen more than once. After several data samples are generated, these weak models are trained independently, and depending on the type of task (regression or classification), the average or the majority of those predictions yields a more accurate estimate. As a note, the random forest algorithm is considered an extension of the bagging method, using both bagging and feature randomness to create an uncorrelated forest of decision trees.
Bagging is an ensemble approach that tries to resolve overfitting for classification or regression problems. It aims to improve the accuracy and performance of machine learning algorithms by taking random subsets of the original dataset, with replacement, and fitting either a classifier (for classification) or a regressor (for regression) to each subset. Bagging is also known as bootstrap aggregating. It is used to address the bias-variance trade-off and reduces the variance of a prediction model. Bagging avoids overfitting of the data and is used for both regression and classification, in particular for decision tree algorithms.

What is Ensemble Learning?


Ensemble learning gives credence to the idea of the "wisdom of crowds", which suggests that the decision-making of a larger group of people is typically better than that of an individual expert. In the same spirit, ensemble learning refers to a collection (or ensemble) of base learners or models that work together to achieve a better final prediction. A single model, also called a base or weak learner, may not perform well due to high variance or bias. But when weak learners are aggregated, they can form a strong learner, as their combination reduces bias or variance, yielding better model performance. Ensemble learning is a widely used and preferred machine learning technique in which multiple individual models, often referred to as base models, are combined to produce an effective optimal prediction model. An example of ensemble learning is the Random Forest algorithm.

Ensemble learning is frequently illustrated using decision trees, as this algorithm can be prone to overfitting (high variance and low bias) when it has not been pruned. It can also lend itself to underfitting (low variance and high bias) when it is very small, like a decision stump, which is a decision tree with one level. When an algorithm overfits or underfits its training set, it cannot generalize well to new datasets, so ensemble methods are used to counteract this behavior and allow the model to generalize to new datasets. While decision trees can exhibit high variance or high bias, it is worth noting that they are not the only modelling approach that leverages ensemble learning to find the "sweet spot" in the bias-variance trade-off.

What is the difference between Bagging and Boosting?


There are some differences between bagging and boosting, which are the two principal forms of ensemble learning methods. The main difference between these two learning methods is the way they are trained. In the bagging technique, weak learners are trained in parallel, but in boosting they are trained sequentially. This means that a series of models is constructed, and with each new model iteration, the weights of the data misclassified by the previous model are increased. This redistribution of weights helps the algorithm identify the parameters it needs to focus on to improve its performance. AdaBoost, which stands for "adaptive boosting algorithm", is one of the most famous boosting algorithms and was one of the first of its kind. Other kinds of boosting algorithms include XGBoost, GradientBoost, and BrownBoost.
Another difference between bagging and boosting is the scenarios in which they are used. Bagging methods are usually applied to weak learners that exhibit high variance and low bias, whereas boosting methods are leveraged when low variance and high bias are observed.

Differences between bagging and boosting:

• Bagging is the simplest way of combining predictions that belong to the same type, whereas boosting combines predictions that belong to different types.
• The main task of bagging is to decrease variance, not bias; the main task of boosting is to decrease bias, not variance.
• In bagging, each model receives equal weight; in boosting, models are weighted according to their performance.
• In bagging, each model is built independently; in boosting, each new model is built depending on the previous ones.
• In bagging, training data subsets are selected using row sampling with replacement (random sampling) from the whole training dataset; in boosting, each new subset contains the elements that were misclassified by the previous models.
• Bagging tries to solve the overfitting problem; boosting tries to reduce bias.
• If the classifier is unstable (high variance), apply bagging; if the classifier is stable and simple (high bias), apply boosting.
• In bagging, the base classifiers are trained in parallel; in boosting, the base classifiers are trained sequentially.
• Example: the Random Forest model uses bagging, while AdaBoost uses the boosting technique.

What are the similarities between Bagging and Boosting?


Bagging and boosting are both commonly used methods with the general similarity of being classed as ensemble methods. Briefly, the similarities between bagging and boosting are:
1. Both are ensemble techniques that obtain N learners from a single learner.
2. Both generate several training data sets through random sampling.
3. Both make the final decision by averaging the N learners (or by taking the majority of them, i.e., majority voting).
4. Both are good at reducing variance and offer better stability.

Describe the Bagging Technique:


Assume a set D of d tuples. At each iteration i, a training set Di of d tuples is selected through row sampling with replacement (i.e., there may be repeated elements from the original d tuples) from D (i.e., a bootstrap sample). Then a classifier model Mi is learned for each training set Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X (an unknown sample).

What are the Implementation Steps of Bagging?


o Step 1: Multiple subsets are created from the original dataset with an equal number of tuples, selecting observations with replacement.
o Step 2: A base model is created on each of these subsets.
o Step 3: Each model is learned in parallel on its own training set, independently of the others.
o Step 4: The final predictions are determined by combining the predictions from all the models.

Applications of Bagging:


There are various applications of bagging, which are given below -
1. IT:
Bagging can improve the precision and accuracy of IT systems, such as network intrusion detection systems, where it can enhance detection accuracy and reduce the rate of false positives.
2. Environment:
Ensemble methods, such as bagging, have been applied in the field of remote sensing; for example, they have been used to map the types of wetlands within a coastal landscape.
3. Finance:
Bagging has also been leveraged with deep learning models in the finance industry, automating essential tasks such as fraud detection, credit risk evaluation, and option pricing problems. Bagging, among other machine learning techniques, has been used to assess loan default risk, and it helps limit losses by preventing credit card fraud within banking and financial institutions.
4. Healthcare:
Bagging has been used to form medical data predictions. Ensemble techniques have been used for various bioinformatics problems, such as gene and protein selection to identify a specific trait of interest, and in particular to predict the onset of diabetes based on various risk predictors.

What are the Advantages and Disadvantages of Bagging?


Advantages of Bagging are -
There are many advantages of Bagging, which are given below -
1. Easier implementation:
Python libraries, including scikit-learn (sklearn), make it easy to combine the predictions of base learners or estimators to enhance model performance. Their documentation outlines the available modules you can leverage for model optimization.
2. Variance reduction:
Bagging can reduce the variance within a learning algorithm. This is especially helpful with high-dimensional data, where missing values can lead to higher variance, making the model more prone to overfitting and preventing accurate generalization to new datasets.
Disadvantages of Bagging are -
There are also some disadvantages of Bagging, which are given below -
1. Less flexible:
As a method, Bagging works particularly well with algorithms that are less stable. Algorithms that are more stable, or that suffer from high amounts of bias, do not provide as much benefit, as there is less variation within the dataset of the model. As noted in the hands-on guide for machine learning, "bagging a linear regression model will effectively just return the original predictions for large enough b."
2. Computationally expensive:
Bagging slows down and grows more intensive as the number of iterations increases. Accordingly, it is not well suited to real-time applications. Clustered systems or large processing cores are ideal for quickly building bagged ensembles on large test sets.
3. Loss of interpretability:
It is difficult to draw precise business insights through Bagging because of the averaging involved across predictions. While the output is more precise than any individual data point, a more accurate or complete dataset may yield greater precision within a single classification or regression model.

Bagging classifier example:


Example:
Here we give an example of a bagging classifier using python. The example is given below -
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier

data = datasets.load_wine(as_frame = True)
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 22)

estimator_range = [2,4,6,8,10,12,14,16,18,20]
models = []
scores = []
for n_estimators in estimator_range:
    # Create a bagging classifier
    clf = BaggingClassifier(n_estimators = n_estimators, random_state = 22)
    # Fit the model
    clf.fit(X_train, y_train)
    # Append the model and score to their respective lists
    models.append(clf)
    scores.append(accuracy_score(y_true = y_test, y_pred = clf.predict(X_test)))

# Generate the plot of the scores against the number of estimators
plt.figure(figsize=(9,6))
plt.plot(estimator_range, scores)
# Adjust labels and font (to make them visible)
plt.xlabel("n_estimators", fontsize = 18)
plt.ylabel("score", fontsize = 18)
plt.tick_params(labelsize = 16)
# Show the plot
plt.show()
Output:
By iterating through different values for the number of estimators, we can see an increase in model performance from 82.2% to 95.5%. After 14 estimators the accuracy begins to drop, and again, if you set a different random_state, the values you see will vary. This is why cross-validation is good practice to ensure stable results. In this case, we see a 13.3% increase in accuracy with respect to identifying the type of wine. Now we compile the above program and run it; the output is a plot of accuracy against the number of estimators.

Another form for the Evaluation:


As bootstrapping chooses random subsets of observations to create classifiers, some observations are left out of the selection process. These "out-of-bag" observations can then be used to evaluate the model, in addition to a held-out test set. Remember that out-of-bag estimation can overestimate the error in binary classification problems and should be used as a complement to other metrics. We saw in the previous exercise that 12 estimators yielded the highest accuracy, so we can use that to create our model, this time setting the parameter oob_score to True to evaluate the model with the out-of-bag score.
Example:
Here we give an example of out-of-bag evaluation using Python. The example is given below -
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
data = datasets.load_wine(as_frame = True)
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 27)
oob_model = BaggingClassifier(n_estimators = 16, oob_score = True,random_state = 27)
oob_model.fit(X_train, y_train)

print(oob_model.oob_score_)
Output:
Now we compile the above program and run it. The output is shown below -
0.8951612903225806

4.4 RANDOM FOREST ALGORITHM


Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, " Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset." Instead of relying on one decision tree,
the random forest takes the prediction from each tree and based on the majority votes of
predictions, and it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:

Assumptions for Random Forest


Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a
better Random forest classifier:
• There should be some actual values in the feature variable of the dataset so that
the classifier can predict accurate results rather than a guessed result.

• The predictions from each tree must have very low correlations.

Why use Random Forest?


Below are some points that explain why we should use the Random Forest algorithm:
• It takes less training time as compared to other algorithms.
• It predicts output with high accuracy, even for the large dataset it runs
efficiently.
• It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?


Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.
The working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given to the Random Forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, the Random Forest classifier predicts the final decision based on the majority of results.
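A minimal sketch of this workflow with scikit-learn's RandomForestClassifier is given below; the Iris dataset and the parameter values are arbitrary choices for illustration.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# N = 100 decision trees, each trained on a bootstrap sample of the training set;
# the final class is decided by a majority vote across the trees
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
print("Prediction for one new sample:", forest.predict(X_test[:1]))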

Applications of Random Forest


There are mainly four sectors where Random forest mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan
risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease
can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Advantages of Random Forest
o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest
o Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.

4.5 GRADIENT BOOSTING IN MACHINE LEARNING


Gradient Boosting is a powerful machine learning technique used for both regression and
classification tasks. It is an ensemble learning method that builds a strong predictive model
by combining the predictions of multiple weak models, typically decision trees, in a
sequential manner. Gradient Boosting is widely regarded for its high predictive accuracy and
flexibility. The most popular implementations of Gradient Boosting include Gradient
Boosting Machines (GBM), XGBoost, LightGBM, and CatBoost.
Here's an overview of how Gradient Boosting works and its key concepts:

Key Concepts in Gradient Boosting:


1. Ensemble Learning:
a. Gradient Boosting is part of the ensemble learning family. It combines
multiple base models (typically decision trees) to create a strong
predictive model by reducing bias and variance.
2. Boosting:
a. In boosting, each base model focuses on the mistakes of the previous
models, gradually reducing the model's error. This sequential process is
where the "boosting" in Gradient Boosting comes from.
3. Weak Learners:
a. Gradient Boosting works well with weak learners, which are models
that perform slightly better than random guessing. Decision trees with
limited depth (stumps) are commonly used as weak learners.
4. Loss Function and Gradients:
a. Gradient Boosting minimizes a loss function, such as Mean Squared
Error (MSE) for regression or Log Loss (cross-entropy) for
classification. Gradients of the loss function are used to guide the
model's optimization process.

The Gradient Boosting Process:


1. Initialize Model:
a. Start with an initial model (often a simple model like a single decision
stump), which serves as the baseline.
2. Calculate Residuals:
a. Compute the residuals, which are the differences between the actual
target values and the predictions made by the current model.
3. Fit a Weak Learner:
a. Train a weak learner (e.g., decision tree) to predict the residuals. The
new model's predictions are added to the previous model's predictions,
effectively correcting the mistakes made by the previous model.
4. Update Model:
a. The new model is combined with the previous model, and the process
repeats. The contributions of each model are weighted, and learning
rates control how much each model's predictions influence the final
prediction.
5. Iterate:
a. The boosting process iterates until a predefined number of models
(trees) are built or until a stopping criterion (e.g., a target accuracy level)
is reached.
6. Predictions:
a. The final prediction is the sum of predictions made by all the models.
For regression, this represents the cumulative sum of residuals, and for
classification, it can be interpreted as probabilities.
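A minimal sketch using scikit-learn's GradientBoostingClassifier, which follows this sequential process, is shown below; the dataset and the hyperparameter values are arbitrary choices for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 shallow trees built sequentially; learning_rate scales each tree's contribution
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=2, random_state=0)
gbm.fit(X_train, y_train)

print("Test accuracy: %.3f" % gbm.score(X_test, y_test))
print("Class probabilities for one sample:", gbm.predict_proba(X_test[:1]))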

Advantages of Gradient Boosting:


1. High Predictive Accuracy: Gradient Boosting often produces highly accurate
models that outperform other algorithms.
2. Robust to Overfitting: By focusing on the errors made by previous models,
Gradient Boosting is less prone to overfitting compared to models like decision
trees.
3. Flexibility: It can handle both regression and classification tasks and supports
various loss functions.
4. Feature Importance: Gradient Boosting provides feature importance scores, which
can help in feature selection and understanding the model's behavior.

Popular Implementations of Gradient Boosting:


1. XGBoost (Extreme Gradient Boosting): XGBoost is an efficient and highly
optimized implementation of Gradient Boosting. It is widely used in machine
learning competitions and real-world applications due to its speed and performance.
2. LightGBM: LightGBM is another efficient implementation of Gradient Boosting,
designed for high performance. It's known for its speed and ability to work with
large datasets.

3. CatBoost: CatBoost is a specialized Gradient Boosting implementation for categorical feature handling. It automatically encodes categorical features and is known for its simplicity and strong default hyperparameters.
Gradient Boosting is a powerful technique but can be computationally expensive and may
require careful hyperparameter tuning. It is particularly well-suited for structured data and
structured tabular datasets. When applied effectively, Gradient Boosting can deliver state-of-
the-art results in many machine learning tasks.

4.6 STACKING
Stacking is one of the popular ensemble modeling techniques in machine learning.
Various weak learners are ensembled in a parallel manner in such a way that, by
combining them with a meta learner, we can produce better predictions for the future.
This ensemble technique works by feeding the combined predictions of multiple weak learners
into a meta learner so that a better output prediction model can be achieved.
In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how
to best combine the input predictions to make a better output prediction.
Stacking is also known as a stacked generalization and is an extended form of the Model
Averaging Ensemble technique in which all sub-models equally participate as per their
performance weights and build a new model with better predictions. This new model is
stacked up on top of the others; this is the reason why it is named stacking.

Architecture of Stacking
The architecture of the stacking model is designed in such as way that it consists of two or
more base/learner's models and a meta-model that combines the predictions of the base
models. These base models are called level 0 models, and the meta-model is known as the
level 1 model. So, the Stacking ensemble method includes original (training) data,
primary level models, primary level prediction, secondary level model, and final
prediction. The basic architecture of stacking can be described in terms of the following components.

• Original data: This data is divided into n-folds and is also considered test data or
training data.
• Base models: These models are also referred to as level-0 models. These models
use training data and provide compiled predictions (level-0) as an output.
• Level-0 Predictions: Each base model is triggered on some training data and
provides different predictions, which are known as level-0 predictions.

• Meta Model: The architecture of the stacking model consists of one meta-model,
which helps to best combine the predictions of the base models. The meta-model is
also known as the level-1 model.
• Level-1 Prediction: The meta-model learns how to best combine the predictions of
the base models and is trained on different predictions made by individual base
models, i.e., data not used to train the base models are fed to the meta-model,
predictions are made, and these predictions, along with the expected outputs,
provide the input and output pairs of the training dataset used to fit the meta-model.

Steps to implement Stacking models:


There are some important steps to implementing stacking models in machine learning. These
are as follows:
• Split training data sets into n-folds using the RepeatedStratifiedKFold as this is
the most common approach to preparing training datasets for meta-models.
• Now the base model is fitted on the first n-1 folds, and it will make predictions for the nth fold.
• The prediction made in the above step is added to the x1_train list.
• Repeat steps 2 & 3 for the remaining n-1 folds, which will give an x1_train array of size n.
• Now, the model is trained on all the n parts, which will make predictions for the
sample data.
• Add this prediction to the y1_test list.
• In the same way, we can find x2_train, y2_test, x3_train, and y3_test by using Model
2 and 3 for training, respectively, to get Level 2 predictions.
• Now train the Meta model on level 1 prediction, where these predictions will be
used as features for the model.
• Finally, Meta learners can now be used to make a prediction on test data in the
stacking model.
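scikit-learn's StackingClassifier automates most of these steps (it uses k-fold cross-validation internally to create the level-0 predictions that train the meta-model). The sketch below is a minimal illustration; the base models, the meta-model and the dataset are arbitrary choices.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Level-0 (base) models
base_models = [("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
               ("svc", SVC(probability=True, random_state=1))]

# Level-1 meta-model trained on out-of-fold predictions of the base models (cv=5)
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_train, y_train)
print("Stacking test accuracy: %.3f" % stack.score(X_test, y_test))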

Stacking Ensemble Family


There are some other ensemble techniques that can be considered the forerunner of the
stacking method. For better understanding, we have divided them into the different
frameworks of essential stacking so that we can easily understand the differences between
methods and the uniqueness of each technique. Let's discuss a few commonly used
ensemble techniques related to stacking.

Voting ensembles:
This is one of the simplest stacking ensemble methods, which uses different algorithms to
prepare all members individually. Unlike the stacking method, the voting ensemble uses
simple statistics instead of learning how to best combine predictions from base models
separately.
It is useful for solving regression problems where we need to predict the mean or median of the predictions from the base models. Further, it is also helpful in various classification problems, where the prediction is decided according to the total votes received. Predicting the label with the highest number of votes is referred to as hard voting, whereas predicting the label that receives the largest sum of predicted probabilities is referred to as soft voting.
The voting ensemble differs from the stacking ensemble in that it does not weigh models based on each member's performance; here, all models are considered to have the same skill level.
Member Assessment: In the voting ensemble, all members are assumed to have the same
skill sets.
Combine with Model: Instead of using combined prediction from each member, it uses
simple statistics to get the final prediction, e.g., mean or median.
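A minimal sketch of a soft-voting ensemble with scikit-learn's VotingClassifier is shown below; the three member models and the dataset are arbitrary choices for illustration.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Three differently trained members; "soft" voting averages their predicted probabilities
voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("knn", KNeighborsClassifier()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="soft")

print("Voting ensemble CV accuracy: %.3f" % cross_val_score(voter, X, y, cv=5).mean())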

Weighted Average Ensemble


The weighted average ensemble is considered the next level of the voting ensemble, which
uses a diverse collection of model types as contributing members. This method uses some
training datasets to find the average weight of each ensemble member based on their
performance. An improvement over this naive approach is to weigh each member based on
its performance on a hold-out dataset, such as a validation set or out-of-fold predictions
during k-fold cross-validation. Furthermore, it may also involve tuning the coefficient
weightings for each model using an optimization algorithm and performance on a holdout
dataset.
Member Assessment: Weighted average ensemble method uses member performance
based on the training dataset.
Combine With Model: It considers the weighted average of prediction from each member
separately.

Blending Ensemble:
Blending is a similar approach to stacking with a specific configuration. It is considered a stacking method that uses a holdout validation set, rather than k-fold cross-validation, to prepare out-of-sample predictions for the meta-model. In this method, the training dataset is first split into training sets and a validation set, and then the learner models are trained on the training sets. Further, predictions are made on the validation set and the test set, where the validation predictions are used as features to build a new model, which is later used to make final predictions on the test set using the prediction values as features.
Member Predictions: The blending stacking ensemble uses out-of-sample predictions on a
validation set.
Combine With Model: Linear model (e.g., linear regression or logistic regression).

Super Learner Ensemble:


This method is quite similar to blending, which has a specific configuration of a stacking
ensemble. It uses out-of-fold predictions from learner models and prepares a meta-model.
However, it is considered a modified form of blending, which only differs in the selection of
how out-of-sample predictions are prepared for the meta learner.

Summary of Stacking Ensemble


Stacking is an ensemble method that enables a model to learn how to combine the predictions given by learner models with a meta-model and prepare a final model with accurate predictions. The main benefit of the stacking ensemble is that it can harness the capabilities of a range of well-performing models to solve classification and regression problems. Further, it helps to prepare a better model with better predictions than all of the individual models. In this topic, we have learned various ensemble techniques and their definitions, the stacking ensemble method, the architecture of stacking models, and the steps to implement stacking models in machine learning.

===000===

5. ARTIFICIAL NEURAL NETWORK TUTORIAL

Artificial Neural Network Tutorial provides basic and advanced concepts of ANNs. Our Artificial Neural Network tutorial is developed for beginners as well as professionals.
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain. An Artificial neural network is usually a computational
network based on biological neural networks that construct the structure of the human
brain. Similar to a human brain has neurons interconnected to each other, artificial neural
networks also have neurons that are linked to each other in various layers of the networks.
These neurons are known as nodes.
Artificial neural network tutorial covers all the aspects related to the artificial neural network.
In this tutorial, we will discuss ANNs, Adaptive resonance theory, Kohonen self-organizing
map, Building blocks, unsupervised learning, Genetic algorithm, etc.

What is Artificial Neural Network?


The term "Artificial Neural Network" is derived from Biological neural networks that
develop the structure of a human brain. Similar to the human brain that has neurons
interconnected to one another, artificial neural networks also have neurons that are
interconnected to one another in various layers of the networks. These neurons are known
as nodes.

The given figure illustrates the typical diagram of Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.

Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks,
cell nucleus represents Nodes, synapse represents Weights, and Axon represents Output.
Relationship between Biological neural network and artificial neural network:

Biological Neural Network Artificial Neural Network

Dendrites Inputs

Cell nucleus Nodes

Synapse Weights

Axon Output
An Artificial Neural Network is an attempt, in the field of Artificial Intelligence, to mimic the network of neurons that makes up a human brain, so that computers have the option to understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave simply like interconnected brain cells.
There are around 1000 billion neurons in the human brain. Each neuron has an association
point somewhere in the range of 1,000 and 100,000. In the human brain, data is stored in
such a manner as to be distributed, and we can extract more than one piece of this data
when necessary from our memory parallelly. We can say that the human brain is made up of
incredibly amazing parallel processors.
We can understand the artificial neural network with an example, consider an example of a
digital logic gate that takes an input and gives an output. "OR" gate, which takes two inputs.
If one or both the inputs are "On," then we get "On" in output. If both the inputs are "Off,"
then we get "Off" in output. Here the output depends upon input. Our brain does not
perform the same task. The outputs to inputs relationship keep changing because of the
neurons in our brain, which are "learning."

The architecture of an artificial neural network:


To understand the concept of the architecture of an artificial neural network, we have to understand what a neural network consists of. A neural network consists of a large number of artificial neurons, termed units, arranged in a sequence of layers. Let us look at the various types of layers available in an artificial neural network.
Artificial Neural Network primarily consists of three layers:

Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer presents in-between input and output layers. It performs all the calculations
to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
The artificial neural network takes the inputs, computes the weighted sum of the inputs, and includes a bias. This computation is represented in the form of a transfer function: net = w1*x1 + w2*x2 + ... + wn*xn + b.
The weighted total is then passed as an input to an activation function to produce the output. Activation functions choose whether a node should fire or not. Only the nodes that fire make it to the output layer. There are distinct activation functions available that can be applied depending on the sort of task we are performing.

Advantages of Artificial Neural Network (ANN)


Parallel processing capability:
Artificial neural networks have a numerical value that can perform more than one task
simultaneously.
Storing data on the entire network:
Data that is used in traditional programming is stored on the whole network, not on a
database. The disappearance of a couple of pieces of data in one place doesn't prevent the
network from working.
Capability to work with incomplete knowledge:
After ANN training, the information may produce output even with inadequate data. The
loss of performance here relies upon the significance of missing data.
Having a memory distribution:
For an ANN to be able to adapt, it is important to determine the examples and to teach the network according to the desired output by demonstrating these examples to the network. The success of the network is directly proportional to the chosen instances, and if the examples cannot be shown to the network in all their aspects, it can produce false output.
Having fault tolerance:
Corruption of one or more cells of an ANN does not prevent it from generating output, and this feature makes the network fault-tolerant.

Disadvantages of Artificial Neural Network:


Assurance of proper network structure:

There is no particular guideline for determining the structure of artificial neural networks.
The appropriate network structure is accomplished through experience, trial, and error.
Unrecognized behavior of the network:
It is the most significant issue of ANN. When ANN produces a testing solution, it does not
provide insight concerning why and how. It decreases trust in the network.
Hardware dependence:
Artificial neural networks need processors with parallel processing power, in accordance with their structure. Therefore, the realization of the equipment is dependent on suitable hardware.
Difficulty of showing the issue to the network:
ANNs can work with numerical data. Problems must be converted into numerical values
before being introduced to ANN. The presentation mechanism to be resolved here will
directly impact the performance of the network. It relies on the user's abilities.
The duration of the network is unknown:
The network is reduced to a specific value of the error, and this value does not give us
optimum results.
Artificial neural networks, which stepped into the world in the mid-20th century, are developing exponentially. In the present time, we have investigated the pros of artificial neural networks and the issues encountered in the course of their utilization. It should not be overlooked that the cons of ANNs, a flourishing branch of science, are being eliminated one by one, while their pros are increasing day by day, which means that artificial neural networks will progressively become an irreplaceable part of our lives.

How do artificial neural networks work?


An artificial neural network can best be represented as a weighted directed graph, where the
artificial neurons form the nodes. The associations between neuron outputs and neuron
inputs can be viewed as directed edges with weights. The artificial neural network
receives the input signal from an external source in the form of a pattern or image as a
vector. These inputs are then mathematically denoted by the notation x(n) for each of the
n inputs.


Afterward, each input is multiplied by its corresponding weight (these weights are
the details utilized by the artificial neural network to solve a specific problem). In general
terms, these weights represent the strength of the interconnection between neurons
inside the artificial neural network. All the weighted inputs are summed inside the
computing unit.
If the weighted sum is equal to zero, then a bias is added to make the output non-zero, or
otherwise to scale up the system's response. The bias behaves like an additional input fixed at 1
with its own weight. Here the total of the weighted inputs can be in the range of 0 to positive infinity.
Here, to keep the response within the limits of the desired value, a certain maximum value is
benchmarked, and the total of the weighted inputs is passed through the activation function.
The activation function refers to the set of transfer functions used to achieve the desired
output. There are different kinds of activation functions, but they are primarily either linear or non-
linear sets of functions. Some of the commonly used activation functions are the
binary, linear, and tan hyperbolic sigmoidal activation functions. Let us take a look at each
of them in detail:

Binary:
In the binary activation function, the output is either a one or a zero. To accomplish this,
a threshold value is set up. If the net weighted input of the neuron is greater than the threshold,
then the final output of the activation function is returned as one; otherwise the output is
returned as zero.

Sigmoidal Hyperbolic:
The Sigmoidal Hyperbola function is generally seen as an "S"-shaped curve. Here the tan
hyperbolic function is used to approximate output from the actual net input. The function is
defined as:
F(x) = 1/(1 + exp(-βx))
where β is considered the steepness parameter.
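To make these two activation functions concrete, here is a minimal Python sketch (added for illustration, not part of the original text); the threshold and steepness values below are arbitrary example choices:

import numpy as np

def binary_activation(net_input, threshold=0.0):
    # Returns 1 when the net weighted input exceeds the threshold, otherwise 0
    return 1 if net_input > threshold else 0

def sigmoidal_hyperbolic(x, steepness=1.0):
    # F(x) = 1 / (1 + exp(-beta * x)), where beta is the steepness parameter
    return 1.0 / (1.0 + np.exp(-steepness * x))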

Types of Artificial Neural Network:


There are various types of Artificial Neural Networks (ANN), which, depending on how the
neurons and network functions of the human brain are modeled, perform tasks in a similar way.
The majority of artificial neural networks have some similarities with their more
complex biological counterparts and are very effective at their expected tasks, for example,
segmentation or classification.
Feedback ANN:
In this type of ANN, the output returns into the network to accomplish the best-evolved
results internally. As per the University of Massachusetts Lowell Centre for Atmospheric
Research, feedback networks feed information back into themselves and are well suited to
solving optimization problems. Internal system error corrections utilize feedback ANNs.
Feed-Forward ANN:
A feed-forward network is a basic neural network comprising an input layer, an output
layer, and at least one hidden layer of neurons. Through assessment of its output by reviewing its
input, the intensity of the network can be noticed based on the group behavior of the associated
neurons, and the output is decided. The primary advantage of this network is that it learns
how to evaluate and recognize input patterns.
Prerequisite
No specific expertise is needed as a prerequisite before starting this tutorial.
Audience
Our Artificial Neural Network Tutorial is developed for beginners as well as professionals,
to help them understand the basic concept of ANNs.
Problems
We assure you that you will not find any problem in this Artificial Neural Network tutorial.
But if there is any problem or mistake, please post the problem in the contact form so that
we can further improve it.

5.1 PERCEPTRON- SINGLE ARTIFICIAL NEURON


The perceptron is a single processing unit of any neural network. First proposed by Frank
Rosenblatt in 1958, it is a simple neuron which is used to classify its input into one of two
categories. The perceptron is a linear classifier and is used in supervised learning. It helps to
organize the given input data.
A perceptron is a neural network unit that does a precise computation to detect features in
the input data. The perceptron is mainly used to classify the data into two parts. Therefore, it is
also known as a Linear Binary Classifier.

The perceptron uses a step function that returns +1 if the weighted sum of its input is greater
than or equal to 0, and -1 otherwise. The activation function is used to map the input to a
required range such as (0, 1) or (-1, 1).
A regular neural network looks like this:


The perceptron consists of 4 parts.


o Input value or One input layer: The input layer of the perceptron is made of
artificial input neurons and takes the initial data into the system for further
processing.
o Weights and Bias:
Weight: It represents the strength of the connection between units. If
the weight from node 1 to node 2 is larger, then neuron 1 has a greater influence
on neuron 2.
Bias: It is the same as the intercept added in a linear equation. It is an additional
parameter whose task is to modify the output along with the weighted sum of the
inputs to the next neuron.
o Net sum: It calculates the weighted sum of all the inputs.
o Activation Function: Whether a neuron is activated or not is determined by an
activation function. The activation function takes the weighted sum, adds the bias
to it, and maps the result to the final output.

A standard neural network looks like the below diagram.


How does it work?


The perceptron works on these simple steps which are given below:
a. In the first step, all the inputs x are multiplied by their weights w.

b. In this step, all the multiplied values are added together; this result is called the weighted sum.

c. In the last step, the weighted sum is applied to the appropriate activation function.
For Example:
A Unit Step Activation Function
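As a concrete illustration (a minimal sketch added here, not part of the original text), the three steps with a unit step activation function can be written in Python; the input values, weights, and bias below are made up for the example:

import numpy as np

def unit_step(weighted_sum):
    # Unit step activation: fire (1) when the weighted sum is non-negative, else 0
    return 1 if weighted_sum >= 0 else 0

def perceptron(x, w, bias):
    # Steps a and b: multiply the inputs by their weights and add them up (plus the bias)
    weighted_sum = np.dot(x, w) + bias
    # Step c: apply the activation function
    return unit_step(weighted_sum)

# Hypothetical example: two inputs, two weights, one bias
print(perceptron(np.array([1.0, 0.5]), np.array([0.4, -0.2]), bias=0.1))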


There are two types of architecture. These types focus on the functionality of artificial neural
networks as follows-
o Single Layer Perceptron
o Multi-Layer Perceptron

Single Layer Perceptron


The single-layer perceptron was the first neural network model, proposed in 1958 by Frank
Rosenblatt. It is one of the earliest models for learning. Our goal is to find a linear decision
function determined by the weight vector w and the bias parameter b.
To understand the perceptron layer, it is necessary to comprehend artificial neural networks
(ANNs).
The artificial neural network (ANN) is an information processing system, whose mechanism
is inspired by the functionality of biological neural circuits. An artificial neural network
consists of several processing units that are interconnected.
This was the first proposal when the neural model was built. The content of the neuron's local
memory consists of a vector of weights.
The output of the single-layer perceptron is calculated by summing each element of the input
vector multiplied by its corresponding weight. The value obtained this way is then passed to an
activation function, whose result is displayed as the output.
Let us focus on the implementation of a single-layer perceptron for an image classification
problem using TensorFlow. The best example of drawing a single-layer perceptron is
through the representation of "logistic regression."

Now, we have to do the following necessary steps of training the logistic regression-


o The weights are initialized with the random values at the origination of each
training.
o For each element of the training set, the error is calculated with the difference
between the desired output and the actual output. The calculated error is used to
adjust the weight.
o The process is repeated until the error made on the entire training set is less than the
specified limit, or until the maximum number of iterations has been reached.


Complete code of Single layer perceptron


# Import the MNIST dataset
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
import tensorflow as tf
import matplotlib.pyplot as plt

# Parameters
learning_rate = 0.01
training_epochs = 25
batch_size = 100
display_step = 1

# tf Graph Input
x = tf.placeholder("float", [None, 784])  # MNIST data image of shape 28*28 = 784
y = tf.placeholder("float", [None, 10])   # 0-9 digits recognition => 10 classes

# Create model
# Set model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Constructing the model: softmax of the linear function
activation = tf.nn.softmax(tf.matmul(x, W) + b)

# Minimizing error using cross entropy
cross_entropy = y * tf.log(activation)
cost = tf.reduce_mean(-tf.reduce_sum(cross_entropy, reduction_indices=1))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Plot settings
avg_set = []
epoch_set = []

# Initializing the variables
init = tf.global_variables_initializer()

# Launching the graph
with tf.Session() as sess:
    sess.run(init)
    # Training cycle over the dataset
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples / batch_size)
        # Loop over all the batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Fit the training using the batch data
            sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys})
            # Compute the average loss
            avg_cost += sess.run(cost, feed_dict={x: batch_xs, y: batch_ys}) / total_batch
        # Display the logs at each epoch step
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(avg_cost))
        avg_set.append(avg_cost)
        epoch_set.append(epoch + 1)
    print("Training phase finished")

    plt.plot(epoch_set, avg_set, 'o', label='Logistic Regression Training')
    plt.ylabel('cost')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()

    # Test the model
    correct_prediction = tf.equal(tf.argmax(activation, 1), tf.argmax(y, 1))
    # Calculating the accuracy on the test dataset
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print("Model accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
The output of the Code:

Logistic regression is considered a predictive analysis. It is mainly used to describe data and
to explain the relationship between a dependent binary variable and one or more nominal or
independent variables.


5.2 MULTI-LAYER PERCEPTRON (FEED FORWARD NEURAL NETWORK)


The multi-layer perceptron defines the most complex architecture of artificial neural networks.
It is substantially formed from multiple layers of perceptrons. TensorFlow is a very
popular deep learning framework released by Google, and this section will guide you in building a
neural network with this library. If we want to understand what a multi-layer perceptron is, it
also helps to develop one from scratch using NumPy.
The pictorial representation of multi-layer perceptron learning is as shown below-

MLP networks are used in a supervised learning format. A typical learning algorithm for
MLP networks is the backpropagation algorithm.
A multilayer perceptron (MLP) is a feed-forward artificial neural network that generates a set
of outputs from a set of inputs. An MLP is characterized by several layers of nodes
connected as a directed graph between the input and output layers. An MLP uses
backpropagation for training the network, and it is a deep learning method.
Now, we are focusing on the implementation with MLP for an image classification problem.
# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

import tensorflow as tf
import matplotlib.pyplot as plt

# Parameters
learning_rate = 0.001
training_epochs = 20
batch_size = 100
display_step = 1

# Network Parameters
n_hidden_1 = 256  # 1st layer num features
n_hidden_2 = 256  # 2nd layer num features
n_input = 784     # MNIST data input (img shape: 28*28)
n_classes = 10    # MNIST total classes (0-9 digits)

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])

# weights and bias of layer 1
h = tf.Variable(tf.random_normal([n_input, n_hidden_1]))
bias_layer_1 = tf.Variable(tf.random_normal([n_hidden_1]))

# layer 1
layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, h), bias_layer_1))

# weights and bias of layer 2
w = tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2]))
bias_layer_2 = tf.Variable(tf.random_normal([n_hidden_2]))

# layer 2
layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, w), bias_layer_2))

# weights and bias of the output layer
output = tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
bias_output = tf.Variable(tf.random_normal([n_classes]))

# output layer
output_layer = tf.matmul(layer_2, output) + bias_output

# cost function
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=output_layer, labels=y))

# optimizer
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
# Alternatively:
# optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Plot settings
avg_set = []
epoch_set = []

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples / batch_size)

        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Fit training using batch data
            sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys})
            # Compute average loss
            avg_cost += sess.run(cost, feed_dict={x: batch_xs, y: batch_ys}) / total_batch

        # Display logs per epoch step
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(avg_cost))
        avg_set.append(avg_cost)
        epoch_set.append(epoch + 1)
    print("Training phase finished")

    plt.plot(epoch_set, avg_set, 'o', label='MLP Training phase')
    plt.ylabel('cost')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()

    # Test model
    correct_prediction = tf.equal(tf.argmax(output_layer, 1), tf.argmax(y, 1))

    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print("Model Accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))

The above code generates the following output-

Creating an interactive session


We have two basic options when using TensorFlow to run our code:
o Build graphs and run sessions [do all the set-up and then execute a session to
evaluate tensors and run operations].
o Create our code and run it on the fly.
For this first part, we will use the interactive session that is more suitable for an environment
like a Jupyter notebook.
sess = tf.InteractiveSession()
Creating placeholders
It's a best practice to create placeholders before variable assignments when using
TensorFlow. Here we'll create placeholders for the inputs ("Xs") and outputs ("Ys"), as sketched
after the list below.
Placeholder "X": Represents the 'space' allocated for the input images.
o Each input has 784 pixels distributed by a 28 width x 28 height matrix.
o The 'shape' argument defines the tensor size by its dimensions.
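A minimal sketch of these placeholders in TensorFlow 1.x (the variable names are illustrative, not taken from the original code):

import tensorflow as tf

# Placeholder for the inputs: each image is 28 x 28 = 784 pixels;
# None lets the first dimension (the batch size) vary.
X = tf.placeholder(tf.float32, shape=[None, 784])

# Placeholder for the outputs: one-hot labels for the 10 digit classes.
Y = tf.placeholder(tf.float32, shape=[None, 10])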

Feed Forward Process in Deep Neural Network


Now we know how the combination of lines with different weights and biases can
result in non-linear models. How does a neural network know what weight and bias values
to have in each layer? It is no different from how we did it for the single
perceptron model.
We are still making use of a gradient descent optimization algorithm, which acts to minimize
the error of our model by iteratively moving in the direction of steepest descent, the
direction which updates the parameters of our model while ensuring minimal error. It
updates the weights of every model in every single layer. We will talk more about
optimization algorithms and backpropagation later.
It is important to recognize what happens in the subsequent training of our neural network:
recognition is done by dividing our data samples with some decision boundary.
"The process of receiving an input to produce some kind of output to make some kind of
prediction is known as Feed Forward." Feed Forward neural network is the core of many
other important neural networks such as convolution neural network.
In the feed-forward neural network, there are no feedback loops or connections in the
network. There is simply an input layer, a hidden layer, and an output layer.

There can be multiple hidden layers, depending on what kind of data you are dealing
with. The number of hidden layers is known as the depth of the neural network, and a deeper
neural network can learn more complex functions. The input layer first provides the neural
network with data, and the output layer then makes predictions on that data based on a series
of functions. The ReLU function is the most commonly used activation function in deep
neural networks.
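As a brief aside (an illustrative sketch added here, not part of the original text), ReLU simply passes positive values through unchanged and clips negative values to zero:

import numpy as np

def relu(z):
    # ReLU returns z when z > 0 and 0 otherwise
    return np.maximum(0, z)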
To gain a solid understanding of the feed-forward process, let's see this mathematically.
1) The first input is fed to the network, represented as the matrix [x1, x2, 1], where 1 is the
bias value.

2) Each input is multiplied by the weights of the first and the second model to obtain its
probability of being in the positive region in each model.
So, we will multiply our inputs by a matrix of weights using matrix multiplication.

3) After that, we will take the sigmoid of our scores, which gives us the probability of the point
being in the positive region in both models.


4) We multiply the probability which we have obtained from the previous step with the
second set of weights. We always include a bias of one whenever taking a combination of
inputs.

And, as we know, to obtain the probability of the point being in the positive region of this
model, we take the sigmoid, thus producing our final output in a feed-forward process.

Let us take the neural network which we had previously, with the linear models and
the hidden layer which combine to form the non-linear model in the output layer.

So, we will use our non-linear model to produce an output that describes the
probability of the point being in the positive region. The point is represented by (2, 2).
Along with the bias, we will represent the input as

Recall the first linear model in the hidden layer and the equation that defined it.
This means that in the first layer, to obtain the linear combination, the inputs are multiplied by
-4 and -1, and the bias value is multiplied by twelve.

The inputs are multiplied by the weights -1/5 and 1, and the bias is multiplied by three, to
obtain the linear combination of that same point in our second model.


Now, to obtain the probability that the point is in the positive region relative to both models,
we apply the sigmoid to both scores as

The second layer contains the weights which dictated the combination of the linear models
in the first layer to obtain the non-linear model in the second layer. The weights are 1.5, 1,
and a bias value of 0.5.
Now, we have to multiply our probabilities from the first layer with the second set of
weights as

Now, we will take the sigmoid of our final score

This is the complete math behind the feed-forward process, where the inputs from the input
layer traverse the entire depth of the neural network. In this example, there is only one hidden
layer. Whether there is one hidden layer or twenty, the computational process is the same
for all hidden layers.
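To make the arithmetic above concrete, here is a small NumPy sketch (added for illustration, not part of the original text) that reproduces the worked example using the values quoted in the text: input point (2, 2), hidden-layer weights -4, -1 with bias 12 and -1/5, 1 with bias 3, and second-layer weights 1.5, 1 with bias 0.5:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input point (2, 2) plus a bias input of 1
x = np.array([2.0, 2.0, 1.0])

# Hidden layer: the two linear models quoted in the text
W_hidden = np.array([[-4.0, -1.0, 12.0],    # first linear model
                     [-0.2,  1.0,  3.0]])   # second linear model (-1/5, 1, bias 3)
hidden_scores = W_hidden @ x
hidden_probs = sigmoid(hidden_scores)       # probabilities of being in the positive region

# Output layer: combine the two hidden probabilities, again with a bias input of 1
W_output = np.array([1.5, 1.0, 0.5])
final_score = W_output @ np.append(hidden_probs, 1.0)
print(sigmoid(final_score))                 # final feed-forward output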

5.3 RESTRICTED BOLTZMANN MACHINE


Restricted Boltzmann Machine (RBM) is a type of artificial neural network that is used for
unsupervised learning. It is a type of generative model that is capable of learning a
probability distribution over a set of input data.
RBMs were popularized in the mid-2000s by Hinton and Salakhutdinov as a way to address the
problem of unsupervised learning. An RBM is a type of neural network that consists of two layers
of neurons: a visible layer and a hidden layer. The visible layer represents the input data,
while the hidden layer represents a set of features that are learned by the network.


The RBM is called “restricted” because the connections between the neurons in the same
layer are not allowed. In other words, each neuron in the visible layer is only connected to
neurons in the hidden layer, and vice versa. This allows the RBM to learn a compressed
representation of the input data by reducing the dimensionality of the input.
The RBM is trained using a process called contrastive divergence, which is a variant of the
stochastic gradient descent algorithm. During training, the network adjusts the weights of
the connections between the neurons in order to maximize the likelihood of the training
data. Once the RBM is trained, it can be used to generate new samples from the learned
probability distribution.
RBM has found applications in a wide range of fields, including computer vision, natural
language processing, and speech recognition. It has also been used in combination with
other neural network architectures, such as deep belief networks and deep neural networks,
to improve their performance.

What are Boltzmann Machines?


It is a network of neurons in which all the neurons are connected to each other. In this
machine, there are two layers: the visible layer (or input layer) and the hidden layer. The visible
layer is denoted as v and the hidden layer is denoted as h. In a Boltzmann machine, there
is no output layer. Boltzmann machines are stochastic and generative neural networks capable
of learning internal representations and are able to represent and (given enough time) solve
tough combinatoric problems.
The name comes from the Boltzmann distribution (also known as the Gibbs distribution), which is
an integral part of statistical mechanics and explains the impact of parameters like entropy and
temperature on quantum states in thermodynamics. Due to this, Boltzmann machines are also known
as Energy-Based Models (EBM). The Boltzmann machine was invented in 1985 by Geoffrey Hinton, then a
Professor at Carnegie Mellon University, and Terry Sejnowski, then a Professor at Johns
Hopkins University.

What are Restricted Boltzmann Machines (RBM)?


The term "restricted" refers to the fact that we are not allowed to connect neurons of the same
layer to each other. In other words, two neurons of the input layer or of the hidden layer can't
connect to each other, although the hidden layer and the visible layer can be connected to each other.

As there is no output layer in this machine, the question arises: how are we going to identify
and adjust the weights, and how are we going to measure whether our prediction is accurate or
not? All these questions have one answer, and that is the Restricted Boltzmann Machine.
The RBM algorithm was proposed by Geoffrey Hinton (2007), and it learns a probability
distribution over its sample training data inputs. It has seen wide applications in different
areas of supervised/unsupervised machine learning such as feature learning, dimensionality
reduction, classification, collaborative filtering, and topic modeling.
Consider the example of movie ratings discussed in the recommender system section.
Movies like Avengers, Avatar, and Interstellar have strong associations with the latest fantasy
and science fiction factor. Based on the user ratings, the RBM will discover latent factors that can
explain the activation of movie choices. In short, an RBM describes the variability among
correlated variables of the input dataset in terms of a potentially lower number of unobserved
variables.
The energy function is given by
E(v, h) = - Σi ai vi - Σj bj hj - Σi Σj vi wij hj
where vi and hj are the states of the visible and hidden units, ai and bj are their biases, and
wij is the weight between visible unit i and hidden unit j.
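A small Python sketch (added here for illustration; the vectors and matrix passed in are hypothetical) that computes this energy for a given configuration:

import numpy as np

def rbm_energy(v, h, W, a, b):
    # E(v, h) = -a.v - b.h - v^T W h
    return -np.dot(a, v) - np.dot(b, h) - v @ W @ h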

Applications of Restricted Boltzmann Machine


Restricted Boltzmann Machines (RBMs) have found numerous applications in various fields,
some of which are:
• Collaborative filtering: RBMs are widely used in collaborative filtering for
recommender systems. They learn to predict user preferences based on their past
behavior and recommend items that are likely to be of interest to the user.
• Image and video processing: RBMs can be used for image and video processing
tasks such as object recognition, image denoising, and image reconstruction. They
can also be used for tasks such as video segmentation and tracking.
• Natural language processing: RBMs can be used for natural language processing
tasks such as language modeling, text classification, and sentiment analysis. They can
also be used for tasks such as speech recognition and speech synthesis.
• Bioinformatics: RBMs have found applications in bioinformatics for tasks such as
protein structure prediction, gene expression analysis, and drug discovery.
• Financial modeling: RBMs can be used for financial modeling tasks such as
predicting stock prices, risk analysis, and portfolio optimization.
• Anomaly detection: RBMs can be used for anomaly detection tasks such as fraud
detection in financial transactions, network intrusion detection, and medical
diagnosis.
• It is used in Filtering.
• It is used in Feature Learning.
• It is used in Classification.
• It is used in Risk Detection.
• It is used in Business and Economic analysis.


How do Restricted Boltzmann Machines work?


In an RBM, there are two phases through which the entire RBM works:
1st Phase: In this phase, we take the input layer and, using the concept of weights and biases,
we activate the hidden layer. This process is called the Feed Forward Pass. In
the Feed Forward Pass, we identify the positive association and the negative association.
Feed Forward Equation:
• Positive Association: When the association between the visible unit and the hidden
unit is positive.
• Negative Association: When the association between the visible unit and the hidden
unit is negative.
2nd Phase: As we don't have any output layer, instead of calculating an output, we
reconstruct the input layer through the activated hidden state. This process is called
the Feed Backward Pass. We are simply backtracking the input layer through the activated
hidden neurons. After performing this, we have reconstructed the input through the activated
hidden state. So, we can calculate the error and adjust the weights in this way:
Feed Backward Equation:
• Error = Reconstructed Input Layer - Actual Input Layer
• Adjust Weight = Input * error * learning rate (0.1)
After doing all the steps, we get the pattern that is responsible for activating the hidden
neurons (a short code sketch of these two phases follows the example below).
To understand how it works:
Let us consider an example in which we assume that visible unit V1 activates the hidden units
h1 and h2, and visible unit V2 activates the hidden units h2 and h3. Now, when a new visible
unit, say V5, comes into the machine and it also activates h1 and h2, we can backtrack the
hidden units easily and identify that the characteristics of the new V5 neuron match those of
V1. This is because V1 also activated the same hidden units earlier.
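Below is a minimal NumPy sketch (added for illustration; the layer sizes are made up, and the learning rate of 0.1 matches the text) of the two phases described above: a feed-forward pass that activates the hidden units, a feed-backward pass that reconstructs the input, and a simple error-driven weight adjustment. It is a simplified stand-in for the "Input * error * learning rate" rule quoted above and for full contrastive divergence training:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1                     # hypothetical sizes, learning rate 0.1
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
v = rng.integers(0, 2, size=n_visible).astype(float)    # an example binary input vector

# 1st phase (Feed Forward Pass): activate the hidden layer from the visible layer
h = sigmoid(v @ W)

# 2nd phase (Feed Backward Pass): reconstruct the input from the activated hidden state
v_reconstructed = sigmoid(W @ h)

# Error = reconstructed input - actual input; adjust the weights using the error
error = v_reconstructed - v
W -= lr * np.outer(error, h)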

Restricted Boltzmann Machines


Types of RBM :
There are mainly two types of Restricted Boltzmann Machine (RBM) based on the types of
variables they use:


1. Binary RBM: In a binary RBM, the input and hidden units are binary variables.
Binary RBMs are often used in modeling binary data such as images or text.
2. Gaussian RBM: In a Gaussian RBM, the input and hidden units are continuous
variables that follow a Gaussian distribution. Gaussian RBMs are often used in
modeling continuous data such as audio signals or sensor data.

Apart from these two types, there are also variations of RBMs such as:
1. Deep Belief Network (DBN): A DBN is a type of generative model that consists
of multiple layers of RBMs. DBNs are often used in modeling high-dimensional
data such as images or videos.
2. Convolutional RBM (CRBM): A CRBM is a type of RBM that is designed
specifically for processing images or other grid-like structures. In a CRBM, the
connections between the input and hidden units are local and shared, which makes
it possible to capture spatial relationships between the input units.
3. Temporal RBM (TRBM): A TRBM is a type of RBM that is designed for
processing temporal data such as time series or video frames. In a TRBM, the
hidden units are connected across time steps, which allows the network to model
temporal dependencies in the data.

===000===


ABOUT THE AUTHOR

Insert author bio text here. Insert author bio text here Insert author bio text here Insert
author bio text here Insert author bio text here Insert author bio text here Insert author bio
text here Insert author bio text here Insert author bio text here Insert author bio text here
Insert author bio text here Insert author bio text here Insert author bio text here Insert
author bio text here Insert author bio text here Insert author bio text here Insert author bio
text here Insert author bio text here Insert author bio text here Insert author bio text here
Insert author bio text here Insert author bio text here Insert author bio text here
