
Machine Learning


Machine learning is a growing technology that enables computers to learn automatically from past data. It
uses various algorithms to build mathematical models and make predictions from historical data or
information. Currently, it is used for tasks such as image recognition, speech recognition, email
filtering, Facebook auto-tagging, recommender systems, and many more.

What is Machine Learning


In the real world, we are surrounded by humans who can learn from their experiences, and we have
computers or machines that work on our instructions. But can a machine also learn from experience or
past data like a human does? This is where machine learning comes in.

Machine learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms that allow a computer to learn from data and past experience on its own.
The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a
summarized way as:

Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.

With the help of sample historical data, known as training data, machine learning algorithms build a
mathematical model that helps in making predictions or decisions without being explicitly programmed.
Machine learning brings computer science and statistics together to create predictive models. It
constructs or uses algorithms that learn from historical data. In general, the more relevant data we
provide, the better the performance.

A machine has the ability to learn if it can improve its performance by gaining more data.

How does Machine Learning work


A machine learning system learns from historical data, builds prediction models, and, whenever it
receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount
of data, since a large amount of data helps to build a better model that predicts the output more accurately.

Suppose we have a complex problem for which we need to make predictions. Instead of writing code for
it, we feed the data to generic algorithms, and with the help of these algorithms the machine builds the
logic from the data and predicts the output. Machine learning has changed the way we think about such
problems. The block diagram below explains the working of a machine learning algorithm:

Features of Machine Learning:


 Machine learning uses data to detect various patterns in a given dataset.
 It can learn from past data and improve automatically.
 It is a data-driven technology.
 Machine learning is very similar to data mining, as it also deals with huge amounts of data.

Need for Machine Learning
The need for machine learning is increasing day by day. The reason is that it is capable of doing tasks
that are too complex for a person to implement directly. As humans, we have limitations: we cannot
process huge amounts of data manually, so we need computer systems, and this is where machine learning
makes things easier for us.

We can train machine learning algorithms by providing them with a large amount of data and letting them
explore the data, construct models, and predict the required output automatically. The performance of a
machine learning algorithm depends on the amount of data, and it can be measured by a cost function.
With the help of machine learning, we can save both time and money.
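
As a minimal sketch of what a cost function looks like, the example below computes the mean squared error,
one common choice among many. The function name and the sample numbers are illustrative assumptions and do
not come from the original text.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical target values and two sets of predictions for them.
actual = [3.0, 5.0, 7.0]
good_predictions = [2.5, 5.0, 7.5]
poor_predictions = [6.0, 1.0, 9.0]

good_cost = mean_squared_error(actual, good_predictions)
poor_cost = mean_squared_error(actual, poor_predictions)

# A lower cost means the predictions are closer to the targets.
print(good_cost < poor_cost)
```

A lower cost indicates a better fit, so training can be viewed as searching for the model that makes this
number small.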

The importance of machine learning can be easily understood from its use cases. Currently, machine learning
is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook,
and so on. Top companies such as Netflix and Amazon have built machine learning models that use vast
amounts of data to analyze user interest and recommend products accordingly.

Following are some key points which show the importance of Machine Learning:

 Rapid growth in the production of data
 Solving complex problems that are difficult for a human
 Decision making in various sectors, including finance
 Finding hidden patterns and extracting useful information from data

Classification of Machine Learning


At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning

Supervised learning is a type of machine learning method in which we provide sample labeled data to the
machine learning system in order to train it, and on that basis, it predicts the output.

The system creates a model using labeled data to understand the dataset and learn about each data point.
Once training and processing are done, we test the model by providing sample data to check whether it
predicts the correct output.

The goal of supervised learning is to map input data to output data. Supervised learning is based on
supervision, in the same way that a student learns under the supervision of a teacher. An example of
supervised learning is spam filtering.

Supervised learning can be further grouped into two categories of algorithms:

 Classification
 Regression

2) Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns without any supervision.

The machine is trained on a set of data that has not been labeled, classified, or categorized, and the
algorithm needs to act on that data without any supervision. The goal of unsupervised learning is to
restructure the input data into new features or groups of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights
from a large amount of data. It can be further classified into two categories of algorithms:

 Clustering
 Association

3) Reinforcement Learning

Reinforcement learning is a feedback-based learning method in which a learning agent gets a reward for
each right action and a penalty for each wrong action. The agent learns automatically from this
feedback and improves its performance. In reinforcement learning, the agent interacts with the environment
and explores it. The goal of the agent is to collect the most reward points, and hence it improves its
performance.

A robotic dog that automatically learns the movement of its limbs is an example of reinforcement
learning.
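
The reward-and-penalty loop described above can be sketched with a very small agent. This is an
illustrative epsilon-greedy strategy on a made-up two-action environment, not a method taken from the
original text. The reward probabilities, step count, exploration rate, and seed are all assumptions.

```python
import random


def run_agent(reward_probabilities, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-known action, sometimes explore."""
    rng = random.Random(seed)
    n_actions = len(reward_probabilities)
    value_estimates = [0.0] * n_actions  # running average reward per action
    action_counts = [0] * n_actions

    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore a random action
        else:
            action = max(range(n_actions), key=lambda a: value_estimates[a])  # exploit

        # Hypothetical environment: +1 reward for a "right" outcome, -1 penalty otherwise.
        reward = 1.0 if rng.random() < reward_probabilities[action] else -1.0

        # Incrementally update the average reward observed for the chosen action.
        action_counts[action] += 1
        value_estimates[action] += (reward - value_estimates[action]) / action_counts[action]

    return value_estimates, action_counts


# Action 1 pays off far more often than action 0 in this made-up environment.
estimates, counts = run_agent([0.2, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, counts)
```

The agent is never told which action is correct. It only sees rewards and penalties, and its value
estimates drift toward the action that pays off more often.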

Machine Learning at present:


Machine learning research has now advanced greatly, and machine learning is present everywhere around us,
in self-driving cars, Amazon Alexa, chatbots, recommender systems, and many more. It includes
supervised, unsupervised, and reinforcement learning, with clustering, classification, decision tree,
and SVM algorithms, among others.

Modern machine learning models can be used for making various predictions, including weather
prediction, disease prediction, stock market analysis, etc.

Applications of Machine learning

Machine learning is a buzzword in today's technology, and it is growing very rapidly. We use machine
learning in our daily lives even without knowing it, for example in Google Maps, Google Assistant, and
Alexa. Below are some of the most popular real-world applications of machine learning:

1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to identify objects,
persons, places, and so on in digital images. A popular use case of image recognition and face detection is
automatic friend tagging suggestion:

Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our
Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this
is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and
person identification in the picture.

2. Speech Recognition

While using Google, we get an option to "Search by voice." This comes under speech recognition, a
popular application of machine learning.

Speech recognition is the process of converting voice instructions into text; it is also known as "speech to
text" or "computer speech recognition." At present, machine learning algorithms are widely used in
speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition
technology to follow voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the shortest route and
predicts the traffic conditions.

It predicts whether traffic is clear, slow-moving, or heavily congested in two ways:

 Real-time location of vehicles from the Google Maps app and sensors
 Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes information from the user
and sends it back to its database to improve performance.

4. Product recommendations:

Machine learning is widely used by e-commerce and entertainment companies such as Amazon and Netflix
for product recommendations. Whenever we search for a product on Amazon, we start seeing
advertisements for the same product while browsing the internet in the same browser, and this is because
of machine learning.

Google understands user interest using various machine learning algorithms and suggests products
according to customer interest.

Similarly, when we use Netflix, we find recommendations for series, movies, and so on, and this is also
done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars, in which machine learning
plays a significant role. Tesla, a well-known car manufacturer, is working on self-driving cars. It uses
machine learning to train its car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is filtered automatically as important, normal, or spam. Important
mail arrives in our inbox with the important symbol and spam goes to our spam box, and the technology
behind this is machine learning. Below are some spam filters used by Gmail:

 Content Filter
 Header filter
 General blacklists filter
 Rules-based filters
 Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
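
To make the idea of a Naïve Bayes spam classifier more concrete, here is a minimal sketch written from
scratch. It is only an illustration under simplifying assumptions (whitespace tokenization, Laplace
smoothing, a four-message made-up training set) and does not describe how Gmail or any real filter is
implemented.

```python
import math
from collections import Counter


def train_naive_bayes(messages):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in word_counts:
        # Start from the log prior: how common this label is in the training data.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing avoids a zero probability for unseen words.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


# Tiny made-up training set; "ham" means a legitimate message.
training_messages = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the project team", "ham"),
]
model = train_naive_bayes(training_messages)

print(classify("free prize money", model))
print(classify("project meeting monday", model))
```

The classifier is "naïve" because it treats every word as independent of the others given the label,
which is rarely true but often works well enough for filtering.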

7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name
suggests, they help us find information using voice instructions. These assistants can help us in
various ways just through voice instructions, such as playing music, calling someone, opening an email, or
scheduling an appointment.

Machine learning algorithms are an important part of these virtual assistants.

These assistants record our voice instructions, send them to a server in the cloud, decode them using ML
algorithms, and act accordingly.

8. Online Fraud Detection:

Machine learning makes our online transactions safe and secure by detecting fraudulent transactions.
Whenever we perform an online transaction, fraud can take place in various ways, such as fake accounts,
fake IDs, or money stolen in the middle of a transaction. To detect this, a feed-forward neural network
helps us by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become the
input for the next round. Each genuine transaction follows a specific pattern, which changes for a
fraudulent transaction; the network detects this change and so makes our online transactions more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups
and downs in share prices, so a long short-term memory (LSTM) neural network is used for predicting
stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for diagnosing diseases. With this, medical technology is
growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases more easily.

11. Automatic Language Translation:

Nowadays, if we visit a new place and do not know the language, it is not a problem, because machine
learning helps us by converting text into a language we know. Google's GNMT (Google Neural Machine
Translation) provides this feature. It is a neural machine translation system that translates text into
our familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used
with image recognition to translate text from one language to another.

The main differences between Supervised and Unsupervised learning are given below:

Supervised Learning | Unsupervised Learning
------------------- | ---------------------
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
A supervised learning model takes direct feedback to check whether it is predicting the correct output. | An unsupervised learning model does not take any feedback.
A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights in an unknown dataset.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized into classification and regression problems. | Unsupervised learning can be classified into clustering and association problems.
Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs. | Unsupervised learning can be used for cases where we have only input data and no corresponding output data.
A supervised learning model generally produces more accurate results. | An unsupervised learning model may give less accurate results than supervised learning.
Supervised learning is not close to true artificial intelligence, as we first train the model on each data point, and only then can it predict the correct output. | Unsupervised learning is closer to true artificial intelligence, as it learns in a way similar to how a child learns daily routine things from experience.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as clustering (for example, k-means) and the Apriori algorithm.

Introduction of Machine Learning Approaches


Artificial Neural Network (ANN)

An artificial neural network (ANN) is an efficient computing system whose central theme is borrowed from
the analogy of biological neural networks. ANNs are also called “artificial neural systems,” “parallel
distributed processing systems,” or “connectionist systems.” An ANN comprises a large collection of units
that are interconnected in some pattern to allow communication between the units. These units, also
referred to as nodes or neurons, are simple processors that operate in parallel.

Every neuron is connected to other neurons through connection links. Each connection link is associated
with a weight that carries information about the input signal. This is the most useful information for
neurons to solve a particular problem, because the weight usually excites or inhibits the signal being
communicated. Each neuron has an internal state, which is called an activation signal. Output signals,
which are produced by combining the input signals with the activation rule, may be sent to other units.
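
A single neuron of this kind can be sketched in a few lines. The sigmoid activation rule, the weights,
and the inputs below are assumptions chosen only for illustration; the text above does not prescribe a
particular activation function.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Combine the weighted input signals, then apply the activation rule."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)


inputs = [1.0, 0.5]
excitatory = neuron_output(inputs, weights=[2.0, -3.0], bias=0.0)  # weighted sum = 0.5
inhibitory = neuron_output(inputs, weights=[2.0, -6.0], bias=0.0)  # weighted sum = -1.0

# A more negative weight on the second input inhibits the output signal.
print(excitatory > inhibitory)
```

A positive weight pushes the output up (excites the signal) and a negative weight pushes it down
(inhibits it). A network is many such units feeding their outputs into one another.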

What is Clustering?
Clustering is a type of unsupervised learning method and a common technique for statistical data analysis
used in many fields. It is the task of dividing a set of observations into subsets, called clusters, in
such a way that observations in the same cluster are similar in some sense and dissimilar to the
observations in other clusters. In simple words, the main goal of clustering is to group the data on the
basis of similarity and dissimilarity.
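
One widely used way to form such groups is k-means, sketched below for one-dimensional data. The text
above does not name a specific clustering algorithm, so k-means is a swapped-in example, and the values,
starting centroids, and iteration count are assumptions.

```python
def assign(values, centroids):
    """Put each value into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for value in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
        clusters[nearest].append(value)
    return clusters


def k_means_1d(values, initial_centroids, iterations=10):
    """Alternate between assigning values to centroids and recomputing the centroids."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        clusters = assign(values, centroids)
        # Move each centroid to the mean of its cluster; keep it in place if the cluster is empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, assign(values, centroids)


values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(values, initial_centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied at any point. The two groups emerge only from how close the values are to one
another.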

Reinforcement Learning
This type of learning is used to reinforce or strengthen the network based on critic information. That is, a
network being trained under reinforcement learning receives some feedback from the environment.
However, the feedback is evaluative and not instructive, unlike in supervised learning. Based on this
feedback, the network adjusts its weights to obtain better critic information in the future.

This learning process is similar to supervised learning, but much less information may be available.

Decision Tree
In general, decision tree analysis is a predictive modelling tool that can be applied across many areas.
Decision trees can be constructed by an algorithmic approach that splits the dataset in different ways
based on different conditions. Decision trees are among the most powerful algorithms in the category
of supervised algorithms.

They can be used for both classification and regression tasks. The two main entities of a tree are decision
nodes, where the data is split, and leaves, where we get the outcome. An example of a binary tree for
predicting whether a person is fit or unfit, given information such as age, eating habits, and exercise
habits, is given below −

In the above decision tree, the questions are decision nodes and the final outcomes are leaves. We have the
following two types of decision trees −

 Classification decision trees − In this kind of decision tree, the decision variable is categorical. The
above decision tree is an example of a classification decision tree.
 Regression decision trees − In this kind of decision tree, the decision variable is continuous.
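
A classification decision tree of the fit/unfit kind can be written directly as nested conditions. The
sketch below is hand-built for illustration: the age threshold and the order of the questions are
hypothetical and are not taken from the original figure, and a real tree would learn its splits from
training data.

```python
def predict_fitness(age, eats_lots_of_fast_food, exercises_regularly):
    """Walk a tiny hand-written decision tree and return the leaf label."""
    # Decision node 1 (hypothetical split): is the person younger than 30?
    if age < 30:
        # Decision node 2: eating habits decide the outcome for younger people.
        return "unfit" if eats_lots_of_fast_food else "fit"
    # Decision node 3: exercise habits decide the outcome for everyone else.
    return "fit" if exercises_regularly else "unfit"


print(predict_fitness(25, eats_lots_of_fast_food=True, exercises_regularly=False))
print(predict_fitness(45, eats_lots_of_fast_food=False, exercises_regularly=True))
```

Each `if` is a decision node and each returned string is a leaf. Because the outcome is a category, this
is a classification tree; a regression tree would return a number at each leaf instead.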

A Bayesian belief network is a key technology for dealing with probabilistic events and for solving
problems that involve uncertainty. We can define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of variables and their
conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian model.

Bayesian networks are probabilistic because they are built from a probability distribution and use
probability theory for prediction and anomaly detection.

Real-world applications are probabilistic in nature, and to represent the relationships between multiple
events, we need a Bayesian network. It can be used in various tasks, including prediction, anomaly
detection, diagnostics, automated insight, reasoning, time series prediction, and decision making
under uncertainty.
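
As a minimal sketch of reasoning in such a network, consider the smallest possible directed acyclic
graph: one cause (rain) pointing to one effect (wet grass). The variables and all probability values
are made-up assumptions used only to show how a conditional dependency supports diagnosis.

```python
# Hypothetical two-node network: Rain -> WetGrass.
p_rain = 0.2                               # P(Rain)
p_wet_given_rain = {True: 0.9, False: 0.1}  # P(WetGrass | Rain), P(WetGrass | no Rain)


def joint(rain, wet):
    """P(Rain=rain, WetGrass=wet), factored along the edge of the graph."""
    p_r = p_rain if rain else 1.0 - p_rain
    p_w = p_wet_given_rain[rain] if wet else 1.0 - p_wet_given_rain[rain]
    return p_r * p_w


def posterior_rain_given_wet():
    """Diagnose the cause from the effect: P(Rain | WetGrass) by enumeration."""
    evidence = joint(True, True) + joint(False, True)
    return joint(True, True) / evidence


posterior = posterior_rain_given_wet()
print(round(posterior, 4))
```

Observing wet grass raises the probability of rain from the assumed prior of 0.2 to roughly 0.69, which
is the kind of diagnostic reasoning listed above.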

Support Vector Machine Algorithm

Support Vector Machine (SVM) is one of the most popular supervised learning algorithms and is used for
classification as well as regression problems. However, it is primarily used for classification problems
in machine learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes, so that we can easily put a new data point in the correct category in
the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed Support Vector Machine.
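
The sketch below shows how a hyperplane, once known, separates points and how the nearest points can be
identified. It does not train an SVM: the weights and bias are chosen by hand for a made-up
two-dimensional dataset, and all names are illustrative assumptions.

```python
import math


def decision_value(point, weights, bias):
    """Signed score w . x + b; its sign says which side of the hyperplane the point is on."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(point, weights, bias):
    return 1 if decision_value(point, weights, bias) >= 0 else -1


def distance_to_hyperplane(point, weights, bias):
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(decision_value(point, weights, bias)) / norm


# Hand-picked hyperplane x1 + x2 = 5 (an assumption, not a trained model).
weights, bias = [1.0, 1.0], -5.0
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]

labels = [classify(p, weights, bias) for p in points]
distances = [distance_to_hyperplane(p, weights, bias) for p in points]
margin = min(distances)
# The points lying exactly at the minimum distance play the role of support vectors.
nearest_points = [p for p, d in zip(points, distances) if abs(d - margin) < 1e-12]

print(labels)
print(nearest_points)
```

For this toy data the points (2, 2) and (3, 3) are the closest ones on each side. A trained SVM would
search for the weights and bias that make this minimum distance as large as possible.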

Genetic algorithms (GAs) are adaptive heuristic search algorithms that belong to the larger class of
evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics.
They are an intelligent exploitation of random search, provided with historical data to direct the
search into regions of better performance in the solution space. They are commonly used to generate
high-quality solutions for optimization and search problems.

Genetic algorithms simulate the process of natural selection, in which those species that can adapt to
changes in their environment are able to survive, reproduce, and pass on to the next generation. In
simple words, they simulate “survival of the fittest” among individuals of consecutive generations to
solve a problem. Each generation consists of a population of individuals, and each individual represents
a point in the search space and a possible solution. Each individual is represented as a string of
characters, integers, floats, or bits. This string is analogous to a chromosome.
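
The following is a minimal sketch of a genetic algorithm on a toy problem: evolving a bit string toward
all ones. The problem, the population size, the mutation rate, the use of elitism, and the seed are
assumptions made for illustration and are not part of the original text.

```python
import random


def fitness(individual):
    """Toy objective: the number of 1 bits in the chromosome."""
    return sum(individual)


def evolve(population_size=20, length=16, generations=30, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(population_size)]
    history = []  # best fitness seen in each generation

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))

        # Selection: only the fitter half of the population may reproduce.
        parents = population[: population_size // 2]
        # Elitism: carry the best individual over unchanged.
        children = [parents[0][:]]

        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)          # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)

        population = children

    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best individual is always copied into the next generation, the best fitness never decreases
from one generation to the next, while crossover and mutation keep exploring new points in the search
space.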

Common issues in Machine Learning


Although machine learning is used across many industries and helps organizations make more informed,
data-driven choices that are more effective than classical methodologies, it still has many problems that
cannot be ignored. Here are some common issues that professionals face when building ML skills and
creating an application from scratch.

1. Inadequate Training Data

The major issue that arises when using machine learning algorithms is the lack of both quality and quantity
of data. Although data plays a vital role in machine learning algorithms, many data scientists report
that inadequate, noisy, and unclean data severely hamper them. For example, a simple task may require
thousands of samples, and an advanced task such as speech or image recognition may need millions of
examples. Data quality is also important for the algorithms to work well, yet poor data quality is common
in machine learning applications. Data quality can be affected by factors such as the following:

 Noisy data: It is responsible for inaccurate predictions, which affect decisions as well as accuracy in
classification tasks.
 Incorrect data: It is responsible for faulty programming and faulty results in machine learning
models. Hence, incorrect data may also affect the accuracy of the results.
 Generalizing of output data: Sometimes generalizing output data becomes complex, which results in
comparatively poor future actions.

2. Poor quality of data

As discussed above, data plays a significant role in machine learning, and it must be of good quality as
well. Noisy, incomplete, inaccurate, and unclean data lead to lower accuracy in classification and
low-quality results. Hence, data quality can be considered a major common problem when working with
machine learning algorithms.

3. Non-representative training data

To make sure our trained model generalizes well, we have to ensure that the sample training data is
representative of the new cases to which we need to generalize. The training data must cover all cases that
have already occurred as well as those that are occurring.

If we use non-representative training data in the model, it results in less accurate predictions. A
machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate
decisions. If there is too little training data, there will be sampling noise in the model, and the
training set is called non-representative. The model won't be accurate in its predictions, and it may be
biased against one class or group.

Hence, we should use representative data in training to protect against bias and make accurate
predictions without any drift.

4. Overfitting and Underfitting

Overfitting:

Overfitting is one of the most common issues faced by machine learning engineers and data scientists.
It occurs when a model fits its training data too closely and starts capturing the noise and inaccurate
entries in the training data set along with the underlying pattern. This negatively affects the
performance of the model on new data. A related problem is a skewed training set. Suppose we have
training data consisting of 1000 mangoes, 1000 apples, 1000 bananas, and 5000 papayas. Then there is a
considerable probability that an apple will be identified as a papaya, because the training data set is
heavily biased toward papayas; hence prediction is negatively affected. Overfitting is more likely with
highly flexible, non-linear methods, because they can build unrealistic models of the data. Simpler
linear and parametric algorithms are less prone to overfitting.

Methods to reduce overfitting:

 Increase training data in a dataset.
 Reduce model complexity by selecting a model with fewer parameters
 Ridge Regularization and Lasso Regularization
 Early stopping during the training phase
 Reduce the noise
 Reduce the number of attributes in training data.
 Constraining the model.
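
Of the methods listed above, ridge regularization is easy to show in closed form. The sketch below fits a
single slope with no intercept and adds a penalty on its size. The one-parameter model, the data, and
the penalty value are simplifying assumptions chosen so the arithmetic can be checked by hand.

```python
def fit_ridge_slope(xs, ys, penalty):
    """Slope w that minimizes sum((y - w*x)^2) + penalty * w^2 (no intercept)."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + penalty
    return numerator / denominator


# Made-up data that lies exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

slope_unregularized = fit_ridge_slope(xs, ys, penalty=0.0)   # plain least squares
slope_regularized = fit_ridge_slope(xs, ys, penalty=14.0)    # penalized fit

print(slope_unregularized, slope_regularized)
```

The penalty shrinks the slope toward zero. This constrains the model so that it cannot chase every
fluctuation in the training data, at the cost of some bias when the penalty is too large.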

Underfitting:

Underfitting is just the opposite of overfitting. It often appears when a machine learning model is trained
with too little data; as a result, the model gives incomplete and inaccurate predictions, which ruins its
accuracy.

Underfitting occurs when our model is too simple to capture the underlying structure of the data, like a
pair of pants that is too small. This generally happens when we have limited data in the data set and we
try to fit a linear model to non-linear data. In such scenarios, the model is not complex enough, its
rules are too simple for the data set, and it starts making wrong predictions.

Methods to reduce Underfitting:

 Increase model complexity
 Remove noise from the data
 Train on more and better features
 Reduce the constraints
 Increase the number of epochs to get better results.

5. Monitoring and maintenance

Because every machine learning model must keep producing output that generalizes well, regular
monitoring and maintenance are compulsory. Different results for different actions require changes to the
data; hence editing the code, and resources for monitoring it, also become necessary.

6. Getting bad recommendations

A machine learning model operates in a specific context; when that context changes, the result is bad
recommendations and concept drift in the model. For example, suppose that at one time a customer was
looking for some gadgets, but the customer's requirements have changed over time, and the model still
shows the same recommendations even though the customer's expectations have changed. This is called
data drift. It generally occurs when new data is introduced or the interpretation of the data changes.
We can overcome this by regularly updating and monitoring the data according to expectations.

7. Lack of skilled resources

Although machine learning and artificial intelligence are continuously growing in the market, these
industries are still newer than others. The absence of skilled resources in the form of manpower is
also an issue. Hence, we need people with in-depth knowledge of mathematics, science, and
technology to develop and manage machine learning systems.

8. Customer Segmentation

Customer segmentation is also an important issue when developing a machine learning algorithm: we need to
identify the customers who paid for the recommendations shown by the model and those who did not even
check them. Hence, an algorithm is necessary to recognize customer behavior and trigger a relevant
recommendation for the user based on past experience.

9. Process Complexity of Machine Learning

The machine learning process is very complex, which is another major issue faced by machine learning
engineers and data scientists. Machine learning and artificial intelligence are very new technologies
that are still in an experimental phase and continuously changing over time. Much of the work is trial
and error; hence the probability of error is higher than expected. Further, it also includes analyzing the
data, removing data bias, training on data, applying complex mathematical calculations, and so on, making
the procedure more complicated and quite tedious.

10. Data Bias

Data bias is also a big challenge in machine learning. These errors exist when certain elements of
the dataset are weighted more heavily or given more importance than others. Biased data leads to
inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this error by
determining where the data is actually biased in the dataset and then taking the necessary steps to
reduce it.

Methods to remove Data Bias:

 Research more for customer segmentation.
 Be aware of your general use cases and potential outliers.
 Combine inputs from multiple sources to ensure data diversity.
 Include bias testing in the development process.
 Analyze data regularly and keep tracking errors to resolve them easily.
 Review the collected and annotated data.
 Use multi-pass annotation for tasks such as sentiment analysis, content moderation, and intent recognition.

11. Lack of Explainability

This means that the outputs cannot be easily understood, because the model is built in specific ways to
deliver results for certain conditions. Hence, a lack of explainability is also found in machine learning
algorithms, which reduces their credibility.

12. Slow implementations and results

This issue is also very common in machine learning models. Machine learning models are highly efficient
at producing accurate results, but they are time-consuming. Slow programs, excessive requirements, and
overloaded data mean that it takes more time than expected to produce accurate results. This requires
continuous maintenance and monitoring of the model to deliver accurate results.

13. Irrelevant features

Although machine learning models are intended to give the best possible outcome, if we feed garbage data
as input, then the result will also be garbage. Hence, we should use relevant features in our training
sample. A machine learning model is said to be good if its training data has a good set of features with
few or no irrelevant features.

Comparison Between Data Science and Machine Learning


The table below describes the basic differences between Data Science and ML:

Data Science | Machine Learning
------------ | ----------------
It deals with understanding and finding hidden patterns or useful insights in the data, which helps in making smarter business decisions. | It is a subfield of data science that enables the machine to learn from past data and experiences automatically.
It is used for discovering insights from the data. | It is used for making predictions and classifying the result for new data points.
It is a broad term that includes various steps to create a model for a given problem and deploy the model. | It is used in the data modeling step of data science as a complete process.
A data scientist needs skills in big data tools such as Hadoop, Hive, and Pig, in statistics, and in programming in Python, R, or Scala. | A machine learning engineer needs skills such as computer science fundamentals, programming in Python or R, and statistics and probability concepts.
It can work with raw, structured, and unstructured data. | It mostly requires structured data to work on.
Data scientists spend a lot of time handling the data, cleansing the data, and understanding its patterns. | ML engineers spend a lot of time managing the complexities that occur during the implementation of algorithms and the mathematical concepts behind them.

Regression Analysis in Machine learning

Regression analysis is a statistical method to model the relationship between a dependent (target) variable
and one or more independent (predictor) variables. More specifically, regression analysis helps us to
understand how the value of the dependent variable changes with respect to an independent variable when the
other independent variables are held fixed. It predicts continuous/real values such as temperature, age,
salary, price, etc.

Regression is a supervised learning technique which helps in finding the correlation between variables and
enables us to predict a continuous output variable based on one or more predictor variables. It is
mainly used for prediction, forecasting, time series modeling, and determining the cause-and-effect
relationship between variables.

In regression, we fit a line or curve to the given datapoints, and the machine learning model uses this fit
to make predictions about the data. In simple words, "Regression shows a line or curve that passes as close
as possible to the datapoints on the target-predictor graph, in such a way that the vertical distance between
the datapoints and the regression line is minimum." The distance between the datapoints and the line tells
whether a model has captured a strong relationship or not.

Some examples of regression are:

 Prediction of rain using temperature and other factors
 Determining market trends
 Prediction of road accidents due to rash driving

Types of Regression
There are various types of regression which are used in data science and machine learning. Each type has
its own importance in different scenarios, but at the core, all the regression methods analyze the effect of
the independent variables on the dependent variable. Some important types of regression are given below:

 Linear Regression
 Logistic Regression
 Polynomial Regression
 Support Vector Regression
 Decision Tree Regression
 Random Forest Regression
 Ridge Regression
 Lasso Regression

Linear Regression:

 Linear regression is a statistical regression method which is used for predictive analysis.
 It is one of the simplest regression algorithms and shows the relationship between continuous variables.
 It is used for solving the regression problem in machine learning.
 Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence called linear regression.
 If there is only one input variable (x), then such linear regression is called simple linear regression. And if
there is more than one input variable, then such linear regression is called multiple linear regression.
 The relationship between variables in the linear regression model can be explained using the below image.
Here we are predicting the salary of an employee on the basis of the year of experience.
 Below is the mathematical equation for Linear regression:

Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients
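
As a minimal sketch of how the coefficients in Y = aX + b can be estimated, the snippet below uses the
ordinary least-squares formulas for a single input variable. The function name and the experience/salary
numbers are hypothetical and only serve as an illustration; here a plays the role of the slope and b the
intercept.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) for Y = aX + b that minimize the squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept
    return a, b


# Hypothetical data: years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

a, b = fit_simple_linear_regression(years, salary)
predicted_salary = a * 6 + b   # prediction for 6 years of experience
```

Because the made-up data lies exactly on a straight line, the fitted line reproduces it; real data would
leave some vertical distance between the points and the line.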

Some popular applications of linear regression are:

 Analyzing trends and sales estimates


 Salary forecasting
 Real estate prediction
 Arriving at ETAs in traffic.

Logistic Regression:

 Logistic regression is another supervised learning algorithm which is used to solve the classification
problems. In classification problems, we have dependent variables in a binary or discrete format such as 0
or 1.
 Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or False,
Spam or not spam, etc.
 It is a predictive analysis algorithm which works on the concept of probability.
 Logistic regression is a type of regression, but it differs from the linear regression algorithm in how it
is used.
 Logistic regression uses the sigmoid function, also called the logistic function, which maps any real-valued
input to a value between 0 and 1. This sigmoid function is used to model the data in logistic regression. The
function can be represented as:

f(x) = 1 / (1 + e^(-x))

 f(x) = output between 0 and 1
 x = input to the function
 e = base of the natural logarithm

When we provide the input values (data) to the function, it gives the S-curve as follows:

 It uses the concept of a threshold level: values above the threshold level are rounded up to 1, and values
below the threshold level are rounded down to 0.
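
The sigmoid function and the threshold rule can be sketched as follows. This is only an illustration: the
function names and the default threshold of 0.5 are assumptions, not something fixed by the text.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def classify(x, threshold=0.5):
    """Return 1 if the sigmoid output reaches the threshold, otherwise 0."""
    return 1 if sigmoid(x) >= threshold else 0


probability_at_zero = sigmoid(0.0)   # the midpoint of the S-curve
label_positive = classify(2.0)
label_negative = classify(-2.0)
```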

There are three types of logistic regression:

 Binary(0/1, pass/fail)
 Multinomial(cats, dogs, lions)
 Ordinal(low, medium, high)

What is Bayes Theorem?


Bayes theorem is one of the most popular machine learning concepts. It helps to calculate the probability
of one event occurring, under uncertain knowledge, given that another event has already occurred.

Bayes' theorem can be derived using the product rule and the conditional probability of event X with known
event Y:

 According to the product rule, we can express the joint probability of events X and Y using the probability
of event X with known event Y as follows:

P(X ∩ Y) = P(X|Y) P(Y) {equation 1}

 Further, using the probability of event Y with known event X:

P(X ∩ Y) = P(Y|X) P(X) {equation 2}

Mathematically, Bayes theorem can be expressed by equating the right-hand sides of both equations and
dividing by P(Y). We will get:

P(X|Y) = P(Y|X) P(X) / P(Y)

Here, X and Y are events and P(Y) must be non-zero. The theorem does not require X and Y to be independent
events.

The above equation is called Bayes Rule or Bayes Theorem.

 P(X|Y) is called the posterior, which we need to calculate. It is defined as the updated probability after
considering the evidence.
 P(Y|X) is called the likelihood. It is the probability of the evidence when the hypothesis is true.
 P(X) is called the prior probability, the probability of the hypothesis before considering the evidence.
 P(Y) is called the marginal probability. It is defined as the probability of the evidence under any
consideration.

Hence, Bayes Theorem can be written as:

posterior = likelihood * prior / evidence
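
A small numeric sketch of posterior = likelihood * prior / evidence is given below. The spam-filter numbers
are made up for illustration, and the evidence is expanded over the two cases "hypothesis true" and
"hypothesis false", which is an assumption about how P(Y) is obtained.

```python
def posterior(prior, likelihood, likelihood_if_false):
    """Bayes rule for a yes/no hypothesis X and observed evidence Y."""
    # Evidence P(Y), summed over the hypothesis being true and being false.
    evidence = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of mail is spam, a given word appears in 60% of spam
# and in 5% of non-spam mail.
p_spam_given_word = posterior(prior=0.2, likelihood=0.6, likelihood_if_false=0.05)
```

With these made-up inputs the posterior is much larger than the prior, which shows how the evidence updates
the initial belief.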

Prerequisites for Bayes Theorem


While studying the Bayes theorem, we need to understand a few important concepts. These are as follows:

1. Experiment

An experiment is defined as a planned operation carried out under controlled conditions, such as tossing a
coin, drawing a card, or rolling a die.

2. Sample Space

The results we get during an experiment are called possible outcomes, and the set of all possible
outcomes of an experiment is known as the sample space. For example, if we are rolling a die, the sample
space will be:

S1 = {1, 2, 3, 4, 5, 6}

Similarly, if our experiment is tossing a coin and recording its outcomes, then the sample space will be:

S2 = {Head, Tail}

3. Event

An event is defined as a subset of the sample space of an experiment. It is also called a set of outcomes.
Assume that in our experiment of rolling a die, there are two events A and B such that:

A = Event when an even number is obtained = {2, 4, 6}

B = Event when a number is greater than 4 = {5, 6}

 Probability of the event A, P(A) = Number of favourable outcomes / Total number of possible outcomes
P(A) = 3/6 = 1/2 = 0.5
 Similarly, probability of the event B, P(B) = Number of favourable outcomes / Total number of possible
outcomes
P(B) = 2/6 = 1/3 = 0.333
 Union of events A and B:
A∪B = {2, 4, 5, 6}
 Intersection of events A and B:
A∩B = {6}
 Disjoint events: If the intersection of the events A and B is an empty set, then such events are known as
disjoint events or mutually exclusive events.
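
The die example above can be reproduced with Python sets. This is a small sketch; the helper name
probability is an assumption, and exact fractions are used so that 1/3 is not rounded.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {n for n in sample_space if n % 2 == 0}   # even number obtained
B = {n for n in sample_space if n > 4}        # number greater than 4


def probability(event):
    """Favourable outcomes divided by all possible outcomes."""
    return Fraction(len(event), len(sample_space))


p_a = probability(A)
p_b = probability(B)
union = A | B
intersection = A & B
disjoint = len(intersection) == 0   # A and B share the outcome 6, so they are not disjoint
```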

4. Random Variable:

It is a real-valued function that maps the outcomes in the sample space of an experiment to the real line. A
random variable takes on different values, each with some probability. Strictly speaking, it is neither
random nor a variable; it behaves as a function, which can be discrete, continuous, or a combination of
both.

5. Exhaustive Event:

As the name suggests, a set of events is called exhaustive if at least one of the events must occur whenever
the experiment is performed.

Thus, two events A and B are said to be exhaustive if either A or B definitely occurs. For example, while
tossing a coin, the outcome will be either a Head or a Tail; these two events are also mutually exclusive.

6. Independent Event:

Two events are said to be independent when the occurrence of one event does not affect the occurrence of
the other event. In simple words, we can say that the probabilities of the outcomes of both events do not
depend on one another.

Mathematically, two events A and B are said to be independent if:

P(A ∩ B) = P(AB) = P(A)*P(B)

7. Conditional Probability:

Conditional probability is defined as the probability of an event A, given that another event B has already
occurred (i.e. A conditional B). This is represented by P(A|B) and we can define it as:

P(A|B) = P(A ∩ B) / P(B)

8. Marginal Probability:

Marginal probability is defined as the probability of an event A occurring irrespective of any other event B.
Further, it is considered as the probability of evidence under any consideration.

P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)

Here ~B represents the event that B does not occur.
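
The definitions of independent events, conditional probability, and marginal probability can be checked on
the die events A = {2, 4, 6} and B = {5, 6} used earlier. The snippet below is an illustrative sketch; the
helper names p and conditional are assumptions.

```python
from fractions import Fraction

space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {5, 6}
not_B = space - B


def p(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(space))


def conditional(a, b):
    """P(a|b) = P(a ∩ b) / P(b)."""
    return p(a & b) / p(b)


p_a_given_b = conditional(A, B)
# Independence check: P(A ∩ B) == P(A) * P(B)
independent = p(A & B) == p(A) * p(B)
# Marginal probability: P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)
marginal_a = conditional(A, B) * p(B) + conditional(A, not_B) * p(not_B)
```

For these particular events the product rule for independence happens to hold, and the marginal formula
gives back P(A) = 1/2.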

How to apply Bayes Theorem or Bayes rule in Machine Learning?


Bayes theorem helps us to calculate the single term P(B|A) in terms of P(A|B), P(B), and P(A). This rule is
very helpful in scenarios where we have good estimates of P(A|B), P(B), and P(A) and need to
determine the fourth term.

Naïve Bayes classifier is one of the simplest applications of Bayes theorem; it is used in classification
algorithms to assign data to classes quickly and with reasonable accuracy.

Let's understand the use of Bayes theorem in machine learning with below example.

Suppose we have a vector A with i attributes. It means

A = A1, A2, A3, A4, ..., Ai

Further, we have n classes represented as C1, C2, C3, C4, ..., Cn.

Given these two things, our machine learning classifier has to predict the best possible class for A. So,
with the help of Bayes theorem, we can write it as:

P(Ci|A) = [ P(A|Ci) * P(Ci) ] / P(A)

Here;

P(A) is the class-independent term.

P(A) remains constant across the classes, which means it does not change its value with respect to a change
in class. To maximize P(Ci|A), we have to maximize the value of the term P(A|Ci) * P(Ci).

With n classes on the probability list, let's assume that every class is equally likely to be the right
answer. Considering this factor, we can say that:

P(C1) = P(C2) = P(C3) = P(C4) = ... = P(Cn).

This assumption helps us to reduce the computation cost as well as time, because only P(A|Ci) then has to be
maximized. Naïve Bayes further simplifies the conditional probability by assuming that the attributes are
independent of each other given the class. Hence, we can conclude that:

P(A|C) = P(A1|C) * P(A2|C) * P(A3|C) * ... * P(Ai|C)

Hence, by using Bayes theorem in Machine Learning we can describe the probability of A in terms of the
probabilities of smaller events.
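
The rule "pick the class that maximizes P(A|C) * P(C), with P(A|C) taken as a product over attributes" can
be sketched as a tiny categorical classifier. Everything named here (the functions, the weather-style toy
rows and labels) is hypothetical, and no smoothing is applied, so an attribute value never seen with a class
gives that class a score of zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count classes and, per class, how often each attribute value occurs."""
    class_counts = Counter(labels)
    # value_counts[c][k][v] = rows of class c whose k-th attribute equals v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[label][k][value] += 1
    return class_counts, value_counts


def predict(row, class_counts, value_counts):
    """Return the class with the largest P(C) * P(A1|C) * ... * P(Ai|C)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / total                              # P(C)
        for k, value in enumerate(row):
            score *= value_counts[c][k][value] / count     # P(Ak|C)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "hot")]
labels = ["no", "yes", "yes", "yes", "no"]

class_counts, value_counts = train_naive_bayes(rows, labels)
first = predict(("sunny", "hot"), class_counts, value_counts)
second = predict(("rainy", "mild"), class_counts, value_counts)
```

P(A) is never computed because it is the same for every class and does not change which class wins.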

What is Naïve Bayes Classifier in Machine Learning


The Naïve Bayes classifier is a supervised algorithm which is based on Bayes theorem and used to solve
classification problems. It is one of the simplest and most effective classification algorithms in Machine
Learning, and it enables us to build ML models that make quick predictions. It is a probabilistic classifier,
which means it predicts on the basis of the probability of an object belonging to a class. Some popular
applications of Naïve Bayes are spam filtering, sentiment analysis, and classifying articles.

Advantages of Naïve Bayes Classifier in Machine Learning:

 It is one of the simplest and most effective methods for calculating conditional probabilities and for text
classification problems.
 A Naïve Bayes classifier can perform better than other models when the assumption of independent
predictors holds true.
 It is easier to implement than other models.
 It requires only a small amount of training data to estimate its parameters, which minimizes the training
time.
 It can be used for binary as well as multi-class classification.

Disadvantages of Naïve Bayes Classifier in Machine Learning:

The main disadvantage of the Naïve Bayes classifier is its assumption of independent predictors: it
implicitly assumes that all attributes are independent or unrelated, but in real life it is rarely feasible
to get mutually independent attributes.

Bayesian Belief Network in artificial intelligence

A Bayesian belief network is a key computer technology for dealing with probabilistic events and for solving
problems which have uncertainty. We can define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of variables and their
conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian model.

Bayesian networks are probabilistic, because these networks are built from a probability distribution, and
also use probability theory for prediction and anomaly detection.

Real world applications are probabilistic in nature, and to represent the relationship between multiple events,
we need a Bayesian network. It can also be used in various tasks including prediction, anomaly detection,
diagnostics, automated insight, reasoning, time series prediction, and decision making under
uncertainty.

A Bayesian network can be used for building models from data and expert opinions, and it consists of two
parts:

 Directed Acyclic Graph
 Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision problems under uncertain
knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and arcs (directed links), where:

 Each node corresponds to a random variable, and a variable can be continuous or discrete.
 Arcs or directed arrows represent the causal relationships or conditional probabilities between random
variables. These directed links or arrows connect pairs of nodes in the graph.
A link represents that one node directly influences the other node; if there is no directed link between
two nodes, they do not directly influence each other.
o In the above diagram, A, B, C, and D are random variables represented by the nodes of the
network graph.
o If we are considering node B, which is connected with node A by a directed arrow, then node A is
called the parent of node B.
o Node C is independent of node A.

Note: The Bayesian network graph does not contain any cycles. Hence, it is known as a directed
acyclic graph or DAG.

The Bayesian network has mainly two components:

 Causal Component
 Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)), which
determines the effect of the parents on that node.

Bayesian network is based on Joint probability distribution and conditional probability. So let's first
understand the joint probability distribution:
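
In a Bayesian network, the joint probability of all variables is the product of each node's conditional
probability given its parents, P(X1, ..., Xn) = P(X1 | Parent(X1)) * ... * P(Xn | Parent(Xn)). The sketch
below illustrates this on a hypothetical two-node network Rain -> WetGrass; the variable names and all
probabilities are made up and do not come from the diagram discussed above.

```python
# Hypothetical network: Rain -> WetGrass (Rain is the parent of WetGrass).
p_rain = {True: 0.2, False: 0.8}
# p_wet_given_rain[rain][wet] = P(WetGrass = wet | Rain = rain)
p_wet_given_rain = {
    True: {True: 0.9, False: 0.1},
    False: {True: 0.1, False: 0.9},
}


def joint(rain, wet):
    """P(Rain = rain, WetGrass = wet) = P(rain) * P(wet | rain)."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]


# The joint distribution sums to 1 over all assignments.
total = sum(joint(r, w) for r in (True, False) for w in (True, False))
# Marginal probability of wet grass, summing out Rain.
p_wet = joint(True, True) + joint(False, True)
# Conditional probability of rain given wet grass, from the joint distribution.
p_rain_given_wet = joint(True, True) / p_wet
```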

The EM algorithm, or Expectation-Maximization algorithm, is an iterative method for fitting latent variable
models. It was proposed by Arthur Dempster, Nan Laird, and Donald Rubin in 1977.

A latent variable model comprises observable variables and unobservable variables. Observed variables are
those that can be measured whereas unobserved (latent/hidden) variables are inferred from observed
variables.

As explained by the trio, the EM algorithm can be used to find local maximum likelihood (MLE) estimates or
maximum a posteriori (MAP) estimates of the parameters of a statistical model that has latent variables
(unobservable variables that need to be inferred from observable variables). It is used to predict these
values or to fill in data that is missing or incomplete, provided that you know the general form of the
probability distribution associated with these latent variables.

To put it simply, the general principle behind the EM algorithm in machine learning is to use the observed
data to estimate the values of the unobserved (latent) variables, and then to use these estimates to update
the model. This is repeated until the values converge.

The algorithm is a rather powerful tool in machine learning and underlies many unsupervised algorithms. The
k-means clustering algorithm, for example, can be viewed as a variant of EM with hard cluster assignments.

The Expectation-Maximization Algorithm


Let’s explore the mechanism of the Expectation-Maximization algorithm in Machine Learning:

 Step 1: We have a set of missing or incomplete data and another set of starting parameters. We assume that
observed data or the initial values of the parameters are generated from a specific model.
 Step 2: Based on the observable value in the observable instances of the available data, we will predict or
estimate the values in the unobservable instances of the data or the missing data. This is known as the
Expectation step (E – step).
 Step 3: Using the data generated from the E – step, we will update the parameters and complete the data
set. This is known as the Maximization step (M – step) which is used to update the hypothesis.

Step 2 and Step 3 are repeated until convergence; that is, as long as the values have not converged, we
repeat the E – step and the M – step.
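
The steps above can be sketched for one common case, a mixture of two one-dimensional Gaussians, where the
latent variable is "which component generated each point". The function names, the toy data, the starting
means, the tolerance, and the variance floor are all assumptions made for this illustration; it is not a
general-purpose implementation and assumes neither component ends up empty.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, mu1, mu2, tol=1e-8, max_iter=200):
    """Fit a two-component 1-D Gaussian mixture, starting from the means mu1 and mu2."""
    var1 = var2 = 1.0   # starting variances (assumed)
    w1 = 0.5            # starting mixing weight of component 1 (assumed)
    for _ in range(max_iter):
        # E-step: probability that each point belongs to component 1.
        resp = []
        for x in data:
            p1 = w1 * normal_pdf(x, mu1, var1)
            p2 = (1.0 - w1) * normal_pdf(x, mu2, var2)
            resp.append(p1 / (p1 + p2))
        # M-step: update the parameters using the soft assignments.
        n1 = sum(resp)
        n2 = len(data) - n1
        new_mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        new_mu2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        var1 = max(sum(r * (x - new_mu1) ** 2 for r, x in zip(resp, data)) / n1, 1e-6)
        var2 = max(sum((1.0 - r) * (x - new_mu2) ** 2 for r, x in zip(resp, data)) / n2, 1e-6)
        w1 = n1 / len(data)
        converged = abs(new_mu1 - mu1) < tol and abs(new_mu2 - mu2) < tol
        mu1, mu2 = new_mu1, new_mu2
        if converged:   # stop once the means no longer change
            break
    return mu1, mu2, var1, var2, w1


# Hypothetical data drawn around two centres, near 1 and near 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mu1, mu2, var1, var2, w1 = em_two_gaussians(data, mu1=0.0, mu2=6.0)
```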

Advantages and Disadvantages of the EM Algorithm


Advantages of the EM Algorithm

1. Every iteration in the EM algorithm is guaranteed not to decrease the likelihood.
2. The Expectation step and the Maximization step are rather easy to implement, and the solution for the
latter mostly exists in closed form.

Disadvantages of the EM Algorithm

1. The Expectation-Maximization algorithm takes both forward and backward probabilities into account. This
is in contrast with numerical optimization, which takes only the forward probabilities into account.
2. EM algorithm convergence is very slow, and it converges only to a local optimum.

Applications of the EM Algorithm


The latent variable model has plenty of real-world applications in machine learning.

1. It is used in unsupervised data clustering and psychometric analysis.
2. It is also used to estimate the parameters of Gaussian densities, for example in Gaussian mixture models.
3. The EM algorithm finds extensive use in estimating the Hidden Markov Model (HMM) parameters and other
mixed models.
4. EM algorithm finds plenty of use in natural language processing (NLP), computer vision, and quantitative
genetics.
5. Other important applications of the EM algorithm include image reconstruction in the field of medicine and
structural engineering.

Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used
for Classification as well as Regression problems. However, primarily, it is used for Classification problems
in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put a new data point in the correct category in the
future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called
support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram
in which there are two different categories that are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we
see a strange cat that also has some features of dogs. If we want a model that can accurately identify
whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train
our model with lots of images of cats and dogs so that it can learn about the different features of cats and
dogs, and then we test it with this strange creature. The SVM creates a decision boundary between these
two classes (cat and dog) based on the extreme cases (support vectors) of cats and dogs. On the basis of the
support vectors, it will classify the creature as a cat. Consider the below diagram:

SVM algorithm can be used for face detection, image classification, text categorization, etc.

Types of SVM
SVM can be of two types:

 Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two
classes by using a single straight line, then such data is termed linearly separable data, and the classifier
used is called a Linear SVM classifier.
 Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified
by using a straight line, then such data is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find out the best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2
features (as shown in image), then the hyperplane will be a straight line, and if there are 3 features, then
the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, which means the maximum distance between
the hyperplane and the nearest data points of each class.

Support Vectors:

The data points or vectors that are the closest to the hyperplane and which affect the position of the
hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support
vectors.

How does SVM work?


Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that
has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can
classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:

Since it is a 2-d space, we can easily separate these two classes by just using a straight line. But there
can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary is called a
hyperplane. The SVM algorithm finds the points from both classes that are closest to the boundary. These
points are called support vectors. The distance between the support vectors and the hyperplane is called the
margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the
optimal hyperplane.
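
To make the idea of the margin concrete, the sketch below measures the distance from the closest point to a
candidate line w1*x1 + w2*x2 + b = 0 and compares two candidate lines. It does not train an SVM; the points,
the two candidate lines, and the function names are hypothetical, and both lines are assumed to already
separate the two classes.

```python
import math


def distance_to_line(point, w, b):
    """Perpendicular distance from a 2-d point to the line w[0]*x1 + w[1]*x2 + b = 0."""
    return abs(w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])


def margin(points, w, b):
    """Distance from the line to the closest data point."""
    return min(distance_to_line(p, w, b) for p in points)


# Hypothetical (x1, x2) points for the two tags.
green = [(1.0, 1.0), (2.0, 1.0)]
blue = [(4.0, 1.0), (5.0, 1.0)]
points = green + blue

margin_a = margin(points, w=(1.0, 0.0), b=-3.0)   # vertical line x1 = 3
margin_b = margin(points, w=(1.0, 0.0), b=-2.5)   # vertical line x1 = 2.5
better = "x1 = 3" if margin_a > margin_b else "x1 = 2.5"
```

The line with the larger margin sits midway between the closest points of the two classes, which is the
boundary an SVM would prefer among these two candidates.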

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

z = x^2 + y^2

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Since we are in 3-d space, the boundary looks like a plane parallel to the x-y plane. If we convert it back
to 2-d space with z = 1, then it will become as:

Hence we get a circle of radius 1 in the case of non-linear data.
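
The effect of the extra dimension z = x^2 + y^2 can be sketched as follows. The points and names are
hypothetical, and the boundary z = 1 is taken from the example above rather than learned from data.

```python
def lift(point):
    """Map a 2-d point (x, y) to 3-d by adding z = x^2 + y^2."""
    x, y = point
    return (x, y, x ** 2 + y ** 2)


def classify(point, z_boundary=1.0):
    """In the lifted space a flat plane z = z_boundary separates the two groups."""
    return "inner" if lift(point)[2] < z_boundary else "outer"


# Hypothetical data: one class near the origin, the other surrounding it.
inner_points = [(0.1, 0.2), (-0.3, 0.4), (0.5, -0.5)]
outer_points = [(1.5, 0.0), (-1.2, 1.0), (0.0, -2.0)]

inner_labels = [classify(p) for p in inner_points]
outer_labels = [classify(p) for p in outer_points]
```

No straight line in the x-y plane separates these two groups, but after lifting, the single threshold on z
does; projected back to 2-d, that threshold is the circle of radius 1.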

Kernels play a vital role in classification and are used to analyze patterns in the given dataset. They are
very helpful in solving a non-linear problem by using a linear classifier.

Popular SVM Kernel Functions


Linear Kernel

It is the most basic type of kernel, usually one dimensional in nature. It proves to be the best function when
there are lots of features. The linear kernel is mostly preferred for text-classification problems as most of
these kinds of classification problems can be linearly separated.

Linear kernel functions are faster than other functions.

Linear Kernel Formula

F(x, xj) = sum(x * xj)

Here, x and xj represent the data points you’re trying to classify, and sum(x * xj) is their dot product.

Polynomial Kernel

It is a more generalized representation of the linear kernel. It is often less preferred than other kernel
functions, as it tends to be less efficient and accurate.

Polynomial Kernel Formula

F(x, xj) = (x.xj+1)^d

Here ‘.’ shows the dot product of both the values, and d denotes the degree.
F(x, xj) is the kernel value used to build the decision boundary that separates the given classes.

Gaussian Radial Basis Function (RBF)

It is one of the most preferred and used kernel functions in SVM. It is usually chosen for non-linear data.
It helps to make a proper separation when there is no prior knowledge of the data.

Gaussian Radial Basis Formula

F(x, xj) = exp(-gamma * ||x - xj||^2)

The value of gamma is typically chosen between 0 and 1, and you have to provide it manually in the code. A
commonly used value for gamma is 0.1.
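
The three kernel formulas above can be written directly as small functions. This is a sketch with assumed
function names; the default degree d = 2 is an arbitrary choice for the example, and gamma = 0.1 follows the
value mentioned in the text.

```python
import math


def linear_kernel(x, xj):
    """Dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, xj))


def polynomial_kernel(x, xj, d=2):
    """(x . xj + 1)^d, where d is the degree."""
    return (linear_kernel(x, xj) + 1) ** d


def rbf_kernel(x, xj, gamma=0.1):
    """exp(-gamma * ||x - xj||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, xj))
    return math.exp(-gamma * squared_distance)


x = (1.0, 2.0)
xj = (3.0, 4.0)

k_linear = linear_kernel(x, xj)
k_poly = polynomial_kernel(x, xj)
k_rbf = rbf_kernel(x, xj)
k_rbf_same = rbf_kernel(x, x)   # identical points give the maximum value
```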

Advantages of SVM

 It works well on a dataset having many features.
 It provides a clear margin of separation.
 It is very effective for datasets where the number of features is greater than the number of data points.
 You can specify different kernel functions to make a proper decision boundary.

Disadvantages of SVM

It requires very high training time, hence not recom


It is very sensitive to outliers.

Introduction to Decision Tree


In general, decision tree analysis is a predictive modelling tool that can be applied across many areas.
Decision trees can be constructed by an algorithmic approach that can split the dataset in different ways
based on different conditions. Decision trees are among the most powerful algorithms that fall under the
category of supervised algorithms.

They can be used for both classification and regression tasks. The two main entities of a tree are decision
nodes, where the data is split, and leaves, where we get the outcome. An example of a binary tree for
predicting whether a person is fit or unfit, given information like age, eating habits and exercise habits,
is given below −

In the above decision tree, the questions are decision nodes and the final outcomes are leaves. We have the
following two types of decision trees −

 Classification decision trees − In this kind of decision tree, the decision variable is categorical. The
above decision tree is an example of a classification decision tree.
 Regression decision trees − In this kind of decision tree, the decision variable is continuous.

Implementing Decision Tree Algorithm


Gini Index

It is the name of the cost function that is used to evaluate the binary splits in the dataset, and it works
with a categorical target variable such as “Success” or “Failure”.

The higher the value of the score p^2+q^2, the higher the homogeneity of a node. Expressed as an impurity,
1 − (p^2+q^2), a perfect Gini index value is 0 and the worst is 0.5 (for a 2 class problem). The Gini index
for a split can be calculated with the help of the following steps −

 First, calculate the Gini score for the sub-nodes by using the formula p^2+q^2, which is the sum of the
squares of the probabilities of success and failure.
 Next, calculate the Gini index for the split using the weighted Gini score of each node of that split.

The Classification and Regression Tree (CART) algorithm uses the Gini method to generate binary splits.
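
The two steps can be sketched as follows, using the impurity form 1 − (p^2+q^2) so that 0 is a perfect
value and 0.5 is the worst value for two classes. The function names and the "S"/"F" labels are assumptions
made for this illustration.

```python
def gini_impurity(labels):
    """1 minus the sum of squared class proportions in one sub-node."""
    n = len(labels)
    if n == 0:
        return 0.0
    score = sum((labels.count(c) / n) ** 2 for c in set(labels))
    return 1.0 - score


def gini_for_split(groups):
    """Weighted Gini impurity of all sub-nodes produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini_impurity(g) for g in groups)


# "S" = Success, "F" = Failure
pure_split = gini_for_split([["S", "S"], ["F", "F"]])    # each sub-node holds one class
mixed_split = gini_for_split([["S", "F"], ["S", "F"]])   # each sub-node is half and half
```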

Split Creation

A split consists of an attribute in the dataset and a value of that attribute. We can create a split in the
dataset with the help of the following three parts −

 Part1: Calculating Gini Score − We have just discussed this part in the previous section.
 Part2: Splitting a dataset − It may be defined as separating a dataset into two lists of rows, given the
index of an attribute and a split value of that attribute. After getting the two groups, right and left,
from the dataset, we can calculate the value of the split by using the Gini score calculated in the first
part. The split value decides in which group each row will reside.
 Part3: Evaluating all splits − The next part, after finding the Gini score and splitting the dataset, is
the evaluation of all splits. For this purpose, first, we must check every value associated with each
attribute as a candidate split. Then we need to find the best possible split by evaluating the cost of the
split. The best split will be used as a node in the decision tree.
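
The three parts can be sketched together as below. The layout of a row (feature values followed by the class
label in the last position), the "smaller than the split value goes left" rule, the function names, and the
toy dataset are all assumptions made for this example.

```python
def gini_index(groups, classes):
    """Part 1: weighted Gini impurity of the groups produced by a split."""
    total = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if not group:
            continue
        labels = [row[-1] for row in group]
        score = sum((labels.count(c) / len(group)) ** 2 for c in classes)
        gini += (1.0 - score) * len(group) / total
    return gini


def split_dataset(index, value, dataset):
    """Part 2: rows with attribute < value go left, the others go right."""
    left = [row for row in dataset if row[index] < value]
    right = [row for row in dataset if row[index] >= value]
    return left, right


def get_best_split(dataset):
    """Part 3: try every value of every attribute and keep the cheapest split."""
    classes = sorted({row[-1] for row in dataset})
    best = None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = split_dataset(index, row[index], dataset)
            cost = gini_index(groups, classes)
            if best is None or cost < best["gini"]:
                best = {"index": index, "value": row[index],
                        "gini": cost, "groups": groups}
    return best


# Hypothetical dataset: one feature, class label in the last column.
dataset = [[2.0, 0], [3.0, 0], [7.0, 1], [8.0, 1]]
best = get_best_split(dataset)
```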

Building a Tree
As we know, a tree has a root node and terminal nodes. After creating the root node, we can build the tree
by following two parts −

Part1: Terminal node creation

While creating terminal nodes of a decision tree, one important point is to decide when to stop growing the
tree or creating further terminal nodes. It can be done by using two criteria, namely maximum tree depth and
minimum node records, as follows −

 Maximum Tree Depth − As the name suggests, this is the maximum number of levels of nodes in a tree
after the root node. We must stop splitting once a branch of the tree has reached the maximum depth and
turn its nodes into terminal nodes.
 Minimum Node Records − It may be defined as the minimum number of training patterns that a
given node is responsible for. We must stop splitting a node once it holds this minimum number of
records or fewer.

A terminal node is used to make a final prediction.

Part2: Recursive Splitting

Now that we understand when to create terminal nodes, we can start building our tree. Recursive
splitting is a method to build the tree. In this method, once a node is created, we can create the child nodes
(nodes added to an existing node) recursively on each group of data, generated by splitting the dataset, by
calling the same function again and again.

Prediction

After building a decision tree, we need to make predictions with it. Basically, prediction involves
navigating the decision tree with the specifically provided row of data.

We can make a prediction with the help of a recursive function, as we did above. The same prediction routine
is called again with the left or the right child node.
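
Recursive splitting with the two stopping criteria, followed by prediction, can be sketched as below. The
row layout (class label in the last position), the dictionary-based node representation, the function names,
and the toy dataset are all assumptions made for this illustration.

```python
def gini_index(groups, classes):
    """Weighted Gini impurity of the groups produced by a split."""
    total = sum(len(g) for g in groups)
    gini = 0.0
    for g in groups:
        if g:
            labels = [row[-1] for row in g]
            score = sum((labels.count(c) / len(g)) ** 2 for c in classes)
            gini += (1.0 - score) * len(g) / total
    return gini


def get_best_split(dataset):
    """Evaluate every attribute value as a candidate split and keep the cheapest."""
    classes = sorted({row[-1] for row in dataset})
    best = None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            left = [r for r in dataset if r[index] < row[index]]
            right = [r for r in dataset if r[index] >= row[index]]
            cost = gini_index((left, right), classes)
            if best is None or cost < best["gini"]:
                best = {"index": index, "value": row[index],
                        "gini": cost, "groups": (left, right)}
    return best


def to_terminal(group):
    """Terminal node: the most common class label in the group."""
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)


def split_node(node, max_depth, min_size, depth):
    """Create child nodes recursively, honouring both stopping criteria."""
    left, right = node.pop("groups")
    if not left or not right:                      # nothing was separated
        node["left"] = node["right"] = to_terminal(left + right)
        return
    if depth >= max_depth:                         # maximum tree depth reached
        node["left"], node["right"] = to_terminal(left), to_terminal(right)
        return
    for side, group in (("left", left), ("right", right)):
        if len(group) <= min_size:                 # minimum node records reached
            node[side] = to_terminal(group)
        else:
            node[side] = get_best_split(group)
            split_node(node[side], max_depth, min_size, depth + 1)


def build_tree(dataset, max_depth, min_size):
    root = get_best_split(dataset)
    split_node(root, max_depth, min_size, 1)
    return root


def predict(node, row):
    """Walk the tree with a row until a terminal value is reached."""
    child = node["left"] if row[node["index"]] < node["value"] else node["right"]
    return predict(child, row) if isinstance(child, dict) else child


dataset = [[2.0, 0], [3.0, 0], [7.0, 1], [8.0, 1]]   # hypothetical rows
tree = build_tree(dataset, max_depth=2, min_size=1)
low = predict(tree, [2.5, None])
high = predict(tree, [7.5, None])
```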

Assumptions

The following are some of the assumptions we make while creating decision tree −

 While preparing decision trees, the whole training set is taken as the root node.
 The decision tree classifier prefers the feature values to be categorical. If you want to use
continuous values, they must be discretized prior to model building.
 Based on the attribute’s values, the records are recursively distributed.
 A statistical approach will be used to place attributes at any node position, i.e. as root node or internal
node.

Strengths and Weaknesses of Decision Tree approach


The strengths of decision tree methods are:
 Decision trees are able to generate understandable rules.
 Decision trees perform classification without requiring much computation.
 Decision trees are able to handle both continuous and categorical variables.
 Decision trees provide a clear indication of which fields are most important for prediction or
classification.

The weaknesses of decision tree methods are:

 Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a
continuous attribute.
 Decision trees are prone to errors in classification problems with many classes and a relatively small
number of training examples.
 Decision trees can be computationally expensive to train. At each node, each candidate splitting field
must be sorted before its best split can be found. In some algorithms, combinations of fields are used and
a search must be made for optimal combining weights. Pruning algorithms can also be expensive since many
candidate sub-trees must be formed and compared.

Practical issues in learning decision trees:

Determining how deep to grow the decision tree, handling continuous attributes, choosing an appropriate
attribute selection measure, handling training data with missing attribute values, handling attributes with
different costs, and improving computational efficiency are all practical issues in learning decision trees.

Let’s have a brief look at some of them.

Overfitting the Data:

A machine learning model is considered good if it generalizes appropriately to new input data from the
problem domain, and not only to the data it was trained on.

The basic algorithm grows each branch of the tree just deep enough to classify the training instances
perfectly.

This can cause problems when there is noise in the data, or when the number of training instances is too
small to provide a representative sample of the underlying target function.

In either case, this basic technique may yield trees that overfit the training samples.

The formal definition of overfitting is, “Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the
training data if another hypothesis h’ ∈ H exists, with h having less error than h’ over the training examples
but h’ having smaller error than h over the full distribution of cases.”

The horizontal axis of this graphic shows the total number of nodes in the decision tree as the tree is built.
The accuracy of the tree’s predictions is indicated by the vertical axis.

The solid line depicts the decision tree’s accuracy over the training instances, whereas the broken line
depicts accuracy over a separate set of test cases not included in the training set.

The tree’s accuracy over the training instances increases monotonically as the tree grows. The accuracy
measured over the independent test cases, on the other hand, increases at first and then falls.

As can be observed, once the tree size exceeds about 25 nodes, additional elaboration reduces the tree’s
accuracy on the test cases while still increasing it on the training examples.

What is Underfitting:

When a machine learning model fails to capture the underlying trend of the data, it is said to be
underfitting. Underfitting ruins the model’s accuracy on the training data as well as on new data.

It indicates that the model or method is too simple to fit the data adequately. Underfitting can usually be
reduced by increasing the capacity of the model (for example, by allowing a deeper tree) or by adding more
informative features.

Both overfitting and underfitting become harder to avoid when the training examples contain errors or noise.

What is Noise?

Real-world data contains noise, which is unnecessary or meaningless data that may dramatically impair
data analysis tasks such as classification, clustering, and association analysis.

Even when the training data is noise-free, overfitting can occur, especially when only a small number of
samples are associated with leaf nodes.

In this scenario, coincidental regularities are possible, in which some attribute, despite being unrelated to
the actual target function, happens to partition the cases very well.

There is a risk of overfitting whenever such accidental regularities emerge.

What can we do to avoid overfitting? Here are two frequently used heuristics:

 Stop growing the tree early, before it fits every training example.
 Fit all of the training instances first, and then prune the resulting tree.

Accordingly, the numerous methods for preventing overfitting in decision tree learning may be divided into
two categories:

 Techniques that stop growing the tree before it reaches the point where it perfectly classifies the training
data.
 Techniques that allow the tree to overfit the data and then post-prune the tree.

Although the first strategy appears to be more straightforward, the second approach of post-pruning overfit
trees has been found to be more effective in practice.

Criteria used to determine the correct final tree size:

 Use a separate set of examples, distinct from the training examples, to assess the usefulness of
post-pruning nodes from the tree.
 Use all available data for training, but apply a statistical test to estimate whether expanding (or pruning)
a specific node is likely to produce an improvement beyond the training set. For example, a chi-square test
can be used to estimate whether expanding a node would improve performance over the full instance
distribution or only on the current sample of training data.
 Use an explicit measure of the complexity of encoding the training samples and the decision tree, and halt
the growth of the tree when this encoding size is minimized. This method is based on a heuristic called the
Minimum Description Length principle.

ID3 in brief

ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly)
dichotomizes (divides) features into two or more groups at each step.

Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree. In simple words,
the top-down approach means that we start building the tree from the top, and the greedy approach means
that at each iteration we select the best feature at the present moment to create a node.

ID3 is generally used only for classification problems with nominal features.

Metrics in ID3

As mentioned previously, the ID3 algorithm selects the best feature at each step while building a decision
tree.

How does ID3 select the best feature? ID3 uses Information Gain, or just Gain, to find the best feature.

Information Gain calculates the reduction in the entropy and measures how well a given feature separates or
classifies the target classes. The feature with the highest Information Gain is selected as the best one.

In simple words, entropy is the measure of disorder, and the entropy of a dataset is the measure of disorder
in the target feature of the dataset.

In the case of binary classification (where the target column has only two classes), entropy is 0 if all
values in the target column are homogeneous (identical) and 1 if the target column has an equal number of
values for both classes.

We denote our dataset as S. Entropy is calculated as:

Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n

where,
n is the total number of classes in the target column (in our case n = 2, i.e., YES and NO)
pᵢ is the probability of class ‘i’, or the ratio of “number of rows with class i in the target column” to the
“total number of rows” in the dataset.

Information Gain for a feature column A is calculated as:

IG(S, A) = Entropy(S) - ∑((|Sᵥ| / |S|) * Entropy(Sᵥ)) ; summed over every value v of A

where Sᵥ is the set of rows in S for which the feature column A has value v, |Sᵥ| is the number of rows in Sᵥ
and likewise |S| is the number of rows in S.
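
As a small illustration of the two formulas, the sketch below computes entropy and Information Gain on a
made-up four-row dataset. The column names ("outlook", "windy", "play") and the rows themselves are
assumptions chosen for this example only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """IG(S, A) = Entropy(S) - sum(|S_v| / |S| * Entropy(S_v)) over values v of A."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for v in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Toy data (hypothetical): "outlook" separates the classes perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
]

print(entropy([r["play"] for r in rows]))
print(information_gain(rows, "outlook", "play"))
print(information_gain(rows, "windy", "play"))
```

With two YES rows and two NO rows the entropy of the target is 1; splitting on "outlook" removes all of it,
while splitting on "windy" removes none, so ID3 would pick "outlook" here.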

ID3 Steps

1. Calculate the Information Gain of each feature.
2. If the rows do not all belong to the same class, split the dataset S into subsets using the feature for
which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information Gain.
4. If all rows belong to the same class, make the current node a leaf node with the class as its label.
5. Repeat on each subset with the remaining features until we run out of features, or until every branch of
the decision tree ends in a leaf node.
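
The steps above can be sketched as a short recursive function. This is a minimal illustration, not a full
ID3 implementation: the nested-dict tree format, the toy rows, and the majority-class leaf used when the
features run out are all assumptions made for this example. The entropy and gain helpers are restated so
that the block is self-contained.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain(rows, feature, target):
    remainder = 0.0
    for v in {r[feature] for r in rows}:
        sub = [r[target] for r in rows if r[feature] == v]
        remainder += len(sub) / len(rows) * entropy(sub)
    return entropy([r[target] for r in rows]) - remainder


def id3(rows, features, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]  # step 4: all rows share one class, so make a leaf
    if not features:
        # Assumption: when no features remain, fall back to the majority class.
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(rows, f, target))  # steps 1 and 3
    rest = [f for f in features if f != best]
    # Step 2: split on the best feature; step 5: recurse on each subset.
    return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                   for v in sorted({r[best] for r in rows})}}


rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]
tree = id3(rows, ["outlook", "windy"], "play")
print(tree)
```

In this toy dataset both features have the same gain at the root, and max() keeps the first one listed, so
the order of the feature list decides ties.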

K-Nearest Neighbor (KNN) Algorithm for Machine Learning

 K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on the Supervised Learning
technique.
 The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new
case into the category whose available cases it most resembles.
 The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This
means that when new data appears, it can easily be classified into a well-suited category by using the K-NN
algorithm.
 The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for
Classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
 It is also called a lazy learner algorithm because it does not learn from the training set immediately;
instead it stores the dataset, and at the time of classification it performs an action on the dataset.
 In other words, at the training phase the KNN algorithm just stores the dataset, and when it gets new data
it classifies that data into the category that is most similar to the new data.
 Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to
know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a
similarity measure. Our KNN model will compare the features of the new image with those of the cat and dog
images, and based on the most similar features it will put the image in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which
of these categories does this data point lie? To solve this type of problem, we need a K-NN algorithm.
With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the
below diagram:

How does K-NN work?


The working of K-NN can be explained on the basis of the below algorithm:

 Step-1: Select the number K of neighbors.
 Step-2: Calculate the Euclidean distance between the new data point and each training data point.
 Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
 Step-4: Among these K neighbors, count the number of data points in each category.
 Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
 Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the below image:

 Firstly, we will choose the number of neighbors; here we choose k = 5.
 Next, we will calculate the Euclidean distance between the new data point and the existing data points. The
Euclidean distance is the distance between two points, which we have already studied in geometry. For two
points (x₁, y₁) and (x₂, y₂) it can be calculated as:

d = √((x₂ − x₁)² + (y₂ − y₁)²)

 By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A
and two nearest neighbors in category B. Consider the below image:
 As we can see, 3 of the 5 nearest neighbors are from category A, hence this new data point is assigned to
category A.
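
The procedure can be sketched in a few lines. The training points and the query point below are made up
for illustration, chosen so that, as in the example, three of the five nearest neighbors belong to category A
and two to category B.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_classify(train, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data: (point, category).
train = [
    ((2.5, 2.5), "A"), ((2.0, 2.0), "A"), ((1.8, 3.0), "A"), ((0.0, 0.0), "A"),
    ((4.0, 3.5), "B"), ((3.5, 4.5), "B"), ((6.0, 6.0), "B"), ((7.0, 5.0), "B"),
]
query = (3.0, 3.0)

nearest = sorted(train, key=lambda item: dist(item[0], query))[:5]
print(Counter(label for _, label in nearest))
print(knn_classify(train, query, k=5))
```

Note that every prediction sorts the whole training set by distance, which is where the computation cost of
K-NN comes from.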

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:

 There is no particular way to determine the best value for "K", so we need to try several values and keep
the one that performs best. K = 5 is a commonly used starting value.
 A very low value for K, such as K=1 or K=2, can be noisy and makes the model sensitive to outliers.
 Larger values for K smooth out the effect of noise, but they increase the computation and can blur the
boundaries between categories.

Advantages of KNN Algorithm:

 It is simple to implement.
 It is robust to noisy training data, provided K is not too small.
 It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

 The value of K always needs to be determined, which may be complex at times.
 The computation cost is high, because the distance from the new data point to every training sample has to
be calculated.

Locally Weighted Linear Regression:

Locally weighted linear regression is a non-parametric algorithm, that is, the model does not learn a fixed set
of parameters as is done in ordinary linear regression. Rather, the parameters θ are computed individually for
each query point x. While computing θ, a higher “preference” is given to the points in the training set lying
in the vicinity of x than to the points lying far away from x.

The modified cost function is:

J(θ) = ∑ w⁽ⁱ⁾ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)² ; i = 1 to m

where w⁽ⁱ⁾ is a non-negative “weight” associated with training point x⁽ⁱ⁾, and m is the number of training
points.

For x⁽ⁱ⁾s lying closer to the query point x, the value of w⁽ⁱ⁾ is large, while for x⁽ⁱ⁾s lying far away
from x the value of w⁽ⁱ⁾ is small.

A typical choice of w⁽ⁱ⁾ is:

w⁽ⁱ⁾ = exp(−(x⁽ⁱ⁾ − x)ᵀ(x⁽ⁱ⁾ − x) / (2τ²))

where τ is called the bandwidth parameter and controls the rate at which w⁽ⁱ⁾ falls with distance from x.

Clearly, if |x⁽ⁱ⁾ − x| is small, w⁽ⁱ⁾ is close to 1, and if |x⁽ⁱ⁾ − x| is large, w⁽ⁱ⁾ is close to 0.

Thus, the training-set points lying closer to the query point x contribute more to the cost J(θ) than the
points lying far away from x.

For example:

Consider a query point x = 5.0 and let x⁽¹⁾ and x⁽²⁾ be two points in the training set such that x⁽¹⁾ = 4.9
and x⁽²⁾ = 3.0.

Using the formula for w⁽ⁱ⁾ with τ = 0.5:

w⁽¹⁾ = exp(−(4.9 − 5.0)² / (2 × 0.5²)) = exp(−0.02) ≈ 0.9802
w⁽²⁾ = exp(−(3.0 − 5.0)² / (2 × 0.5²)) = exp(−8) ≈ 0.000335

Thus, the weights fall exponentially as the squared distance between x and x⁽ⁱ⁾ increases, and so does the
contribution of the prediction error for x⁽ⁱ⁾ to the cost.

Consequently, while computing θ, we focus more on reducing (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)² for the points lying
closer to the query point (having a larger value of w⁽ⁱ⁾).

Steps involved in locally weighted linear regression are:

Compute θ to minimize the cost J(θ).

Predict Output: for the given query point x, return θᵀx.
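
A minimal sketch of these two steps for one-dimensional inputs is given below. It assumes a Gaussian weight
with bandwidth tau and a straight-line model with an intercept (theta0 + theta1 * x); the function names and
the toy data are hypothetical and chosen only for illustration.

```python
from math import exp


def weight(xi, x, tau):
    """Gaussian weight of training input xi for query point x."""
    return exp(-((xi - x) ** 2) / (2 * tau ** 2))


def lwr_predict(xs, ys, x, tau):
    """Fit a weighted straight line around x and evaluate it at x."""
    w = [weight(xi, x, tau) for xi in xs]
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, xs))
    sy = sum(wi * yi for wi, yi in zip(w, ys))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, xs))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    # Closed-form minimizer of sum_i w_i * (theta0 + theta1 * x_i - y_i)^2.
    theta1 = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    theta0 = (sy - theta1 * sx) / sw
    return theta0 + theta1 * x


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2 * xi + 1 for xi in xs]           # toy data lying exactly on y = 2x + 1

print(round(weight(4.9, 5.0, 0.5), 4))   # weight of a nearby point
print(round(weight(3.0, 5.0, 0.5), 6))   # weight of a distant point
print(lwr_predict(xs, ys, 5.0, tau=0.5))
```

Because θ is recomputed inside lwr_predict for every query point, all of the work happens at prediction
time.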

Points to remember:

 Locally weighted linear regression is a supervised learning algorithm.
 It is a non-parametric algorithm.
 There is no training phase. All the work is done during the testing phase, while making predictions.

Radial Basis Function (RBF) Networks

 Another popular type of feed-forward network is the radial basis function (RBF) network. It has two
layers, not counting the input layer, and differs from a multilayer perceptron in the way that the
hidden units perform computations.
 Each hidden unit essentially represents a particular point in input space, and its output, or activation,
for a given instance depends on the distance between its point and the instance, which is just another
point. The closer these two points, the stronger the activation.
 This is implemented by using a nonlinear transformation function to convert the distance into a
similarity measure. A bell-shaped Gaussian activation function, whose width can be different for
each hidden unit, is generally used for this purpose. The hidden units are known as RBFs because
the points in instance space for which a given hidden unit produces the same activation form a
hypersphere or hyperellipsoid.
 The output layer of an RBF network is the same as that of a multilayer perceptron: it takes a linear
combination of the outputs of the hidden units and, in classification problems, passes it through the
sigmoid function.
 The parameters that such a network learns are the centers and widths of the RBFs and the
weights used to form the linear combination of the outputs obtained from the hidden layer. An essential
benefit over multilayer perceptrons is that the first group of parameters can be determined independently
of the second group and still produce accurate classifiers.
 One method to determine the first group of parameters is to use clustering. The simple k-means
clustering algorithm can be applied, clustering each class independently to obtain k basis functions
for each class.
 The second group of parameters is learned by keeping the first parameters constant. This involves
learning a simple linear classifier using one of the approaches such as linear or logistic regression. If
there are far fewer hidden units than training instances, this can be done quickly.
 The limitation of RBF networks is that they give each attribute the same weight, because all
are treated equally in the distance computation, unless attribute weight parameters are included
in the overall optimization process.
 Therefore, they cannot deal effectively with irrelevant attributes, in contrast to multilayer perceptrons.
Support vector machines share the same issue. Support vector machines with Gaussian kernels (i.e.,
“RBF kernels”) are a particular type of RBF network, in which one basis function is centered on each
training instance, all basis functions have the same width, and the outputs are combined linearly by
computing the maximum-margin hyperplane. This has the effect that only some of the RBFs have a
nonzero weight: the ones that represent the support vectors.
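
The forward computation of such a network can be sketched as follows. The centers, widths, and output weights
below are arbitrary values assumed for illustration; in practice they would be obtained by clustering and by
fitting a linear or logistic model, as described above.

```python
from math import dist, exp


def rbf_activation(x, center, width):
    """Gaussian activation: 1 at the center, falling off with distance."""
    return exp(-dist(x, center) ** 2 / (2 * width ** 2))


def rbf_output(x, centers, widths, weights, bias=0.0):
    """Linear combination of hidden activations passed through a sigmoid."""
    hidden = [rbf_activation(x, c, s) for c, s in zip(centers, widths)]
    z = bias + sum(w * h for w, h in zip(weights, hidden))
    return 1.0 / (1.0 + exp(-z))


# Hypothetical parameters: two hidden units with different widths.
centers = [(0.0, 0.0), (3.0, 3.0)]
widths = [1.0, 0.5]
weights = [2.0, -2.0]

print(rbf_activation((0.0, 0.0), centers[0], widths[0]))
print(rbf_activation((1.0, 1.0), centers[0], widths[0]))
print(rbf_output((0.2, 0.1), centers, widths, weights))
```

All points at the same distance from a center receive the same activation, which is why the equal-activation
surfaces of a hidden unit are hyperspheres.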

Case-based Reasoner (CBR)

As we know, Nearest Neighbour classifiers store training tuples as points in Euclidean space. Case-Based
Reasoning (CBR) classifiers, in contrast, use a database of problem solutions to solve new problems. CBR
stores the tuples or cases for problem-solving as complex symbolic descriptions.

How does CBR work?

When a new case arises to classify, a Case-based Reasoner (CBR) will first check if an identical training
case exists. If one is found, then the accompanying solution to that case is returned. If no identical case is
found, then the CBR will search for training cases having components that are similar to those of the new
case. Conceptually, these training cases may be considered as neighbours of the new case. If cases are
represented as graphs, this involves searching for subgraphs that are similar to subgraphs within the new
case. The CBR tries to combine the solutions of the neighbouring training cases to propose a solution for the
new case. If incompatibilities arise among the individual solutions, then backtracking to search for other
solutions may be necessary. The CBR may employ background knowledge and problem-solving strategies
to propose a feasible solution.

Applications of CBR include:

1. Problem resolution for customer service help desks, where cases describe product-related diagnostic
problems.
2. Areas such as engineering and law, where cases are either technical designs or legal rulings,
respectively.
3. Medical education, where patient case histories and treatments are used to help diagnose and treat
new patients.

Challenges with CBR

 Finding a good similarity metric (e.g., for matching subgraphs) and suitable methods for combining
solutions.
 Selecting salient features for indexing training cases and developing efficient indexing
techniques.

A trade-off between accuracy and efficiency evolves as the number of stored cases becomes very large. As
this number increases, the CBR becomes more intelligent. But after a certain point, the system’s efficiency
will suffer as the time required to search for and process relevant cases increases.

What is Artificial Neural Network?

The term "Artificial Neural Network" is derived from the biological neural networks that make up the structure
of the human brain. Similar to the human brain, which has neurons interconnected to one another, artificial
neural networks also have neurons that are interconnected to one another in the various layers of the network.

Advantages of Artificial Neural Network (ANN)

Parallel processing capability:

Artificial neural networks have the numerical strength to perform more than one task simultaneously.

Storing data on the entire network:

Unlike in traditional programming, information is stored across the whole network rather than in a database.
The disappearance of a couple of pieces of data in one place doesn't prevent the network from working.

Capability to work with incomplete knowledge:

After training, an ANN may produce output even with incomplete data. The loss of performance here depends
on the significance of the missing data.

Having a memory distribution:

For an ANN to be able to learn, it is important to determine the examples and to teach the network
according to the desired output by showing these examples to the network. The success of the network
depends directly on the chosen instances; if the event cannot be shown to the network in all its aspects,
the network can produce false output.

Having fault tolerance:

Corruption of one or more cells of an ANN does not prevent it from generating output, and this feature makes
the network fault-tolerant.

Disadvantages of Artificial Neural Network:

Assurance of proper network structure:

There is no particular guideline for determining the structure of artificial neural networks. The appropriate
network structure is found through experience and trial and error.

Unrecognized behavior of the network:

This is the most significant issue of ANN. When an ANN produces a solution, it does not provide insight
concerning why and how. This decreases trust in the network.

Hardware dependence:

Artificial neural networks need processors with parallel processing power, as per their structure. Therefore,
their realization depends on suitable hardware.

Difficulty of showing the issue to the network:

ANNs work with numerical data. Problems must be converted into numerical values before being
introduced to an ANN. The representation chosen here directly impacts the performance of
the network, and it relies on the user's abilities.

The duration of the network is unknown:

Training is considered complete when the error is reduced to a specific value, and reaching this value does
not guarantee optimum results.

What is the Perceptron model in Machine Learning?


Perceptron is a Machine Learning algorithm for supervised learning of binary classification tasks.
Further, a Perceptron is also understood as an Artificial Neuron or neural network unit that performs
computations on its input data to detect features or patterns, for example in business intelligence.

The Perceptron model is also treated as one of the simplest types of Artificial Neural networks. It is a
supervised learning algorithm for binary classifiers. Hence, we can consider it as a single-layer neural
network with four main parameters, i.e., input values, weights and bias, net sum, and an activation
function.

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These are as follows:

1. Single-layer Perceptron Model
2. Multi-layer Perceptron Model

Single Layer Perceptron Model:

This is one of the easiest Artificial neural network (ANN) types. A single-layered perceptron model
consists of a feed-forward network and also includes a threshold transfer function inside the model. The main
objective of the single-layer perceptron model is to classify linearly separable objects with binary
outcomes.

A single-layer perceptron has no prior knowledge, so it begins with randomly assigned values for the weight
parameters. It then computes the weighted sum of all inputs. If this sum is more than a pre-determined
threshold value, the model gets activated and shows the output value as +1.

If the output is the same as the desired value, then the performance of the model is considered satisfactory
and the weights do not change. However, when the output differs from the desired value for some of the
inputs fed into the model, the weights must be adjusted in order to obtain the desired output and minimize
errors.
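
The training procedure can be sketched as below. This is a minimal illustration, assuming outputs of +1 and
−1, a bias term in place of an explicit threshold, weights starting at zero rather than at random values, and
the standard perceptron update w ← w + η(t − o)x; the AND data set is chosen only because it is linearly
separable.

```python
def step(z):
    """Threshold transfer function: +1 when the net sum is positive, else -1."""
    return 1 if z > 0 else -1


def predict(w, b, x):
    return step(b + sum(wi * xi for wi, xi in zip(w, x)))


def train_perceptron(samples, eta=0.5, max_epochs=100):
    """Adjust weights only when the output differs from the target."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            o = predict(w, b, x)
            if o != t:
                errors += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                b += eta * (t - o)
        if errors == 0:  # every sample classified correctly: stop
            break
    return w, b


# Logical AND with targets encoded as +1 (true) and -1 (false).
and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(and_samples)
print(w, b)
print([predict(w, b, x) for x, _ in and_samples])
```

For data that is not linearly separable, such as XOR, this loop would simply run until max_epochs without
ever reaching zero errors.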

"Single-layer perceptron can learn only linearly separable patterns."

Multi-Layered Perceptron Model:

A multi-layer perceptron model has the same basic structure as a single-layer perceptron model, but
additionally has one or more hidden layers.

The multi-layer perceptron model is trained with the Backpropagation algorithm, which executes in two
stages as follows:

 Forward Stage: In the forward stage, activations are computed starting from the input layer and ending at
the output layer.
 Backward Stage: In the backward stage, weight and bias values are modified as per the model's
requirement. In this stage, the error between the actual output and the desired output is propagated
backward, starting at the output layer and ending at the input layer.

Hence, a multi-layered perceptron model can be considered as an artificial neural network with multiple
layers, in which the activation function is not limited to the simple threshold of a single-layer perceptron
model. Instead, non-linear activation functions such as sigmoid, TanH, ReLU, etc. can be used.

A multi-layer perceptron model has greater processing power and can process linear and non-linear patterns.
Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR, including XOR
and XNOR, which a single-layer perceptron cannot represent.

Advantages of Multi-Layer Perceptron:

 A multi-layered perceptron model can be used to solve complex non-linear problems.
 It works well with both small and large input data.
 It helps us to obtain quick predictions after the training.
 It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:

 In a multi-layer perceptron, computations are difficult and time-consuming.
 In a multi-layer perceptron, it is difficult to determine how much each independent variable affects the
dependent variable.
 The model's functioning depends on the quality of the training.

Gradient descent and Delta Rule

If a set of data points can be separated into two groups using a straight line, the data is said to be linearly
separable. Non-linearly separable data is defined as data points that cannot be split into two groups using a
straight line.

Figure (a) -> Training Set is Linearly Separable

Figure (b) -> Training Set is non-linearly Separable

When the training instances are linearly separable, the perceptron algorithm finds a successful weight
vector; however, if the examples are not linearly separable, it may fail to converge.

The delta rule, a second training rule, is meant to address this challenge. In this section, we’ll have a
brief look at Gradient Descent and the Delta Rule.

If the training instances are not linearly separable, the delta rule converges toward a best-fit
approximation to the target concept.

Delta Rule’s Main Idea:

The Delta rule’s main idea is to explore the hypothesis space of potential weight vectors using gradient
descent to discover the weights that best fit the training instances.

This rule is significant because the BACKPROPAGATION algorithm, which can train networks with
many linked units, is based on gradient descent.
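
A minimal sketch of gradient descent with the delta rule is shown below. It assumes an unthresholded linear
unit o = w · x, the squared error E = ½ ∑ (t − o)² summed over the training examples, and the batch update
wᵢ ← wᵢ + η ∑ (t − o) xᵢ; the learning rate, the number of epochs, and the toy data are arbitrary choices
for illustration.

```python
def delta_rule(samples, eta=0.05, epochs=1000):
    """Batch gradient descent on E = 1/2 * sum((t - o)^2) for a linear unit."""
    n = len(samples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        step = [0.0] * n
        for x, t in samples:
            o = sum(wi * xi for wi, xi in zip(w, x))  # linear unit output
            for i in range(n):
                step[i] += (t - o) * x[i]             # accumulate (t - o) * x_i
        w = [wi + eta * si for wi, si in zip(w, step)]
    return w


def squared_error(w, samples):
    return 0.5 * sum((t - sum(wi * xi for wi, xi in zip(w, x))) ** 2
                     for x, t in samples)


# Toy data from t = 1 + 2*x; the first input is a constant 1 acting as bias.
samples = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]
w = delta_rule(samples)
print(w, squared_error(w, samples))
```

Unlike the perceptron rule, the update uses the raw error (t − o) of the linear output, so the weights keep
moving toward the minimum-error solution even when no weight vector classifies everything correctly.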

Multilayer Neural Networks

A single-layer neural network will work only for linearly separable data and not for non-linearly separable
data. Hence there is a need for Multilayer Neural Networks, to be able to work with non-linearly separable
data.

Figure (a) -> Training Set is Linearly Separable

Figure (b) -> Training Set is non-linearly Separable

A multi-layer Neural Network has one or more hidden layers. Hidden layers, whose neurons are not directly
linked to the output, are used in multilayer networks to address the classification issue for non-linear data.

The hidden layers can be understood geometrically as extra hyper-planes that increase the network’s
separation capability. Typical multilayer network designs are seen in the Figure below.

This new design raises a new challenge: how to train hidden units whose expected output is unknown.
This problem can be solved using the Backpropagation technique.

Backpropagation technique:

Given a network with a defined set of units and linkages, the BACKPROPAGATION algorithm learns the
weights for a multilayer network.

It uses gradient descent to try to reduce the squared error between the network output values and the outputs’
goal values.

We begin by redefining E to total the errors over all of the network output units, because we are examining
networks with multiple output units rather than single units as previously:

E(w) = ½ ∑ ∑ (tkd − okd)² ; outer sum over all training examples d, inner sum over all output units k

Here tkd and okd are the target and output values associated with the kth output unit and training example d,
respectively, and outputs is the set of output units in the network.

In contrast to the single-minimum parabolic error surface of a single linear unit, the error surface in
multilayer networks might contain many local minima.

Unfortunately, this means that gradient descent is only guaranteed to converge toward a local minimum, not
necessarily the global minimum error.

Despite this stumbling block, BACKPROPAGATION has been found to deliver outstanding outcomes in a
variety of real-world scenarios.

The technique presented here is applicable to layered feedforward networks with two layers of sigmoid
units, each layer’s units being linked to all units from the previous layer.

Each node in the network is given an index (for example, an integer), where a “node” is either a network
input or the output of a network unit.

The input from node i to unit j is denoted by xji, while the associated weight is denoted by wji.

The error term associated with unit n is denoted by δn. It plays a role similar to the quantity (t – o) in
our previous explanation of the delta training rule.

ALGORITHM:

The approach starts by building a network with the necessary number of hidden and output units, as well as
setting all network weights to tiny random values.

The main loop of the algorithm then iterates over the training instances using this fixed network topology.

It applies the network to each training example, determines the network output error for this example,
computes the gradient with regard to the error for this example, and then updates all network weights.

This gradient descent phase is repeated until the network performs satisfactorily (sometimes thousands of
times, using the same training samples each time).

The gradient descent weight-update rule, Δwji = η δj xji, is comparable to the delta training rule. Just as
in the delta rule, it changes each weight according to the learning rate η, the input value xji to which the
weight is applied, and the error in the unit’s output.

The main change is that the error (t – o) of the delta rule is substituted with a more complicated error term
δj, whose exact form derives from the weight tuning rule’s derivation.

Consider how δk is computed for each network output unit k to get a sense of how it works:

δk = ok (1 – ok)(tk – ok)

δk is simply (tk – ok) from the delta rule multiplied by the quantity ok (1 – ok), which is the sigmoid
squashing function’s derivative.

The value δh for each hidden unit h has a similar form. However, because target values tk are only provided
for network outputs in training instances, no target values are explicitly available to signal the error of
hidden unit values.

Rather, the error term for hidden unit h is determined by adding the error terms δk for each output unit
impacted by h and weighting each by wkh, the weight from hidden unit h to output unit k:

δh = oh (1 – oh) ∑ wkh δk ; summed over all output units k

The degree to which hidden unit h is “responsible” for the error in output unit k is represented by this
weight.
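
One stochastic weight update of this kind can be sketched as follows. The sketch assumes a single hidden
layer of sigmoid units, a bias weight stored as the last entry of each weight row, and the standard error
terms ok(1 − ok)(tk − ok) for output units and oh(1 − oh) ∑ wkh δk for hidden units; the network size,
the learning rate, and the training example are arbitrary.

```python
import random
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def forward(x, w_hidden, w_out):
    """Return hidden and output activations; each weight row ends with a bias."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    o = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in w_out]
    return h, o


def backprop_step(x, t, w_hidden, w_out, eta=0.5):
    """Update all weights in place for one training example (x, t)."""
    h, o = forward(x, w_hidden, w_out)
    # Error term of each output unit k: o_k * (1 - o_k) * (t_k - o_k).
    delta_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Error term of each hidden unit j: o_j * (1 - o_j) * sum_k w_kj * delta_k.
    delta_hid = [hj * (1 - hj) * sum(w_out[k][j] * delta_out[k] for k in range(len(o)))
                 for j, hj in enumerate(h)]
    # Weight update: w_ji <- w_ji + eta * delta_j * x_ji.
    for k, row in enumerate(w_out):
        for j, v in enumerate(h + [1.0]):
            row[j] += eta * delta_out[k] * v
    for j, row in enumerate(w_hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] += eta * delta_hid[j] * v


rng = random.Random(0)  # small random initial weights
w_hidden = [[rng.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(2)]
w_out = [[rng.uniform(-0.05, 0.05) for _ in range(3)]]

x, t = [1.0, 0.0], [1.0]
before = forward(x, w_hidden, w_out)[1][0]
backprop_step(x, t, w_hidden, w_out)
after = forward(x, w_hidden, w_out)[1][0]
print(before, after)
```

The hidden-unit error terms are computed before any weight is changed, so they use the same output weights
that produced the forward pass.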

What is generalization?

The term ‘generalization’ refers to the model’s capability to adapt and react properly to previously unseen,
new data, which has been drawn from the same distribution as the one used to build the model. In other
words, generalization examines how well a model can digest new data and make correct predictions after
getting trained on a training set.

How well a model is able to generalize is the key to its success. If you fit a model too closely to the
training data, it will be incapable of generalizing. In such cases, it will end up making erroneous
predictions when it’s given new data. This would make the model ineffective even though it’s capable of
making correct predictions for the training data set. This is known as overfitting. The inverse (underfitting)
also occurs, for example when the model is too simple or is trained with inadequate data. In cases of
underfitting, your model would fail to make accurate predictions even with the training data. This would make
the model just as useless as an overfitted one.

Kohonen Self- Organizing Feature Map

Kohonen Self-Organizing feature map (SOM) refers to a neural network, which is trained using competitive
learning. Basic competitive learning implies that the competition process takes place before the cycle of
learning. The competition process suggests that some criteria select a winning processing element. Afte r the
winning processing element is selected, its weight vector is adjusted according to the used learning law
(Hecht Nielsen 1990).

The self-organizing map makes topologically ordered mappings between input data and the processing elements
of the map. Topologically ordered means that if two inputs have similar characteristics, the most active
processing elements responding to those inputs are located close to each other on the map. The weight vectors
of the processing elements are organized in ascending or descending order: W_i < W_{i+1} for all values of i,
or W_i > W_{i+1} for all values of i (this definition is valid for a one-dimensional self-organizing map only).

The self-organizing map is typically represented as a two-dimensional sheet of processing elements, as
described in the figure given below. Each processing element has its own weight vector, and the learning of
the SOM (self-organizing map) depends on the adaptation of these vectors. The processing elements of the
network compete in a self-organizing process, and a specific criterion picks the winning processing
element, whose weights are then updated. Generally, the criterion is to minimize the Euclidean distance
between the input vector and the weight vector. The SOM differs from basic competitive learning in that,
instead of adjusting only the weight vector of the winning processing element, it also adjusts the weight
vectors of neighboring processing elements. At first the neighborhood is large, which produces a rough
ordering of the SOM, and its size is reduced as training goes on. In the end, only the winning processing
element is adjusted, which makes fine-tuning of the SOM possible. The use of a neighborhood makes the
topological ordering possible and, together with competitive learning, makes the process non-linear.
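
One training step of this procedure can be sketched as follows. This is a minimal illustration rather than a full SOM implementation: the names `euclidean`, `best_matching_unit`, and `som_step` are hypothetical, the neighborhood is a hard cutoff (every node within `radius` on the grid gets the same update), and the tiny 2x2 map is made up for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def best_matching_unit(weights, x):
    """Grid coordinate of the node whose weight vector is closest to input x."""
    return min(weights, key=lambda node: euclidean(weights[node], x))


def som_step(weights, x, learning_rate, radius):
    """Find the winning node, then pull it and its grid neighbors toward x."""
    bmu = best_matching_unit(weights, x)
    for node, w in weights.items():
        # Neighborhood is measured on the grid, not in the input space.
        if euclidean(node, bmu) <= radius:
            weights[node] = [wi + learning_rate * (xi - wi) for wi, xi in zip(w, x)]
    return bmu


# A 2x2 map of 2-dimensional weight vectors, keyed by grid coordinate (i, j).
weights = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
           (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0]}
winner = som_step(weights, x=[0.9, 0.1], learning_rate=0.5, radius=1.0)
print(winner, weights)
```

In a full training run, `som_step` would be called for every input over many passes while `radius` and `learning_rate` are gradually reduced, so that early steps order the map roughly and late steps fine-tune only the winner.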

The SOM was introduced by the Finnish professor and researcher Dr. Teuvo Kohonen in 1982. The self-organizing
map is an unsupervised learning model intended for applications in which maintaining a topology between
input and output spaces is important. The notable attribute of this algorithm is that input vectors that are
close and similar in the high-dimensional space are also mapped to nearby nodes in the 2D space. It is
fundamentally a method for dimensionality reduction, as it maps high-dimensional inputs to a low-dimensional
discretized representation and preserves the basic structure of its input space.
The entire learning process occurs without supervision because the nodes are self-organizing. SOMs are
also known as feature maps, as they essentially retain the features of the input data and simply group
themselves according to the similarity between each other. This has practical value for visualizing
complex or large quantities of high-dimensional data and showing the relationships between them in a
low-dimensional, usually two-dimensional, field to check whether the given unlabeled data have any structure.

A self-organizing map (SOM) differs from typical artificial neural networks (ANNs) in both its architecture
and its algorithmic properties. Its structure consists of a single-layer 2D grid of neurons rather than a
series of layers. All the nodes on this lattice are connected directly to the input vector, but not to each
other. This means the nodes do not know the values of their neighbors and only update the weights of their
connections as a function of the given input. The grid itself is the map, which organizes itself at each
iteration as a function of the input data. Each node has its own coordinate (i, j), which enables one to
calculate the Euclidean distance between two nodes by means of the Pythagorean theorem.

A self-organizing map uses competitive learning, instead of error-correction learning, to modify its
weights. This means that only one node is activated at each cycle in which the features of an instance of
the input vector are presented to the neural network, as all nodes compete for the privilege to respond to
the input.

The selected node, the Best Matching Unit (BMU), is chosen according to the similarity between the
current input values and the weights of all the nodes in the network. The node with the smallest Euclidean
distance to the input vector is selected, and it, together with its neighboring nodes within a specific
radius, has its weights slightly adjusted to move toward the input vector. By presenting all the input
vectors to the grid, the whole grid eventually matches the entire input dataset, with similar nodes
gathered in one area and dissimilar ones kept apart.
What is Deep Learning?
Deep learning is a family of software techniques loosely modeled on the network of neurons in a brain. It is
a subset of machine learning based on artificial neural networks with representation learning. It is called
deep learning because it makes use of deep neural networks. This learning can be supervised, semi-supervised,
or unsupervised.

Deep learning algorithms are constructed with connected layers.

• The first layer is called the Input Layer
• The last layer is called the Output Layer
• All layers in between are called Hidden Layers. The word deep means the network joins neurons in more than
two layers.

Each hidden layer is composed of neurons, and the neurons are connected to each other. A neuron processes
the input signal it receives and then propagates it to the next layer. The strength of the signal passed to
a neuron in the next layer depends on the weight, bias, and activation function.
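
How weight, bias, and activation function combine inside a single neuron can be sketched in a few lines. The function names and the choice of a sigmoid activation are assumptions made for this illustration; other activation functions work the same way.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Two incoming signals; the weights decide how strongly each one counts.
signal = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(signal)
```

Here the weighted sum is 0.5 * 1.0 - 0.25 * 2.0 + 0.0 = 0, so the sigmoid returns 0.5. Raising the bias or the weights would push the output toward 1, which is the "strength of the signal" the next layer receives.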

The network consumes large amounts of input data and processes them through multiple layers; it can learn
increasingly complex features of the data at each layer.

In this Deep learning tutorial for beginners, you will learn Deep learning basics like:

• What is Deep Learning?
• Deep learning Process
• Classification of Neural Networks
• Types of Deep Learning Networks
• Feed-forward neural networks
• Recurrent neural networks (RNNs)
• Convolutional neural networks (CNN)
• Reinforcement Learning
• Examples of deep learning applications
• Why is Deep Learning Important?
• Limitations of deep learning

Deep learning Process


Deep neural networks provide state-of-the-art accuracy in many tasks, from object detection to speech
recognition. They can learn automatically, without predefined knowledge explicitly coded by
programmers.

To grasp the idea of deep learning, imagine a family with an infant and parents. The toddler points at
objects with his little finger and always says the word ‘cat.’ As his parents are concerned about his
education, they keep telling him ‘Yes, that is a cat’ or ‘No, that is not a cat.’ The infant persists in
pointing at objects but becomes more accurate with ‘cats.’ The little kid, deep down, does not know why he
can say whether it is a cat or not. He has just learned to build a hierarchy of the complex features that
make up a cat, looking at the pet overall and then focusing on details such as the tail or the nose before
making up his mind.

A neural network works in much the same way. Each layer represents a deeper level of knowledge, i.e., the
hierarchy of knowledge. A neural network with four layers can learn more complex features than one with two
layers.

The learning occurs in two phases:

First Phase: The first phase consists of applying a nonlinear transformation to the input and creating a
statistical model as output.
Second Phase: The second phase aims at improving the model using a mathematical tool known as the
derivative, which indicates how the weights should change to reduce the error.

The neural network repeats these two phases hundreds to thousands of times until it has reached a tolerable
level of accuracy. Each repetition of these two phases is called an iteration.
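
The two phases can be seen in a deliberately tiny example: a "network" with a single weight that has to learn the rule y = 2x. This is a sketch under simplifying assumptions, not how a real deep network is trained: phase 1 is a plain multiplication rather than a nonlinear transformation, the error is the squared difference, and the names `train_weight`, `learning_rate`, and `iterations` are chosen for this illustration.

```python
def train_weight(data, learning_rate=0.05, iterations=200):
    """Repeat the two phases until the single weight w fits the data."""
    w = 0.0
    for _ in range(iterations):
        for x, y in data:
            prediction = w * x                 # phase 1: transform the input into an output
            error = prediction - y
            gradient = 2 * error * x           # derivative of (w * x - y) ** 2 with respect to w
            w -= learning_rate * gradient      # phase 2: improve the model using the derivative
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples of y = 2x
learned = train_weight(data)
print(round(learned, 4))
```

After the iterations, `learned` is very close to 2, the weight that makes every prediction match its target. A real network applies the same idea to thousands or millions of weights at once.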

To give a deep learning example, consider a model that is trying to learn how to dance. After 10 minutes of
training, the model does not know how to dance, and its output looks like a scribble. After 48 hours of
learning, the computer masters the art of dancing.


Classification of Neural Networks
Shallow neural network: A shallow neural network has only one hidden layer between the input and
output.

Deep neural network: Deep neural networks have more than one hidden layer. For instance, the GoogLeNet model
for image recognition has 22 layers.

Nowadays, deep learning is used in many products and services, such as driverless cars, mobile phones, the
Google search engine, fraud detection, TVs, and so on.

Types of Deep Learning Networks


Now in this Deep Neural network tutorial, we will learn about types of Deep Learning Networks:

Feed-forward neural networks


The feed-forward network is the simplest type of artificial neural network. With this type of architecture,
information flows in only one direction: forward. The information flow starts at the input layer, goes
through the “hidden” layers, and ends at the output layer. The network does not have a loop; information
stops at the output layer.
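
The one-directional flow can be sketched as a loop over layers in which each layer's output becomes the next layer's input and nothing is ever fed back. The helper names (`dense`, `feed_forward`), the ReLU activation, and the small hand-picked weights are assumptions made for this illustration.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of weights feeds one neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def feed_forward(inputs, layers):
    """Pass the signal through the layers in order; there are no loops back."""
    signal = inputs
    for weights, biases, activation in layers:
        signal = dense(signal, weights, biases, activation)
    return signal


layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),   # hidden layer with two neurons
    ([[1.0, 1.0]], [0.5], identity),                 # output layer with one neuron
]
output = feed_forward([2.0, 1.0], layers)
print(output)
```

For the input [2.0, 1.0] the hidden layer produces [1.0, 1.5] and the output layer adds them to the bias, giving [3.0].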

Recurrent neural networks (RNNs)


An RNN is a multi-layered neural network that can store information in context nodes, allowing it to learn
data sequences and output a number or another sequence. In simple words, it is an artificial neural network
whose connections between neurons include loops. RNNs are well suited for processing sequences of inputs.

Recurrent neural networks

For example, suppose the task is to predict the next word in the sentence “Do you want a…………?”
• The RNN neurons receive a signal that points to the start of the sentence.
• The network receives the word “Do” as an input and produces a vector of numbers. This vector is fed
back to the neurons to provide a memory to the network. This stage helps the network remember that it
received “Do” and that it received it in the first position.
• The network proceeds similarly with the next words, “you” and “want.” The state of the neurons is updated
upon receiving each word.
• The final stage occurs after receiving the word “a.” The neural network provides a probability for each
English word that can be used to complete the sentence. A well-trained RNN probably assigns a high
probability to “café,” “drink,” “burger,” etc.
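
The feedback loop described in these steps can be sketched with a single recurrent neuron whose state is fed back at every step. This is a simplified illustration: inputs are plain numbers instead of word vectors, the weights `w_x`, `w_h`, and `b` are arbitrary values rather than trained ones, and `rnn_step` and `run_rnn` are hypothetical names.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """New state = tanh(input contribution + contribution of the previous state)."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence one element at a time, carrying the state forward."""
    h = 0.0                      # state at the start of the sequence
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


states_a = run_rnn([1.0, 0.0])
states_b = run_rnn([0.0, 1.0])
print(states_a, states_b)
```

The second state of `states_a` is not zero even though the second input is zero, because the state still carries the first input. The two sequences contain the same values in a different order and end in different states, which is how an RNN can remember both what it received and in which position.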

Common uses of RNN

• Help securities traders generate analytic reports
• Detect abnormalities in the contract of a financial statement
• Detect fraudulent credit-card transactions
• Provide captions for images
• Power chatbots

The standard uses of RNNs occur when practitioners are working with time-series data or sequences (e.g.,
audio recordings or text).

Convolutional neural networks (CNN)


A CNN is a multi-layered neural network with a unique architecture designed to extract increasingly complex
features of the data at each layer to determine the output. CNNs are well suited for perceptual tasks.

Convolutional Neural Network

CNNs are mostly used when there is an unstructured data set (e.g., images) and practitioners need to
extract information from it.

For instance, if the task is to predict an image caption:

• The CNN receives an image of, let’s say, a cat. In computer terms, this image is a collection of pixels:
generally one channel for a greyscale picture and three channels for a color picture.
• During feature learning (i.e., in the hidden layers), the network identifies distinctive features, for
instance the tail of the cat, the ear, etc.
• Once the network has thoroughly learned how to recognize a picture, it can provide a probability for each
label it knows. The label with the highest probability becomes the prediction of the network.

Reinforcement Learning
Reinforcement learning is a subfield of machine learning in which systems are trained by receiving virtual
“rewards” or “punishments,” essentially learning by trial and error. Google’s DeepMind has used
reinforcement learning to beat a human champion at the game of Go. Reinforcement learning is also used in
video games to improve the gaming experience by providing smarter bots.

Some of the best-known algorithms are:

• Q-learning
• Deep Q network
• State-Action-Reward-State-Action (SARSA)
• Deep Deterministic Policy Gradient (DDPG)

Examples of deep learning applications


Now in this Deep learning for beginners tutorial, let’s learn about Deep Learning applications:

AI in Finance:

The financial technology sector has already started using AI to save time, reduce costs, and add value. Deep
learning is changing the lending industry by enabling more robust credit scoring. Credit decision-makers can
use AI for robust credit lending applications to achieve faster, more accurate risk assessment, using machine
intelligence to factor in the character and capacity of applicants.

Underwrite.ai is a fintech company providing an AI solution for credit-granting companies. It uses AI to
detect which applicants are more likely to pay back a loan. The company claims that its approach radically
outperforms traditional methods.

AI in HR:

Under Armour, a sportswear company, revolutionized hiring and modernized the candidate experience with
the help of AI, reducing hiring time for its retail stores by 35%. Under Armour faced growing popularity
back in 2012 and received, on average, 30,000 resumes a month. Reading all of those applications and
starting the screening and interview process was taking too long. The lengthy process to get people hired
and on-boarded impacted Under Armour’s ability to have its retail stores fully staffed, ramped up, and ready
to operate.

At that time, Under Armour had all of the ‘must have’ HR technology in place, such as transactional
solutions for sourcing, applying, tracking, and onboarding, but those tools weren’t useful enough. Under
Armour chose HireVue, an AI provider of HR solutions, for both on-demand and live interviews. The
results were impressive: the company decreased its time to fill by 35% and, in turn, hired higher-quality
staff.

AI in Marketing:

AI is a valuable tool for customer service management and personalization challenges. Improved speech
recognition in call-center management and call routing, as a result of applying AI techniques, allows
a more seamless experience for customers.

For example, deep-learning analysis of audio allows systems to assess a customer’s emotional tone. If the
customer is responding poorly to the AI chatbot, the system can reroute the conversation to real, human
operators who take over the issue.

Apart from the three Deep learning examples above, AI is widely used in other sectors/industries.

Why is Deep Learning Important?


Deep learning is a powerful tool for turning predictions into actionable results. Deep learning excels at
pattern discovery (unsupervised learning) and knowledge-based prediction. Big data is the fuel for deep
learning. When both are combined, an organization can reap unprecedented results in terms of productivity,
sales, management, and innovation.

Deep learning can outperform traditional methods. For instance, deep learning algorithms have been reported
to be 41% more accurate than traditional machine learning algorithms in image classification, 27% more
accurate in facial recognition, and 25% more accurate in voice recognition.

Limitations of deep learning


Now in this Neural network tutorial, we will learn about limitations of Deep Learning:

Data labeling

Most current AI models are trained through “supervised learning.” It means that humans must label and
categorize the underlying data, which can be a sizable and error-prone chore. For example, companies
developing self-driving-car technologies are hiring hundreds of people to manually annotate hours of video
feeds from prototype vehicles to help train these systems.

Obtain huge training datasets

It has been shown that simple deep learning techniques like CNN can, in some cases, imitate the knowledge
of experts in medicine and other fields. The current wave of machine learning, however, requires training
data sets that are not only labeled but also sufficiently broad and universal.

Deep-learning methods require thousands of observations for models to become relatively good at
classification tasks and, in some cases, millions for them to perform at the level of humans.
Unsurprisingly, deep learning is popular at giant tech companies, which have accumulated petabytes of
data. This allows them to create impressive and highly accurate deep learning models.

Explain a problem

Large and complex models can be hard to explain in human terms, for instance why a particular decision
was reached. This is one reason that acceptance of some AI tools is slow in application areas where
interpretability is useful or indeed required.

Furthermore, as the application of AI expands, regulatory requirements could also drive the need for more
explainable AI models.

Convolutional Neural Network

Convolutional neural networks are a special type of feed-forward artificial neural network in which the
connectivity pattern between neurons is inspired by the visual cortex.
The visual cortex contains small regions of cells that are sensitive to specific regions of the visual
field. Individual neuronal cells fire only when edges of a certain orientation are present: some neurons
respond when they are exposed to vertical edges, while others respond when they are shown horizontal or
diagonal edges. This behavior is the motivation behind convolutional neural networks.

Convolutional neural networks, also called ConvNets, are neural networks that share their parameters.
Suppose that there is an image, represented as a cuboid with a length, a width, and a height. Here the
length and width are the dimensions of the image, and the height corresponds to the Red, Green, and Blue
channels, as shown in the image given below.

Now assume that we take a small patch of this image and run a small neural network on it with k outputs,
represented vertically. When we slide this small neural network over the whole image, the result is another
image with a different width, height, and depth. Instead of just the R, G, B channels, we now have more
channels but a smaller width and height. This operation is called convolution. If the patch size were the
same as that of the image, it would be a regular neural network. Because the patch is small, we have fewer
weights.
Mathematically, it can be understood as follows:

• A convolutional layer contains a set of learnable filters. Each filter has a small width and height and
the same depth as the input volume (3 if the input is an RGB image).
• Suppose that we want to run the convolution over an image of dimension 34x34x3. The size of a filter can
then be axax3, where a can be 3, 5, 7, etc. It must be small in comparison to the dimension of the image.
• During the forward pass, each filter slides over the whole input volume step by step. The step size is
called the stride, which can have a value of 2, 3, or even 4 for high-dimensional images. At each position,
the dot product between the filter's weights and the patch from the input volume is computed.
• Sliding a filter produces a 2-dimensional output for that filter. Stacking these outputs together gives an
output volume whose depth equals the number of filters. The network then learns all the filters.
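
The size of the output volume follows directly from the input size, the filter size, and the stride. The helper below is a sketch of the usual formula, output = (input - filter + 2 * padding) / stride + 1, rounded down; the function name `conv_output_shape`, the `padding` parameter, and the choice of 12 filters are assumptions made for this example.

```python
def conv_output_shape(height, width, filter_size, num_filters, stride=1, padding=0):
    """Shape (height, width, depth) of the volume produced by a convolutional layer."""
    out_height = (height - filter_size + 2 * padding) // stride + 1
    out_width = (width - filter_size + 2 * padding) // stride + 1
    return (out_height, out_width, num_filters)   # one output channel per filter


# The 34x34x3 image from the text with twelve 3x3x3 filters and a stride of 1.
print(conv_output_shape(34, 34, filter_size=3, num_filters=12))            # (32, 32, 12)
# A larger stride skips positions, so the output shrinks faster.
print(conv_output_shape(34, 34, filter_size=4, num_filters=12, stride=2))  # (16, 16, 12)
```

Note that the depth of the input (3 for RGB) does not appear in the result: each filter spans the full input depth, so it is the number of filters that sets the output depth.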

Working of CNN
Generally, a convolutional neural network is built from the following layers:

• Input: If the image has a width of 32, a height of 32, and three R, G, B channels, then this layer holds
the raw pixel values of the image ([32x32x3]).
• Convolution: This layer computes the output of neurons that are connected to local regions of the input.
Each neuron calculates a dot product between its weights and the small region of the input volume to which
it is linked. For example, if we choose to use 12 filters, the result is a volume of [32x32x12] (assuming
the input is padded so that the width and height are preserved).
• ReLU Layer: This layer applies an activation function elementwise, such as max(0, x), which thresholds at
zero. The size of the volume is unchanged ([32x32x12]).
• Pooling: This layer performs a downsampling operation along the spatial dimensions (width, height), which
results in a [16x16x12] volume.
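
The ReLU and pooling steps are easy to try on a single small feature map. The sketch below assumes max pooling with a 2x2 window and a stride of 2, which is one common way to halve the width and height (32 to 16 in the text); the function names and the 4x4 example values are made up for the illustration.

```python
def relu(feature_map):
    """Apply max(0, x) to every element; the shape is unchanged."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value of each non-overlapping 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, cols - 1, 2)]
            for i in range(0, rows - 1, 2)]


feature_map = [[1, -2, 3, 0],
               [-1, 5, -6, 2],
               [0, 0, 1, 1],
               [-3, -4, 2, -1]]

activated = relu(feature_map)      # negatives become 0, still 4x4
pooled = max_pool_2x2(activated)   # downsampled to 2x2
print(activated)
print(pooled)
```

The 4x4 map keeps its size after ReLU and becomes 2x2 after pooling, the same halving that turns a [32x32x12] volume into [16x16x12] when applied to each of the 12 channels.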

The 2D Convolution Layer

The most common type of convolution is the 2D convolution layer, usually abbreviated as conv2D. A filter
or kernel in a conv2D layer “slides” over the 2D input data, performing an elementwise multiplication and
then summing the results into a single output pixel. The kernel performs the same operation for every
location it slides over, transforming a 2D matrix of features into a different 2D matrix of features.

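
The sliding operation can be written out directly with nested loops. This is a plain-Python sketch for a single-channel input with no padding; like most deep learning libraries, it slides the kernel without flipping it. The function name `conv2d` and the example image and kernel are assumptions made for this illustration.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image; each position yields one output pixel."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (len(image) - k_rows) // stride + 1
    out_cols = (len(image[0]) - k_cols) // stride + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            total = 0
            for di in range(k_rows):
                for dj in range(k_cols):
                    # Elementwise multiplication, summed into a single value.
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# An image that is dark (0) on the left and bright (1) on the right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
vertical_edge = [[-1, 1],
                 [-1, 1]]
edges = conv2d(image, vertical_edge)
print(edges)   # [[0, 2, 0], [0, 2, 0]]
```

The output is non-zero only in the column where the brightness changes, so this particular kernel behaves like the vertical-edge detectors described for the visual cortex. In a trained CNN the kernel values are learned rather than chosen by hand.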
What is Reinforcement Learning?
• Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave
in an environment by performing actions and seeing the results of those actions. For each good action, the
agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
• In reinforcement learning, the agent learns automatically from feedback without any labeled data, unlike
supervised learning.
• Since there is no labeled data, the agent is bound to learn from its experience only.
• RL solves a specific type of problem where decision making is sequential and the goal is long-term, such
as game-playing, robotics, etc.
• The agent interacts with the environment and explores it by itself. The primary goal of an agent in
reinforcement learning is to improve its performance by collecting the maximum positive reward.
• The agent learns by trial and error, and based on its experience, it learns to perform the task
in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method
where an intelligent agent (computer program) interacts with the environment and learns to act within
that." How a robotic dog learns the movement of its limbs is an example of reinforcement learning.
• It is a core part of artificial intelligence, and many AI agents work on the concept of reinforcement
learning. Here we do not need to pre-program the agent, as it learns from its own experience without any
human intervention.
• Example: Suppose there is an AI agent in a maze environment, and its goal is to find the
diamond. The agent interacts with the environment by performing some actions, and based on those
actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
• The agent keeps doing these three things (take an action, change state or remain in the same state, and
get feedback), and by doing so, it learns and explores the environment.
• The agent learns which actions lead to positive feedback or rewards and which actions lead to negative
feedback or penalties. As a positive reward, the agent gets a positive point, and as a penalty, it gets a
negative point.

Terms used in Reinforcement Learning


• Agent: An entity that can perceive/explore the environment and act upon it.
• Environment: The situation in which an agent is present or by which it is surrounded. In RL, we assume a
stochastic environment, which means it is random in nature.
• Action: Actions are the moves taken by an agent within the environment.
• State: The state is the situation returned by the environment after each action taken by the agent.
• Reward: Feedback returned to the agent from the environment to evaluate the action of the agent.
• Policy: The policy is the strategy applied by the agent to choose the next action based on the current state.
• Value: The expected long-term return with the discount factor, as opposed to the short-term reward.
• Q-value: It is mostly similar to the value, but it takes one additional parameter, the current action (a).
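
The difference between a short-term reward and a long-term, discounted return can be made concrete with a few numbers. The sketch below assumes the usual definition, in which a reward received k steps in the future is multiplied by the discount factor raised to the power k; the function name `discounted_return` and the example rewards are made up for the illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma ** k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total   # fold in the future, one step at a time
    return total


# No reward for the first two steps, then a reward of 1 on the third step.
delayed = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
immediate = discounted_return([1.0, 0.0, 0.0], gamma=0.9)
print(round(delayed, 2), round(immediate, 2))
```

The delayed reward is worth 0.9 * 0.9 * 1 = 0.81 from the starting point, while the same reward received immediately is worth 1. A value estimates this kind of discounted total from a state, whereas the reward is only the single number received at one step.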

Key Features of Reinforcement Learning


• In RL, the agent is not instructed about the environment or about which actions need to be taken.
• It is based on trial and error.
• The agent takes the next action and changes states according to the feedback of the previous action.
• The agent may get a delayed reward.
• The environment is stochastic, and the agent needs to explore it to obtain the maximum positive
reward.

Approaches to implement Reinforcement Learning


There are mainly three ways to implement reinforcement learning in ML, which are:

1. Value-based:
The value-based approach aims to find the optimal value function, which is the maximum value achievable at a
state under any policy. The agent therefore expects the long-term return at any state s under policy π.
2. Policy-based:
The policy-based approach aims to find the optimal policy for the maximum future reward without using a
value function. In this approach, the agent tries to apply a policy such that the action performed at each
step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: For a given state, the policy (π) always produces the same action.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model of the environment is created, and the agent
explores that model to learn the environment. There is no single solution or algorithm for this approach
because the model representation is different for each environment.

Elements of Reinforcement Learning


There are four main elements of Reinforcement Learning, which are given below:

1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment

1) Policy: A policy defines how an agent behaves at a given time. It maps the perceived states of the
environment to the actions taken in those states. A policy is the core element of RL, as it alone can
define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in
other cases it may involve extensive computation, such as a search process. A policy can be deterministic or
stochastic:

For a deterministic policy: a = π(s)

For a stochastic policy: π(a | s) = P[A_t = a | S_t = s]
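
The two kinds of policy can be sketched as a lookup table and a probability table. Everything named here (`deterministic_policy`, `stochastic_policy`, the state names, and the probabilities) is made up for the illustration; the sampling uses `random.choices` from the Python standard library.

```python
import random


def deterministic_policy(state, table):
    """a = pi(s): the same state always yields the same action."""
    return table[state]


def stochastic_policy(state, distribution, rng):
    """Sample an action a with probability pi(a | s)."""
    actions = list(distribution[state])
    probabilities = [distribution[state][a] for a in actions]
    return rng.choices(actions, weights=probabilities, k=1)[0]


table = {"s1": "right", "s2": "right", "s3": "right"}
distribution = {"s1": {"right": 0.8, "down": 0.2}}

rng = random.Random(0)                       # seeded so the run is repeatable
samples = [stochastic_policy("s1", distribution, rng) for _ in range(1000)]
share_right = samples.count("right") / len(samples)

print(deterministic_policy("s1", table))     # always "right"
print(share_right)                           # roughly 0.8
```

The deterministic policy is a plain lookup, so calling it twice for the same state can never disagree. The stochastic policy may return different actions for the same state, and over many calls the share of each action approaches its probability.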

2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state, the
environment sends an immediate signal to the learning agent, and this signal is known as a reward signal.
These rewards are given according to the good and bad actions taken by the agent. The agent's main
objective is to maximize the total reward it receives. The reward signal can change the policy: if an
action selected by the agent leads to a low reward, the policy may change to select other actions in the
future.

3) Value Function: The value function gives information about how good a situation and action are and
how much reward an agent can expect. A reward indicates the immediate signal for each good and bad
action, whereas a value function specifies which states and actions are good in the long run. The value
function depends on the reward because, without reward, there could be no value. The goal of estimating
values is to achieve more reward.

4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the
environment. With the help of the model, one can make inferences about how the environment will behave.
For example, given a state and an action, a model can predict the next state and reward.

The model is used for planning, which means it provides a way to choose a course of action by considering
all future situations before actually experiencing them. Approaches that solve RL problems with the help of
a model are called model-based approaches. By comparison, an approach that does not use a model is called a
model-free approach.

How does Reinforcement Learning Work?


To understand the working process of RL, we need to consider two main things:

• Environment: It can be anything such as a room, maze, football ground, etc.
• Agent: An intelligent agent such as an AI robot.

Let's take an example of a maze environment that the agent needs to explore. Consider the below image:

In the above image, the agent is at the very first block of the maze. The maze contains block S6, which is a
wall, S8, which is a fire pit, and S4, which is a diamond block.

The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1
reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down,
move left, and move right.

The agent can take any path to reach the final point, but it needs to do so in as few steps as possible.
Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward.

The agent will try to remember the preceding steps that it has taken to reach the final step. To memorize
the steps, it assigns the value 1 to each previous step. Consider the below step:

Now, the agent has stored the previous steps by assigning the value 1 to each previous block. But what will
the agent do if it starts moving from a block that has a block with value 1 on both sides? Consider
the below diagram:

It will be difficult for the agent to decide whether it should go up or down, as each block has the same
value. So the above approach is not suitable for the agent to reach the destination. To solve this
problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.
The Bellman Equation
The Bellman equation was introduced by the mathematician Richard Ernest Bellman in the year 1953,
and hence it is called the Bellman equation. It is associated with dynamic programming and is used to
calculate the value of a decision problem at a certain point in terms of the values of the states that
follow it.

It is a way of calculating value functions in dynamic programming, and it leads to modern
reinforcement learning.

The key elements used in the Bellman equation are:

• The action performed by the agent is referred to as "a."
• The current state is "s," and the state reached by performing the action is "s'."
• The reward/feedback obtained for each good and bad action is "R."
• The discount factor is gamma, "γ."

The Bellman equation can be written as:

V(s) = max [R(s,a) + γV(s')]

Where,

V(s) = the value calculated at a particular state s.

R(s,a) = the reward received at state s for performing action a.

γ = the discount factor.

V(s') = the value of the next state s', which is reached from s by performing action a.

In the above equation, we take the maximum over all possible actions because the agent always tries to find
the optimal solution.

So now, using the Bellman equation, we will find the value at each state of the given environment. We will
start from the block that is next to the target block.

For 1st block:

V(s3) = max [R(s,a) + γV(s')], here V(s') = 0 because there is no further state to move to, and R(s,a) = 1
for stepping onto the diamond block.

V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.

For 2nd block:

V(s2) = max [R(s,a) + γV(s')], here let γ = 0.9, V(s') = 1, and R(s,a) = 0, because there is no reward at
this state.

V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9

For 3rd block:

V(s1) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward at
this state either.

V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81

For 4th block:

V(s5) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, because there is no reward at
this state either.

V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73

For 5th block:

V(s9) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, because there is no reward at
this state either.

V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
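
The arithmetic for these five blocks can be checked with a few lines of code. The sketch below walks the path backwards from the block next to the diamond; since only one action per block is considered along a fixed path, the max reduces to a single term. The function name `values_along_path` is hypothetical.

```python
def values_along_path(path, final_reward, gamma):
    """Walk the path backwards, applying V(s) = R(s,a) + gamma * V(s') at each block."""
    values = {}
    next_value = 0.0          # nothing lies beyond the target block
    reward = final_reward     # only the last move (onto the diamond) is rewarded
    for state in reversed(path):
        values[state] = reward + gamma * next_value
        next_value = values[state]
        reward = 0.0
    return values


values = values_along_path(["s9", "s5", "s1", "s2", "s3"], final_reward=1.0, gamma=0.9)
rounded = {state: round(v, 2) for state, v in values.items()}
print(rounded)   # {'s3': 1.0, 's2': 0.9, 's1': 0.81, 's5': 0.73, 's9': 0.66}
```

The code carries the unrounded values forward (0.729, then 0.6561), whereas the hand calculation rounds at each step (0.73, then 0.657). Both give the same result to two decimal places.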

Consider the below image:

Now, we will move further to the 6th block, and here the agent may change its route because it always tries
to find the optimal path. So now, let's consider the block next to the fire pit.
Now, the agent has three options to move: if it moves to the blue box, it will feel a bump; if it moves
to the fire pit, it will get the -1 reward. But here we are taking only positive rewards, so it
will move upwards only. The values of all the blocks will be calculated using this formula. Consider the
below image:

Types of Reinforcement learning


There are mainly two types of reinforcement learning, which are:

 Positive Reinforcement
 Negative Reinforcement

Positive Reinforcement:

Positive reinforcement means adding something to increase the tendency that the expected
behavior will occur again. It has a positive impact on the behavior of the agent and increases the strength of
that behavior.

This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may
lead to an overload of states, which can diminish the results.

Negative Reinforcement:

Negative reinforcement is the counterpart of positive reinforcement: it also increases the tendency
that a specific behavior will occur again, but it does so by avoiding a negative condition.

It can be more effective than positive reinforcement depending on the situation and behavior, but it provides
reinforcement only to meet the minimum required behavior.

How to represent the agent state?

We can represent the agent state using a Markov state, which contains all the required information from the
history. A state St is a Markov state if it satisfies the following condition:

P[St+1 | St] = P[St+1 | S1, ..., St]

A Markov state follows the Markov property, which says that, given the present, the future is independent
of the past. RL as described here works on fully observable environments, where the agent can
observe the environment and act on the new state. The complete process is known as a Markov Decision
Process, which is explained below:

Markov Decision Process


A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the
environment is completely observable, then its dynamics can be modeled as a Markov process. In an MDP, the
agent constantly interacts with the environment and performs actions; after each action, the environment
responds and generates a new state.

MDP is used to describe the environment for RL, and almost all RL problems can be formalized using
MDP.

MDP contains a tuple of four elements (S, A, Pa, Ra):

 A finite set of states S.
 A finite set of actions A.
 Transition probability Pa: the probability that performing action a in state S leads to state S'.
 Reward Ra: the reward received after transitioning from state S to state S' due to action a.
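As an illustration, the four elements can be held in a small container. This is a minimal sketch, not a standard API: the class name `MDP`, its field names, and the two-state toy problem are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    """Container for the tuple (S, A, Pa, Ra); field names are illustrative."""
    states: set        # S
    actions: set       # A
    transitions: dict  # Pa: (s, a) -> {s_next: probability}
    rewards: dict      # Ra: (s, a, s_next) -> reward (missing entries mean 0)

    def validate(self):
        """Check that every transition distribution is over known names and sums to 1."""
        for (s, a), dist in self.transitions.items():
            if s not in self.states or a not in self.actions:
                raise ValueError(f"unknown state or action: {(s, a)}")
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                raise ValueError(f"probabilities for {(s, a)} do not sum to 1")

    def expected_reward(self, s, a):
        """Average reward for taking action a in state s."""
        return sum(p * self.rewards.get((s, a, s_next), 0.0)
                   for s_next, p in self.transitions[(s, a)].items())


# Hypothetical two-state problem: "move" succeeds with probability 0.8.
toy = MDP(
    states={"s1", "s2"},
    actions={"stay", "move"},
    transitions={
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s2": 0.8, "s1": 0.2},
        ("s2", "stay"): {"s2": 1.0},
        ("s2", "move"): {"s1": 0.8, "s2": 0.2},
    },
    rewards={("s1", "move", "s2"): 1.0},
)
toy.validate()
print(toy.expected_reward("s1", "move"))
```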

MDP relies on the Markov property, so to better understand MDP, we first need to learn about this property.

Markov Property:

It says that "If the agent is in the current state s1, performs an action a1, and moves to the state s2,
then the state transition from s1 to s2 depends only on the current state and the action taken; future
actions and states do not depend on past actions, rewards, or states."

In other words, as per the Markov property, the current state transition does not depend on any past action or
state. Hence, MDP is an RL problem that satisfies the Markov property. For example, in a chess game, the
players focus only on the current state and do not need to remember past actions or states.

Finite MDP:

A finite MDP is one with finite states, finite rewards, and finite actions. In this discussion of RL, we
consider only finite MDPs.

Markov Process:

A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the
Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and
a transition function P. These two components (S and P) define the dynamics of the system.
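To make the tuple (S, P) concrete, here is a small sketch that samples a trajectory from a Markov chain. The two weather states and their probabilities are invented for illustration; the only point is that each next state is drawn using the current state alone.

```python
import random

# S: the state set, P: the transition function (hypothetical values).
states = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def sample_chain(P, start, steps, rng):
    """Sample a trajectory; each step looks only at the current state."""
    state = start
    trajectory = [state]
    for _ in range(steps):
        next_states = list(P[state])
        weights = [P[state][s] for s in next_states]
        state = rng.choices(next_states, weights=weights, k=1)[0]
        trajectory.append(state)
    return trajectory


rng = random.Random(0)  # fixed seed so the run is repeatable
trajectory = sample_chain(P, "sunny", 10, rng)
print(trajectory)
```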

Reinforcement Learning Algorithms


Reinforcement learning algorithms are mainly used in AI and gaming applications. The most commonly
used algorithms are:

 Q-Learning:
o Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal
difference learning methods work by comparing temporally successive predictions.
o It learns the value function Q(s, a), which indicates how good it is to take action "a" in a particular state
"s."
o The below flowchart explains the working of Q-learning:

 State Action Reward State Action (SARSA):
o SARSA stands for State Action Reward State Action, and is an on-policy temporal difference
learning method. An on-policy control method selects the action for each state while learning using
a specific policy.
o The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and all state-action pairs (s, a).
o The main difference between Q-learning and SARSA is that SARSA, unlike Q-learning, does not use the
maximum Q-value of the next state when updating the Q-value in the table.
o In SARSA, the new action and reward are selected using the same policy that determined the
original action.
o SARSA is so named because it uses the quintuple (s, a, r, s', a'), where
s: original state
a: original action
r: reward observed after taking action a in state s
s' and a': new state-action pair.
 Deep Q Neural Network (DQN):
o As the name suggests, DQN is Q-learning using neural networks.
o For an environment with a large state space, defining and updating a Q-table is a challenging and
complex task.
o To solve this issue, we can use the DQN algorithm, in which, instead of defining a Q-table, a neural
network approximates the Q-values for each action and state.
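The difference between the off-policy and on-policy updates can be seen by applying both to the same transition. The sketch below uses the standard tabular update rules; the learning rate α = 0.5, the state and action names, and the starting Q-values are assumptions chosen for the example.

```python
ALPHA = 0.5  # learning rate (assumed for this example)
GAMMA = 0.9  # discount factor


def q_learning_update(Q, s, a, r, s_next, actions):
    """Off-policy: bootstrap from the best action available in s_next."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)


def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy: bootstrap from the action a_next the policy actually chose."""
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * Q.get((s_next, a_next), 0.0) - old)


actions = ["left", "right"]
start = {("B", "left"): 0.0, ("B", "right"): 1.0}

# Same transition for both: in A take "right", get reward 0, land in B.
q_table = dict(start)
q_learning_update(q_table, "A", "right", 0.0, "B", actions)

# SARSA also needs the next action; suppose the policy explores and picks "left".
sarsa_table = dict(start)
sarsa_update(sarsa_table, "A", "right", 0.0, "B", "left")

print(q_table[("A", "right")], sarsa_table[("A", "right")])
```

Q-learning moves Q(A, right) toward 0.9 because it assumes the best action in B will be taken, while SARSA leaves it at 0 because the action actually chosen in B currently has value 0.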

Now, we will look at Q-learning in more detail.

Q-Learning Explanation:

 Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
 The main objective of Q-learning is to learn a policy that tells the agent which actions to take
under which circumstances in order to maximize the reward.
 It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
 The goal of the agent in Q-learning is to maximize the value of Q.
 The Q-value can be derived from the Bellman equation. Consider the Bellman equation given
below:

In the equation, we have various components, including the reward, the discount factor (γ), the probability,
and the end state s'. But no Q-value is given yet, so first consider the below image:

In the above image, we can see an agent that has three value options, V(s1), V(s2), V(s3). As this is
an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (Up,
Left, or Right), so it needs to decide where to go for the optimal path. Here the agent makes a move on a
probability basis and changes state. But if we want exact moves, we need to make
some changes in terms of the Q-value. Consider the below image:

Q represents the quality of the actions at each state. So instead of using a value at each state, we use a
pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and
the agent takes its next move according to the best Q-value. The Bellman equation can be used to derive
the Q-value.

For performing any action, the agent gets a reward R(s, a) and also ends up in a certain state, so the
Q-value equation will be:

Hence, we can say that V(s) = max [Q(s, a)]

The above formula is used to estimate the Q-values in Q-learning.
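The equations this passage refers to were shown as images and are not present in the text. Assuming the standard stochastic form of the Bellman equation, which matches the components named above (reward, discount factor γ, transition probability, and end state s'), they would read as follows. The notation P(s, a, s') for the transition probability is an assumption.

```latex
\begin{align}
V(s)   &= \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s,a,s')\, V(s') \Big] \\
Q(s,a) &= R(s,a) + \gamma \sum_{s'} P(s,a,s')\, V(s') \\
V(s)   &= \max_{a} Q(s,a) \\
Q(s,a) &= R(s,a) + \gamma \sum_{s'} P(s,a,s')\, \max_{a'} Q(s',a')
\end{align}
```

The last line follows by substituting the third into the second, and expresses the Q-value purely in terms of other Q-values.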

What is 'Q' in Q-learning?

The Q stands for quality in Q-learning, which means it specifies the quality of an action taken by the agent.

Q-table:

A Q-table, or matrix, is created while performing Q-learning. The table is indexed by state-action pair,
i.e., [s, a], and its values are initialized to zero. After each action, the table is updated, and the Q-values are
stored in the table.

The RL agent uses this Q-table as a reference table to select the best action based on the Q-values.
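A minimal sketch of a Q-table in use is shown below. The environment is a hypothetical five-cell corridor (not the grid from the earlier figures) where the agent starts in cell 0 and earns a reward of 1 on reaching cell 4. The epsilon-greedy exploration scheme and the values of α, γ, and ε are assumptions made for the example.

```python
import random

N_STATES = 5      # cells 0..4; cell 4 is the goal (hypothetical corridor)
ACTIONS = [0, 1]  # 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2


def step(state, action):
    """Move one cell; reward 1 only when the goal cell is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=300, seed=0):
    rng = random.Random(seed)
    # Q-table indexed as q_table[state][action], initialized to zero.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q_table[state])    # exploit, breaking ties randomly
                action = rng.choice(
                    [a for a in ACTIONS if q_table[state][a] == best])
            nxt, reward, done = step(state, action)
            # After each action, update the table entry for (state, action).
            target = reward + (0.0 if done else GAMMA * max(q_table[nxt]))
            q_table[state][action] += ALPHA * (target - q_table[state][action])
            state = nxt
            if done:
                break
    return q_table


q_table = train()
# Use the table as a reference: pick the action with the highest Q-value.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```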

Reinforcement Learning Applications


1. Robotics:
1. RL is used in Robot navigation, Robo-soccer, walking, juggling, etc.
2. Control:
1. RL can be used for adaptive control, such as factory processes and admission control in
telecommunication; helicopter piloting is another example of reinforcement learning.
3. Game Playing:
1. RL can be used in Game playing such as tic-tac-toe, chess, etc.
4. Chemistry:
1. RL can be used for optimizing the chemical reactions.
5. Business:
1. RL is now used for business strategy planning.
6. Manufacturing:
1. In various automobile manufacturing companies, the robots use deep reinforcement learning to pick
goods and put them in some containers.
7. Finance Sector:
1. RL is currently used in the finance sector for evaluating trading strategies.

Genetic Algorithm in Machine Learning

A genetic algorithm is an adaptive heuristic search algorithm inspired by "Darwin's theory of evolution
in Nature." It is used to solve optimization problems in machine learning. It is an important
algorithm because it helps solve complex problems that would otherwise take a long time to solve.

Genetic algorithms are widely used in different real-world applications, for example, designing
electronic circuits, code-breaking, image processing, and artificial creativity.

In this topic, we will explain the genetic algorithm in detail, including the basic terminology used in genetic
algorithms, how they work, and their advantages and limitations.

What is a Genetic Algorithm?


Before understanding the genetic algorithm, let's first go through the basic terminology:

 Population: The population is a subset of all possible or probable solutions that can solve the given problem.
 Chromosomes: A chromosome is one of the solutions in the population for the given problem; a
collection of genes makes up a chromosome.
 Gene: A gene is an element of a chromosome; a chromosome is divided into different genes.
 Allele: An allele is the value given to a gene within a particular chromosome.
 Fitness Function: The fitness function is used to determine an individual's fitness level in the population,
that is, the ability of an individual to compete with other individuals. In every iteration, individuals are
evaluated based on their fitness function.
 Genetic Operators: In a genetic algorithm, the best individuals mate to produce offspring better than
their parents. Genetic operators play a role in changing the genetic composition of the next generation.
 Selection: After calculating the fitness of every individual in the population, a selection process is used to
determine which individuals in the population will get to reproduce and produce the offspring that will form
the next generation.

Types of selection methods available:

 Roulette wheel selection
 Tournament selection
 Rank-based selection

So, now we can define a genetic algorithm as a heuristic search algorithm for solving optimization problems. It
belongs to the family of evolutionary algorithms used in computing. A genetic algorithm uses concepts from
genetics and natural selection to solve optimization problems.

How Does a Genetic Algorithm Work?

The genetic algorithm works on an evolutionary generational cycle to generate high-quality solutions.
It uses different operations that either enhance or replace the population to give a better-fitting
solution.

It involves five phases for solving complex optimization problems, which are given below:

 Initialization
 Fitness Assignment
 Selection
 Reproduction
 Termination

1. Initialization

The process of a genetic algorithm starts by generating a set of individuals, which is called the population.
Here each individual is a solution to the given problem. An individual is characterized by a
set of parameters called genes. Genes are combined into a string to form a chromosome, which represents a
solution to the problem. One of the most popular techniques for initialization is the use of random binary
strings.

2. Fitness Assignment

The fitness function is used to determine how fit an individual is, that is, the ability of an individual to
compete with other individuals. In every iteration, individuals are evaluated based on their fitness function.
The fitness function assigns a fitness score to each individual. This score determines the probability
of being selected for reproduction. The higher the fitness score, the higher the chance of being selected for
reproduction.

3. Selection

The selection phase involves selecting individuals for the reproduction of offspring. The selected
individuals are then arranged in pairs for reproduction. These individuals then transfer their
genes to the next generation.

There are three types of selection methods available, which are:


 Roulette wheel selection
 Tournament selection
 Rank-based selection
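The three selection methods can be sketched as follows. These are simple textbook-style versions written for illustration; the function names, the tournament size k = 3, and the linear rank weights are assumptions. Roulette wheel selection as written requires non-negative fitness values with a positive sum.

```python
import random


def roulette_wheel_selection(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


def tournament_selection(population, fitnesses, rng, k=3):
    """Draw k distinct individuals at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


def rank_based_selection(population, fitnesses, rng):
    """Pick with probability proportional to rank (worst = 1, best = n)."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    ranks = [0] * len(population)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return rng.choices(population, weights=ranks, k=1)[0]


rng = random.Random(42)
population = ["A", "B", "C", "D"]
fitnesses = [1.0, 2.0, 3.0, 10.0]

picks = [roulette_wheel_selection(population, fitnesses, rng) for _ in range(1000)]
print({ind: picks.count(ind) for ind in population})
```

Rank-based selection is less sensitive to one individual having a much larger fitness than the rest: here "D" gets weight 4 out of 10 rather than 10 out of 16.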

4. Reproduction

After the selection process, children are created in the reproduction step. In this step, the genetic
algorithm uses two variation operators that are applied to the parent population. The two operators involved
in the reproduction phase are given below:

 Crossover: Crossover plays the most significant role in the reproduction phase of the genetic algorithm. In
this process, a crossover point is selected at random within the genes. Then the crossover operator swaps
the genetic information of two parents from the current generation to produce a new individual representing
the offspring.

The genes of the parents are exchanged among themselves until the crossover point is reached. The newly
generated offspring are added to the population. This process is called crossover, also known as
recombination. Types of crossover available:
o One-point crossover
o Two-point crossover
o Uniform crossover
o Inheritable Algorithms crossover
 Mutation: The mutation operator inserts random genes into the offspring (new child) to maintain diversity in
the population. It can be done by flipping some bits in the chromosomes.
Mutation helps solve the issue of premature convergence and enhances diversification. The below
image shows the mutation process:
Types of mutation available:
o Flip bit mutation
o Gaussian mutation
o Exchange/Swap mutation
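The two variation operators can be sketched for binary chromosomes as below, using one-point crossover and flip bit mutation. The function names, the list-of-bits representation, and the mutation rate are assumptions for the example.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length parents at a random crossover point."""
    point = rng.randint(1, len(parent_a) - 1)  # never cut at the very ends
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def flip_bit_mutation(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]


rng = random.Random(1)
mother = [0] * 8
father = [1] * 8

child_a, child_b = one_point_crossover(mother, father, rng)
mutant = flip_bit_mutation(child_a, rate=0.1, rng=rng)
print(child_a, child_b, mutant)
```

With all-zero and all-one parents, the crossover point is easy to see: each child is a run of one parent's bits followed by a run of the other's.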
5. Termination

After the reproduction phase, a stopping criterion is applied as the basis for termination. The algorithm
terminates once a solution reaches the threshold fitness. It then identifies the best solution in the
population as the final solution.

General Workflow of a Simple Genetic Algorithm
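The five phases described above can be put together in one short loop. The sketch below is an illustration only: the toy problem (maximize the number of 1s in a 20-bit string), tournament selection of size 3, one-point crossover, flip bit mutation, the parameter values, and the extra step of carrying the best individual over unchanged are all assumptions.

```python
import random

GENES = 20             # chromosome length
POP_SIZE = 30
MUTATION_RATE = 0.02
MAX_GENERATIONS = 200


def fitness(chromosome):
    """Toy fitness: the number of 1 bits."""
    return sum(chromosome)


def run_ga(seed=0):
    rng = random.Random(seed)
    # 1. Initialization: random binary strings.
    population = [[rng.randint(0, 1) for _ in range(GENES)]
                  for _ in range(POP_SIZE)]
    for generation in range(MAX_GENERATIONS):
        # 2. Fitness assignment.
        scores = [fitness(c) for c in population]
        best = max(population, key=fitness)
        # 5. Termination: stop once the threshold fitness is reached.
        if fitness(best) == GENES:
            return best, generation

        # 3. Selection: tournament of size 3.
        def select():
            contenders = rng.sample(range(POP_SIZE), 3)
            return population[max(contenders, key=lambda i: scores[i])]

        # 4. Reproduction: crossover followed by mutation.
        next_population = [best[:]]  # keep the best individual (added choice)
        while len(next_population) < POP_SIZE:
            parent_a, parent_b = select(), select()
            point = rng.randint(1, GENES - 1)
            child = parent_a[:point] + parent_b[point:]
            child = [1 - g if rng.random() < MUTATION_RATE else g for g in child]
            next_population.append(child)
        population = next_population
    return max(population, key=fitness), MAX_GENERATIONS


best, generations = run_ga()
print(fitness(best), generations)
```

Because the best individual is always carried over, the best fitness in the population never decreases from one generation to the next.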

Advantages of Genetic Algorithm


 Genetic algorithms have strong parallel capabilities.
 They help in optimizing various kinds of problems, such as discrete functions, multi-objective problems, and
continuous functions.
 They provide a solution to a problem that improves over time.
 A genetic algorithm does not need derivative information.

Limitations of Genetic Algorithms


 Genetic algorithms are not efficient for solving simple problems.
 They do not guarantee the quality of the final solution to a problem.
 Repeated calculation of fitness values may create computational challenges.
