
MACHINE LEARNING

&
DEEP LEARNING

BY:
SUMA KOMALI
2nd MCA
What is Machine Learning
In the real world, we are surrounded by humans who can learn from their experiences, and by computers or machines that simply follow our instructions. But can a machine also learn from experience or past data the way a human does? This is where Machine Learning comes in.

A machine has the ability to learn if it can improve its performance by gaining
more data.

How does Machine Learning work


A Machine Learning system learns from historical data, builds prediction models, and predicts the output for any new data it receives. The accuracy of the predicted output depends on the amount of data: more data generally helps build a better model that predicts the output more accurately.

Suppose we have a complex problem where we need to make predictions. Instead of writing code for it, we just feed the data to generic algorithms, and the machine builds the logic from the data and predicts the output. Machine learning has changed the way we think about such problems. The block diagram below explains the working of a Machine Learning algorithm:

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as it also deals with huge amounts of data.

Need for Machine Learning


The need for machine learning is increasing day by day, because it can do tasks that are too complex for a person to implement directly. As humans, we have limitations: we cannot process huge amounts of data manually, so we need computer systems, and this is where machine learning makes things easier for us.

We can train machine learning algorithms by providing them with large amounts of data and letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured with a cost function. With the help of machine learning, we can save both time and money.

Following are some key points which show the importance of Machine Learning:

o Rapid increase in the production of data
o Solving complex problems that are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data

Classification of Machine Learning


At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning
Supervised learning is a type of machine learning method in which we provide sample
labeled data to the machine learning system in order to train it, and on that basis, it
predicts the output.

The system creates a model using labeled data to understand the dataset and learn about each example. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output. The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, just as a student learns under the supervision of a teacher. An example of supervised learning is spam filtering.

Supervised learning can be grouped further in two categories of algorithms:

o Classification
o Regression

2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.

The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision. The goal of unsupervised learning is to restructure the input data into new
features or a group of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from large amounts of data. It can be further classified into two categories of algorithms:

o Clustering
o Association

3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to collect the most reward points, and in doing so it improves its performance.

A robotic dog that automatically learns the movement of its arms is an example of reinforcement learning.

Batch Learning
In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This generally takes a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.

If you want a batch learning system to know about new data (such as a new type of spam), you have to train a new version of the system from scratch on the full dataset (both new data and old data), then stop the old system and replace it with the new one.

Fortunately, the whole process of training, evaluating, and launching a Machine Learning system can be automated fairly easily, so even a batch learning system can adapt to change. Simply update the data and train a new version of the system from scratch as often as needed.

This solution is simple and often works fine, but training on the full set of data can take many hours, so you would typically train a new system only every 24 hours or even just weekly. If your system needs to adapt to rapidly changing data, then you need a more reactive solution.

Also, training on the full set of data requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.). If you have a lot of data and you automate your system to train from scratch every day, it will end up costing you a lot of money. If the amount of data is huge, it may even be impossible to use a batch learning algorithm.

Finally, if the system needs to be able to learn autonomously and it has limited resources (e.g., a smartphone application or a rover on Mars), then carrying around large amounts of training data and using a lot of resources to train for hours every day is a showstopper.

So, a better option in all these cases is to use algorithms that are capable of learning incrementally.

Online Learning
In online learning, we train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is cheap and fast, so the system can learn about new data on the fly.

Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and "replay" the data). This can save a huge amount of space.

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory; this is called out-of-core learning. The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, the system will rapidly adapt to new data, but it will also tend to quickly forget the old data (and you don't want a spam filter to flag only the latest kinds of spam it was shown). Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points.
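As a concrete illustration, here is a minimal sketch of online learning using scikit-learn's SGDClassifier and its partial_fit method; the continuous data stream is simulated with mini-batches drawn from a synthetic dataset, so the dataset, batch size, and learning-rate settings are illustrative assumptions rather than part of the original text.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Simulated "stream": a synthetic dataset fed to the model in mini-batches.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
classes = np.unique(y)  # partial_fit must be told all classes up front

model = SGDClassifier(learning_rate="constant", eta0=0.01, random_state=42)

batch_size = 500
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    model.partial_fit(X_batch, y_batch, classes=classes)  # one cheap, fast learning step

print("accuracy on the data seen so far:", model.score(X, y))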

Instance-based vs Model-based Learning:


Differences

Machine learning is a field of artificial intelligence that deals with giving machines the
ability to learn without being explicitly programmed. In this context, instance-based
learning and model-based learning are two different approaches used to create machine
learning models. While both approaches can be effective, they also have distinct
differences that must be taken into account when building a machine learning system.

What is instance-based learning & how does it work?
Instance-based learning (also known as memory-based learning or lazy learning)
involves memorizing training data in order to make predictions about future data points.
This approach doesn’t require any prior knowledge or assumptions about the data, which
makes it easy to implement and understand.

However, it can be computationally expensive since all of the training data needs to be
stored in memory before making a prediction. Additionally, this approach doesn’t
generalize well to unseen data sets because its predictions are based on memorized
examples rather than learned models.
In instance-based learning, the system learns the training data by heart. At prediction time, it uses a similarity measure to compare new cases with the learned data.
K-nearest neighbors (KNN) is an algorithm that belongs to the instance-based learning
class of algorithms. KNN is a non-parametric algorithm because it does not assume any
specific form or underlying structure in the data.
Instead, it relies on a measure of similarity between each pair of data points. Generally
speaking, this measure is based on either Euclidean distance or cosine similarity;
however, other forms of metric can be used depending on the type of data being
analyzed.
Once the similarity between points is calculated, KNN looks at the k closest neighbors of the new point and uses these neighbors as examples to make its prediction.
This means that instead of creating a generalizable model from all of the data, KNN looks for similarities among individual data points and makes predictions accordingly. The picture below demonstrates how a new instance would be predicted as a triangle because most of the points in its proximity are triangles.

Because KNN is an instance-based learning algorithm, it is not suitable for very large
datasets. This is because the model has to store all of the training examples in memory,
and making predictions on new data points involves comparing the new point to all of
the stored training examples. However, for small or medium-sized datasets, KNN can be
a very effective learning algorithm.
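As a hedged sketch of the idea, the snippet below trains scikit-learn's KNeighborsClassifier on the Iris dataset; the dataset and k=5 are assumptions chosen purely for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" mostly means storing the instances; prediction compares new
# cases to the stored instances using a distance-based similarity measure.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("test accuracy:", knn.score(X_test, y_test))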

Other instance-based learning algorithms include learning vector quantization


(LVQ) and self-organizing maps (SOMs). These algorithms also memorize the
training examples and use them to make predictions on new data, but they use different
techniques to do so.

What is model-based learning & how does it work?
Model-based learning (also known as structure-based or eager learning) takes a different approach by constructing models from the training data that generalize better than instance-based methods. This involves using algorithms such as linear regression, logistic regression, random forests, or decision trees to create an underlying model from which predictions can be made for new data points. The picture below shows how the prediction of the class is decided from a boundary learned from the training data, rather than by comparing new cases with the stored data using similarity measures.
The model-based learning approach has several benefits over instance-based methods, such as faster prediction and better generalization, because it relies on an underlying model rather than solely on memorized examples. However, it requires more time and effort to develop and tune the model for optimal performance on unseen data sets.
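For contrast with the KNN sketch above, here is a minimal model-based example using logistic regression on the same Iris data (again an illustrative assumption, not part of the original text): the model learns a decision boundary during fit and no longer needs the training set at prediction time.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The coefficients of the decision boundary are learned here.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predictions come from the learned model alone, not from stored instances.
print("test accuracy:", clf.score(X_test, y_test))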

Conclusion
In conclusion, instance-based and model-based learning are two distinct approaches used in machine learning systems. Instance-based methods require less effort but don't generalize as well, while model-based methods require more effort but generalize better. It is important for anyone working with machine learning
systems to understand how these two approaches differ so they can choose the best one
for their specific applications. With a proper understanding of both types of machine
learning techniques, you will be able to create powerful systems that achieve your
desired goals with minimal effort and maximum accuracy!

Challenges of Machine Learning


The advancement of machine learning technology in recent years certainly has
improved our lives. However, the implementation of machine learning in companies
has also brought up several ethical issues regarding AI technology. A few of them are:

Technological Singularity:
Although this topic attracts a lot of public attention, many scientists are not concerned with the notion of AI exceeding human intelligence in the near future. Such an AI is often referred to as a superintelligence, which Nick Bostrom defines as "any intelligence that far surpasses the top human brains in virtually every field, which includes general wisdom, scientific creativity and social abilities." Even though superintelligence and strong AI are not yet a reality, the concept poses some interesting questions when we contemplate the use of autonomous systems, such as self-driving vehicles. It is unrealistic to expect that a driverless car would never be involved in an accident, but who is responsible and liable in those situations? Should we continue to pursue fully autonomous vehicles, or should we restrict this technology to semi-autonomous cars that promote driver safety? The jury is still out on this issue, but these kinds of ethical debates are being fought out as new and genuine AI technology is developed.

AI Impact on Jobs:
While much of the public conversation about artificial intelligence centers on job loss, the concern should probably be reframed. With every new, disruptive technology, we see shifts in the demand for particular job roles. For instance, in the automotive industry, many manufacturers such as GM are focusing their efforts on electric vehicles to align with green policies. The energy sector isn't going away, but the primary source that fuels it is shifting from a fuel-based economy to an electric one. Artificial intelligence should be viewed in a similar way: it is expected to shift the demand for jobs to other areas. There will need to be people who manage these systems as data grows and changes every day. Resources will still be needed to address the more complex problems within the sectors most likely to be affected by demand shifts, such as customer service. The most important aspect of artificial intelligence and its effect on the job market will be helping individuals transition to the new areas of demand the market creates.

Privacy:
Privacy is frequently discussed in relation to data privacy, data protection, and data security, and these concerns have allowed policymakers to make progress in recent years. For instance, in 2016 the GDPR legislation was introduced to protect the personal data of individuals in the European Union and European Economic Area, giving individuals more control over their data. In the United States, individual states are developing policies, such as the California Consumer Privacy Act (CCPA), that require companies to inform consumers about how their data is processed. This legislation is forcing companies to rethink how they handle and store personally identifiable information (PII). As a result, investment in security has become a business priority, as companies seek to eliminate any vulnerabilities and opportunities for surveillance, hacking, and cyberattacks.
Bias and Discrimination:
Discrimination and bias in intelligent machines have raised several ethical questions about the use of artificial intelligence. How can we protect against bias and discrimination when the training data itself may be biased? While most companies have well-meaning intentions with their automation initiatives, Reuters highlights the unforeseen consequences of incorporating AI into hiring practices. In its effort to automate and simplify a process, Amazon unintentionally biased potential job candidates by gender for technical roles, and ultimately had to scrap the project. As events like these surface, Harvard Business Review has raised pointed questions about the use of AI in hiring practices, for example: what data should you be able to use when evaluating a candidate for a particular role?

Discrimination and bias aren't just limited to the human resource function. They are
present in a variety of applications ranging from software for facial recognition to
algorithms for social media.

Accountability:
There is no significant legislation to regulate AI practices, and no real enforcement mechanism to ensure that ethical AI is practiced. Companies' main incentive to adhere to ethical standards is the negative impact an untrustworthy AI system can have on their bottom lines. To fill the gap, ethical frameworks have emerged from partnerships between researchers and ethicists to govern the creation and use of AI models. For the time being, however, these frameworks only serve to provide guidance for the development of AI models, and research has shown that shared responsibility and insufficient awareness of potential consequences are not ideal for protecting society from harm.

1. Insufficient Quantity of Training Data-

It takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition, you may need millions of examples (unless you can reuse parts of an existing model).

However, small- and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data, so don't abandon algorithms just yet.

2. Non-representative Training Data

In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.

For example, the set of countries we used earlier for training the linear model was not perfectly representative: a few countries were missing.

• solid line → linear model trained on the new data
• dotted line → old model

Observations:

• Adding a few missing countries significantly alters the model.
• Very rich countries are not happier than moderately rich countries, and some poor countries seem happier than many rich countries.
• A simple linear model is probably never going to work well here.

By using a nonrepresentative training set, we trained a model that is unlikely to make accurate predictions, especially for very poor and very rich countries.

If the sample is too small, you will have sampling noise. Even for a large sample, if the sampling method is flawed, the sample can be nonrepresentative; this is called sampling bias.

In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others.

Sampling Bias Example

Terminology-

Dimension usually refers to the number of attributes.

Attribute is one particular "type of data". Attributes are often called features in Machine Learning.
Ex- name, weight, height, age, etc.

Instance- An instance is an example in the training data. An instance is described by many attributes.
Ex- Sanidhya, 62, 5.11, 18

3. Poor-Quality Data

If your training data is full of errors, outliers, and noise, it will be harder for the system to detect the underlying patterns.

The solution is to clean up the training data (90% of the things you'll do as a Data Scientist).

Outliers-
• Discard them or fix them manually.

Some instances missing a few features (e.g., 5% of your customers did not specify their age)-
• Ignore that attribute altogether (e.g., the age attribute)
• Ignore the incomplete instances (e.g., that 5%)
• Fill in the missing values (e.g., with the median age)
• Train one model with the feature (attribute) and one model without it
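A minimal sketch of these options using pandas, assuming a hypothetical customer table with a partly missing "age" column:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 40, np.nan, 31, 58],   # some customers did not specify their age
    "income": [30000, 52000, 41000, 39000, 75000],
})

drop_attribute = df.drop(columns=["age"])        # ignore the attribute altogether
drop_instances = df.dropna(subset=["age"])       # ignore the incomplete instances
filled = df.fillna({"age": df["age"].median()})  # fill the missing values with the median age

print(filled)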

4. Irrelevant Features

A model will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones.

A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process is called feature engineering.

It involves the following steps (a short sketch of the first step follows the list):

• Feature selection: selecting the most useful features to train on among the existing features
• Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help)
• Creating new features by gathering new data
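The feature-selection step can be sketched with scikit-learn's SelectKBest; the synthetic dataset and the choice of k=3 are assumptions made only for illustration.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, of which only 3 are actually informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

selector = SelectKBest(score_func=f_classif, k=3)  # keep the 3 most useful features
X_selected = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("new shape:", X_selected.shape)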

Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more data points than necessary, in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, which reduces its efficiency and accuracy. An overfitted model has low bias and high variance.

The chances of overfitting increase the more we train our model: the longer we train, the higher the chance of ending up with an overfitted model.

Overfitting is the main problem that occurs in supervised learning.


Example: The concept of overfitting can be understood from the graph of a linear regression output below.

As we can see from the graph, the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not. The goal of a regression model is to find the best-fit line, but here we have not found a generalizable fit, so the model will produce prediction errors on new data.

How to avoid the Overfitting in Model


Both overfitting and underfitting degrade the performance of a machine learning model, but overfitting is the more common problem, and there are several ways to reduce its occurrence (a short sketch of one of them follows the list):

o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
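As a hedged illustration of two items from this list (cross-validation and regularization), the sketch below compares a high-degree polynomial regression with and without ridge regularization on small, noisy synthetic data; every detail of the setup is an assumption chosen to make overfitting visible.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.rand(30, 1) * 6, axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=30)  # little data plus noise invites overfitting

overfit = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0))

# Cross-validated scores: the regularized pipeline should generalize better.
print("plain:      ", cross_val_score(overfit, X, y, cv=5).mean())
print("regularized:", cross_val_score(regularized, X, y, cv=5).mean())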

Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, but then the model may not learn enough from the training data and may fail to find the best fit for the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training data, and hence it has lower accuracy and produces unreliable predictions.

An underfitted model has high bias and low variance.

Example: We can understand underfitting from the output of the linear regression model below.

As we can see from the diagram, the model is unable to capture the trend of the data points in the plot.

How to avoid underfitting:


o By increasing the training time of the model.
o By increasing the number of features.

Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine
learning models to achieve the goodness of fit. In statistics modeling, it defines how
closely the result or predicted values match the true values of the dataset.

The model with a good fit is between the underfitted and overfitted model, and ideally,
it makes predictions with 0 errors, but in practice, it is difficult to achieve it.

As when we train our model for a time, the errors in the training data go down, and
the same happens with test data. But if we train the model for a long duration, then
the performance of the model may decrease due to the overfitting, as the model also
learn the noise present in the dataset. The errors in the test dataset start increasing, so
the point, just before the raising of errors, is the good point, and we can stop here for
achieving a good model.

What is the goal of ML testing?


First of all, what are we trying to achieve when performing ML testing, as
well as any software testing whatsoever?

• Quality assurance is required to make sure that the software system


works according to the requirements. Were all the features
implemented as agreed? Does the program behave as expected? All
the parameters that you test the program against should be stated in
the technical specification document.
• Moreover, software testing has the power to point out all the defects
and flaws during development. You don’t want your clients to
encounter bugs after the software is released and come to you
waving their fists. Different kinds of testing allow us to catch bugs that
are visible only during runtime.

However, in machine learning, a programmer usually inputs the data and


the desired behavior, and the logic is elaborated by the machine. This is
especially true for deep learning. Therefore, the purpose of machine
learning testing is, first of all, to ensure that this learned logic will remain
consistent, no matter how many times we call the program.

Model evaluation in machine learning testing


Usually, software testing includes:

• Unit tests. The program is broken down into blocks, and each
element (unit) is tested separately.
• Regression tests. They cover already tested software to see if it
doesn’t suddenly break.
• Integration tests. This type of testing observes how multiple
components of the program work together.
First of all, you split the database into three non-overlapping sets. You use
a training set to train the model. Then, to evaluate the performance of the
model, you use two sets of data:

• Validation set. Having only a training set and a testing set is not
enough if you do many rounds of hyperparameter-tuning (which is
always). And that can result in overfitting. To avoid that, you can
select a small validation data set to evaluate a model. Only after you
get maximum accuracy on the validation set, you make the testing
set come into the game.
• Test set (or holdout set). Your model might fit the training dataset
perfectly well. But where are the guarantees that it will do equally well
in real-life? In order to assure that, you select samples for a testing
set from your training set — examples that the machine hasn’t seen
before. It is important to remain unbiased during selection and draw
samples at random. Also, you should not use the same set many
times to avoid training on your test data. Your test set should be large
enough to provide statistically meaningful results and be
representative of the data set as a whole.
But just like test sets, validation sets "wear out" when used repeatedly. The more times you use the same data to make decisions about hyperparameter settings or other model improvements, the less confident you can be that the model will generalize well on new, unseen data. So it is a good idea to collect more data to "freshen up" the test set and validation set.
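A minimal sketch of the three-way split described above, using scikit-learn's train_test_split on the Iris dataset (the dataset and split ratios are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out a test set, then split the remainder into training and validation sets.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 / 30 / 30 of the 150 Iris rows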

UNIT 2

What is Classification?

Classification is defined as the process of recognizing, understanding, and grouping objects and ideas into preset categories, a.k.a. "sub-populations." With the help of pre-categorized training datasets, classification programs in machine learning leverage a wide range of algorithms to classify future data into the respective relevant categories.

Binary Classification

What is Binary Classification?


In machine learning, binary classification is a supervised learning
algorithm that categorizes new observations into one of two classes.
The following are a few binary classification applications, where the 0
and 1 columns are two possible classes for each observation:

Application Observation 0 1

Medical Diagnosis Patient Healthy Diseased

Email Analysis Email Not Spam Spam

Financial Data Analysis Transaction Not Fraud Fraud

Marketing Website visitor Won't Buy Will Buy

Image Classification Image Hotdog Not Hotdog

Quick example
In a medical diagnosis, a binary classifier for a specific disease could
take a patient's symptoms as input features and predict whether the
patient is healthy or has the disease. The possible outcomes of the
diagnosis are positive and negative.

Evaluation of binary classifiers


If the model successfully predicts the patients as positive, this case is
called True Positive (TP). If the model successfully predicts patients as
negative, this is called True Negative (TN). The binary classifier may
misdiagnose some patients as well. If a diseased patient is classified as
healthy by a negative test result, this error is called False Negative (FN).
Similarly, If a healthy patient is classified as diseased by a positive test
result, this error is called False Positive(FP).
We can evaluate a binary classifier based on the following parameters:

• True Positive (TP): The patient is diseased and the model predicts "diseased"

• False Positive (FP): The patient is healthy but the model predicts "diseased"

• True Negative (TN): The patient is healthy and the model predicts "healthy"

• False Negative (FN): The patient is diseased and the model predicts "healthy"

After obtaining these values, we can compute the accuracy score of the binary classifier as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The following confusion matrix represents the above parameters:

                      Actual Positive     Actual Negative
Predicted Positive    True Positive       False Positive
Predicted Negative    False Negative      True Negative
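The same quantities can be computed with scikit-learn, sketched here on a handful of hypothetical diagnosis labels (1 = diseased), which are assumptions for illustration only:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual patient status (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (hypothetical)

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels 0/1.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("accuracy:", accuracy_score(y_true, y_pred))  # (TP + TN) / total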

In machine learning, many methods utilize binary classification. The


most common are:
• Support Vector Machines

• Naive Bayes

• Nearest Neighbor

• Decision Trees

• Logistic Regression

• Neural Networks

Performance Measures for a Classification Model

Confusion Matrix

For a binary problem (2 class labels in the dataset), the confusion matrix looks like this:

• here f(i)(j) is an incorrect prediction (class i predicted to be class j)
• The confusion matrix is also known as the error matrix
• Each row represents the instances in the actual class, whereas each column represents the instances in the predicted class

→ This confusion matrix allows us to derive a lot of performance metrics, which we discuss below.

Accuracy

In general terms, accuracy is the closeness of measurements to a specific value: if we measure something repeatedly, we say the measurement is accurate if it is close to the true value of the quantity being measured.

For a classifier, accuracy is the fraction of predictions the model got right:

Accuracy = (number of correct predictions) / (total number of predictions) = (TP + TN) / (TP + TN + FP + FN)

(Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar)

Error Rate

It is the opposite of accuracy: as the name suggests, this metric measures the performance of a model in terms of its incorrect predictions.

Error rate = (number of incorrect predictions) / (total number of predictions) = (FP + FN) / (TP + TN + FP + FN)

(Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar)

Note: It is important to note that the accuracy and error rate metrics are prone to the class imbalance problem. The class imbalance problem occurs when the dataset contains one or more classes in much lower proportion (i.e., rare) relative to the rest of the classes.

Example:

→ Consider a 2-class problem

• Number of class POS instances = 10
• Number of class NEG instances = 990

If the model predicts everything to be NEG, then

accuracy = 990 / 1000 = 99%

This is misleading because the model doesn't detect the POS class at all, and detecting the rare class is usually what matters most (examples: fraud, spam, cancer detection, etc.).

This calls for other performance metrics that do not suffer from this problem; we discuss them below.

Precision

In general terms, precision is the degree to which repeated measurements under the same conditions show the same result; it is often measured by the standard deviation of a set of values.

Example: We have an item that weighs 1 g. We measure it 5 times and get the following set of weights: {1.015, 0.990, 1.013, 1.001, 0.986}. The standard deviation of these measurements is about 0.013, so the most precisely we can state the weight of the item is 1 ± 0.013.

Having said that, in Machine Learning precision is defined as:

Precision = TP / (TP + FP)

(Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar)

Precision is the fraction of records that actually turn out to be positive among the records the classifier has declared positive. The higher the precision, the lower the number of false positives committed by the model.

To understand this with an example, suppose we are searching for documents that contain the term 'machine learning' in a corpus of 100 documents, and 20 of the 100 documents are actually relevant. When queried for 'machine learning', the model fetches 15 documents, of which 12 are relevant. The precision turns out to be

precision = 12 / 15 = 80%

Recall / True Positive Rate

Recall measures the fraction of positive examples correctly predicted by the classifier:

Recall = TP / (TP + FN)

(Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar)

Using the same example, of the 20 relevant documents for 'machine learning' in the corpus of 100 documents, the model fetches 15 documents, of which 12 are relevant. The recall turns out to be

recall = 12 / 20 = 60%

F-measure

Precision and recall can be combined into a single metric called the F-measure, the harmonic mean of precision and recall:

F = 2 × (Precision × Recall) / (Precision + Recall)

The harmonic mean of two numbers x and y is close to the smaller of the two, so a high F-measure ensures that both precision and recall are reasonably high.
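The document-retrieval example above can be reproduced with scikit-learn's metric functions; the label arrays below are a hypothetical corpus constructed only to match the counts (20 relevant documents, 15 retrieved, 12 of them relevant).

from sklearn.metrics import f1_score, precision_score, recall_score

# 1 = relevant document; ordering is chosen so the first 15 are the retrieved ones.
y_true = [1] * 12 + [0] * 3 + [1] * 8 + [0] * 77  # 20 relevant among 100 documents
y_pred = [1] * 15 + [0] * 85                      # 15 documents retrieved

print("precision:", precision_score(y_true, y_pred))  # 12 / 15 = 0.80
print("recall:   ", recall_score(y_true, y_pred))     # 12 / 20 = 0.60
print("F-measure:", f1_score(y_true, y_pred))         # harmonic mean, about 0.686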
3.1. Cross-validation: evaluating estimator performance
Learning the parameters of a prediction function and testing it on the same data is a
methodological mistake: a model that would just repeat the labels of the samples
that it has just seen would have a perfect score but would fail to predict anything
useful on yet-unseen data. This situation is called overfitting. To avoid it, it is
common practice when performing a (supervised) machine learning experiment to
hold out part of the available data as a test set X_test, y_test. Note that the word
“experiment” is not intended to denote academic use only, because even in
commercial settings machine learning usually starts out experimentally. Here is a
flowchart of typical cross validation workflow in model training. The best parameters
can be determined by grid search techniques.

In machine learning, we cannot simply fit the model on the training data and claim that it will work accurately on real data. For this, we must make sure that our model has learned the correct patterns from the data and has not picked up too much noise. For this purpose, we use the cross-validation technique.
Cross validation is a technique used in machine learning to evaluate the
performance of a model on unseen data. It involves dividing the available
data into multiple folds or subsets, using one of these folds as a validation set, and training the model on the remaining folds. This process is repeated
multiple times, each time using a different fold as the validation set. Finally,
the results from each validation step are averaged to produce a more robust
estimate of the model’s performance.
The main purpose of cross validation is to prevent overfitting, which occurs
when a model is trained too well on the training data and performs poorly on
new, unseen data. By evaluating the model on multiple validation sets, cross
validation provides a more realistic estimate of the model’s generalization
performance, i.e., its ability to perform well on new, unseen data.
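A minimal sketch of k-fold cross-validation with scikit-learn (the model and dataset are illustrative assumptions): each of the five folds serves once as the validation set, and the scores are averaged.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # 5 folds, each used once for validation
print("fold scores:", scores)
print("mean accuracy:", scores.mean())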

Confusion Matrix:

o The confusion matrix provides a matrix/table as output and describes the performance of the model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, giving the total numbers of correct and incorrect predictions. It looks like the table below:

                      Actual Positive     Actual Negative
Predicted Positive    True Positive       False Positive
Predicted Negative    False Negative      True Negative

AUC-ROC curve:

o ROC curve stands for Receiver Operating Characteristics Curve and AUC
stands for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o To visualize the performance of the multi-class classification model, we use the
AUC-ROC Curve.
o The ROC curve is plotted with TPR against FPR, with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on the X-axis.

ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve
plots two parameters:

o True Positive Rate


o False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

TPR = TP / (TP + FN)

False Positive Rate (FPR) is defined as follows:

FPR = FP / (FP + TN)
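A sketch of computing and plotting an ROC curve with scikit-learn and matplotlib; the synthetic dataset and logistic-regression scorer are assumptions used only to produce probabilities to threshold.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # TPR vs FPR at every threshold
print("AUC:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()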

o What is multiclass classification?


o Classification means categorizing data and forming groups based
on the similarities. In a dataset, the independent variables or features
play a vital role in classifying our data. When we talk about
multiclass classification, we have more than two classes in our
dependent or target variable, as can be seen in Fig.1:

o The picture above, taken from the Iris dataset, shows that the target variable has three categories, i.e., Virginica, Setosa, and Versicolor, which are three species of the Iris plant. We might use this dataset later as an example for a conceptual understanding of multiclass classification.

o Which classifiers do we use in multiclass classification? When do we use them?
o We use many algorithms such as Naïve Bayes, Decision trees, SVM,
Random forest classifier, KNN, and logistic regression for
classification. But we might learn about only a few of them here
because our motive is to understand multiclass classification. So,
using a few algorithms we will try to cover almost all the relevant
concepts related to multiclass classification.

What is Error Analysis?

Error analysis is the process of isolating, observing, and diagnosing erroneous ML predictions, thereby helping to understand pockets of high and low performance of the model. When it is said that "the model accuracy is 90%", that accuracy might not be uniform across subgroups of the data, and there might be some input conditions for which the model fails more often. So error analysis is the next step beyond aggregate metrics: a more in-depth review of model errors aimed at improvement.

For example, a dog-detection image recognition model might do better for dogs in outdoor settings than in low-lit indoor settings. This might be due to skewed datasets, and error analysis helps identify whether such cases impact model performance. The illustration below shows how moving from aggregate errors to group-wise errors gives a better picture of model performance.

Source: Responsible ML with error analysis

2. Error Identification and Diagnosis

This step helps us understand how errors are distributed across key hypotheses or key features/classes/cohorts of the dataset. For example, in a loan-approval model used by a bank, the model might produce more errors for individuals who are younger and have a low monthly average balance with the bank.

A. How to do this (manually), especially when your data is image, voice, or text, where you might not have apparent features:

Let us take another example – a favorite one – A cat-classifier! Suppose we


train a model on 4000 images (Some having cats) in different settings and test
using 1000 images. We find that the model accuracy is 85% (meaning 850 of
the 1000 images were predicted correctly!) and 150 images were wrongly
tagged.

Multi-Label Classification with Deep Learning
Multi-label classification involves predicting zero or more class labels.

Unlike normal classification tasks where class labels are mutually exclusive, multi-label
classification requires specialized machine learning algorithms that support predicting
multiple mutually non-exclusive classes or “labels.”

Deep learning neural networks are an example of an algorithm that natively supports
multi-label classification problems. Neural network models for multi-label classification
tasks can be easily defined and evaluated using the Keras deep learning library.

• Multi-label classification is a predictive modeling task that involves predicting zero or more mutually non-exclusive class labels.
• Neural network models can be configured for multi-label classification tasks.
• A neural network for multi-label classification can be evaluated and used to make predictions for new data.

Multi-Label Classification
For multi-label classification, the data has more than one target variable (label), and the cardinality of each label should be 2 (binary). The Stack Overflow tag prediction dataset is an example of a multi-label classification problem. In this type of classification problem, there is more than one output prediction.
Most classification machine learning algorithms cannot handle multi-label classification directly. One needs to use a wrapper around the machine learning algorithm to train on multi-label classification data. Scikit-learn comes with two wrapper implementations (see the sketch after this list):

• MultiOutputClassifier: This strategy fits one binary


classifier per target. This wrapper can be used for
estimators that do not support multi-target classification
such as logistic regression.

• ClassifierChain: This strategy can be used when the classification targets are dependent on each other. In this strategy, a chain of binary estimators is trained, each using the independent features along with the predictions of the previous estimators.
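Here is a brief sketch of both wrappers on a synthetic multi-label dataset; the base estimator (logistic regression) and dataset parameters are assumptions for illustration.

from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

X, y = make_multilabel_classification(n_samples=1000, n_features=10, n_classes=3, random_state=1)

# One independent binary classifier per label.
independent = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# A chain: each classifier also receives the predictions of the previous ones.
chained = ClassifierChain(LogisticRegression(max_iter=1000)).fit(X, y)

print(independent.predict(X[:3]))
print(chained.predict(X[:3]))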

Classification is a predictive modeling problem that involves outputting a class label


given some input

It is different from regression tasks that involve predicting a numeric value.

Typically, a classification task involves predicting a single label. Alternately, it might


involve predicting the likelihood across two or more class labels. In these cases, the
classes are mutually exclusive, meaning the classification task assumes that the input
belongs to one class only.

Some classification tasks require predicting more than one class label. This means that
class labels or class membership are not mutually exclusive. These tasks are referred to
as multiple label classification, or multi-label classification for short.
In multi-label classification, zero or more labels are required as output for each input
sample, and the outputs are required simultaneously. The assumption is that the output
labels are a function of the inputs.
We can create a synthetic multi-label classification dataset using
the make_multilabel_classification() function in the scikit-learn library.
Our dataset will have 1,000 samples with 10 input features. The dataset will have three
class label outputs for each sample and each class will have one or two values (0 or 1,
e.g. present or not present).

The complete example of creating and summarizing the synthetic multi-label


classification dataset is listed below.
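A listing consistent with this description (and with the output shown further down) might look like the following sketch; the n_labels=2 and random_state=1 settings are assumptions.

from sklearn.datasets import make_multilabel_classification

# define the synthetic multi-label dataset
X, y = make_multilabel_classification(n_samples=1000, n_features=10, n_classes=3,
                                      n_labels=2, random_state=1)
# summarize the dataset shape
print(X.shape, y.shape)
# summarize the first 10 examples
for i in range(10):
    print(X[i], y[i])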

Running the example creates the dataset and summarizes the shape of the input and
output elements.

We can see that, as expected, there are 1,000 samples, each with 10 input features and
three output features.

The first 10 rows of inputs and outputs are summarized and we can see that all inputs
for this dataset are numeric and that output class labels have 0 or 1 values for each of
the three class labels.

(1000, 10) (1000, 3)

[ 3. 3. 6. 7. 8. 2. 11. 11. 1. 3.] [1 1 0]
[7. 6. 4. 4. 6. 8. 3. 4. 6. 4.] [0 0 0]
[ 5. 5. 13. 7. 6. 3. 6. 11. 4. 2.] [1 1 0]
[1. 1. 5. 5. 7. 3. 4. 6. 4. 4.] [1 1 1]
[ 4. 2. 3. 13. 7. 2. 4. 12. 1. 7.] [0 1 0]
[ 4. 3. 3. 2. 5. 2. 3. 7. 2. 10.] [0 0 0]
[ 3. 3. 3. 11. 6. 3. 4. 14. 1. 3.] [0 1 0]
[ 2. 1. 7. 8. 4. 5. 10. 4. 6. 6.] [1 1 1]
[ 5. 1. 9. 5. 3. 4. 11. 8. 1. 8.] [1 1 1]
[ 2. 11. 7. 6. 2. 2. 9. 11. 9. 3.] [1 1 1]

Multi-Output Classification with Scikit-Learn

Multi-output classification is a type of machine learning that
predicts multiple outputs simultaneously. In multi-output
classification, the model will give two or more outputs after
making any prediction. In other types of classifications, the
model usually predicts only a single output.

An example of a multi-output classification model is one that predicts the type and the color of a fruit simultaneously. The type of fruit can be orange, mango, or pineapple. The color can be red, green, yellow, or orange. Multi-output classification solves this problem and gives two prediction results.

In this tutorial, we will build a multi-output text classification model


using the Netflix dataset. The model will classify the input text as
either TV Show or Movie. This will be the first output. The model will also
classify the rating as: TV-MA, TV-14, TV-PG, R, PG-13 and TV-Y. The rating will
be the second output. We will use the Scikit-Learn MultiOutputClassifier to build this model.

Multi-output regression is similar to multi-label classification, but it applies to regression tasks. In this kind of problem, the data has more than one continuous target label. Some regression algorithms, such as linear regression and the K-NN regressor, handle multi-output regression natively, as they inherently implement a direct multi-output regressor.

For the regression algorithms that don’t handle multi-output


regression, the scikit-learn package offers wrappers that can be used
for such tasks:

• MultiOutputRegressor: This strategy fits one regressor estimator per target. It can be used for regressor estimators that don't handle multi-output regression, such as LinearSVR.
• RegressorChain: This strategy can be used when the regression targets are dependent on each other. In this strategy, a chain of regressor estimators is trained, each using the independent continuous features along with the predictions of the previous estimators.
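A short sketch of both regression wrappers on a synthetic two-target dataset; LinearSVR is used as the single-output base estimator, and all dataset settings are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.svm import LinearSVR

X, y = make_regression(n_samples=500, n_features=8, n_targets=2, random_state=3)

# One LinearSVR per target (LinearSVR itself predicts only a single output).
multi = MultiOutputRegressor(LinearSVR(max_iter=10_000)).fit(X, y)

# Chained variant: each regressor also receives the previous regressor's prediction.
chain = RegressorChain(LinearSVR(max_iter=10_000)).fit(X, y)

print(multi.predict(X[:2]))
print(chain.predict(X[:2]))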

ENSEMBLE LEARNING

Ensemble learning, in general, is a model that makes predictions


based on a number of different models. By combining individual
models, the ensemble model tends to be more flexible🤸‍♀️ (less bias)
and less data-sensitive🧘‍♀️ (less variance).

Two most popular ensemble methods are bagging and boosting.

• Bagging: Training a bunch of individual models in a


parallel way. Each model is trained by a random subset of
the data

• Boosting: Training a bunch of individual models in a


sequential way. Each individual model learns from
mistakes made by the previous model.
Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees trained on various subsets of the given dataset and combines them to improve the predictive accuracy on that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, outputs the final prediction.

A greater number of trees in the forest generally leads to higher accuracy and reduces the risk of overfitting.

The below diagram explains the working of the Random Forest algorithm:
Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into subsets and
given to each decision tree. During the training phase, each decision tree produces a
prediction result, and when a new data point occurs, then based on the majority of
results, the Random Forest classifier predicts the final decision. Consider the below
image:
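A minimal random forest sketch with scikit-learn (dataset and hyperparameters are assumptions for illustration): each tree is trained on a bootstrap sample and the forest aggregates their votes.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)  # each tree sees a different bootstrap sample of the data

print("test accuracy:", forest.score(X_test, y_test))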
ML | Voting Classifier
A Voting Classifier is a machine learning model that trains on an ensemble of several models and predicts an output (class) based on the class the constituent models collectively favor most.
It simply aggregates the findings of each classifier passed into the Voting Classifier and predicts the output class by majority vote. The idea is that, instead of creating separate dedicated models and evaluating the accuracy of each, we create a single model which is trained on these models and predicts the output based on their combined vote for each output class.
Voting Classifier supports two types of votings.
1. Hard Voting: In hard voting, the predicted output class is a class with
the highest majority of votes i.e the class which had the highest
probability of being predicted by each of the classifiers. Suppose three
classifiers predicted the output class(A, A, B), so here the majority
predicted A as output. Hence A will be the final prediction.
2. Soft Voting: In soft voting, the output class is the prediction based on
the average of probability given to that class. Suppose given some input
to three models, the prediction probability for class A = (0.30, 0.47,
0.53) and B = (0.20, 0.32, 0.40). So the average for class A is 0.4333 and B
is 0.3067, the winner is clearly class A because it had the highest
probability averaged by each classifier.
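Both voting modes can be sketched with scikit-learn's VotingClassifier; the three base models and the Iris data are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority of predicted classes
soft = VotingClassifier(estimators, voting="soft").fit(X, y)  # average of predicted probabilities

print(hard.predict(X[:3]), soft.predict(X[:3]))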

Bagging & Pasting

Bagging means bootstrap + aggregating, and it is an ensemble method in which we first bootstrap our data and, for each bootstrap sample, train one model. After that, we aggregate the models with equal weights. When sampling is done without replacement, the method is called pasting.
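The difference between bagging and pasting can be sketched with scikit-learn's BaggingClassifier, where the bootstrap flag controls whether sampling is done with replacement; the base estimator and settings below are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# bootstrap=True: sampling with replacement (bagging)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True, random_state=0).fit(X, y)

# bootstrap=False: sampling without replacement (pasting)
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=0.8, bootstrap=False, random_state=0).fit(X, y)

print(bagging.score(X, y), pasting.score(X, y))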

Out-Of-Bag Sample
In the example above, we can observe that some animals are repeated while making a bootstrap sample, and some animals do not occur even once in the sample. Here, Sample 1 does not contain Rat and Cow, whereas Sample 3 contains all the animals of the main training set.

While making the samples, data points were chosen randomly and with
replacement, and the data points which fail to be a part of that particular
sample are known as OUT-OF-BAG points.

Out-of-Bag Score (OOB_Score)

Where does the OOB score come into the picture? The OOB score is a powerful validation technique, used especially with the Random Forest algorithm, that provides a low-variance estimate of model performance.

Here, we have a training set with 5 rows and a classification target variable
of whether the animals are domestic/pet?
Out of multiple decision trees built in the random forest, a bootstrapped
sample for one particular decision tree, say DT_1 is shown below

Here, Rat and Cat data have been left out. And since, Rat and Cat are OOB
for DT_1, we would predict the values for Rat and Cat using DT_1. (Note:
Data of Rat and Cat hasn’t been seen by DT_1 while training the tree.)

Similarly, every data point is passed for prediction to trees where it would be
behaving as OOB and an aggregated prediction is recorded for each row.

The OOB score is computed as the fraction of out-of-bag samples that are correctly predicted, and the OOB error is the fraction of out-of-bag samples that are classified wrongly.
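In scikit-learn, a random forest can report this directly; a minimal sketch (dataset assumed for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# Accuracy estimated only from the samples each tree did not see during training.
print("OOB score:", forest.oob_score_)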


What Is The Random Subspace Method?

The random subspace method is a technique used in order to introduce


variation among the predictors in an ensemble model. This is done as
decreasing the correlation between the predictors increases the
performance of the ensemble model. The random subspace method is
also known as feature or attribute bagging. What it does is it creates
subsets of the training set that only contain certain features. The chosen
number of features are randomly sampled from the training set with
replacement. However, most implementations allow the user to specify
whether or not they would like features to be sampled with or without
replacement.

These subsets are then used in order to train the predictors of an


ensemble.
Figure 1. An illustration of a
dataset created by the random subspace method.

Random Patches Method.

When the random subspace method is used along with bagging or


pasting it is known as the random patches method.

The random subspaces/patches methods and their purpose are closely related to those of bagging and pasting.
Extremely Randomized Trees Classifier (Extra Trees Classifier)
The Extra Trees Classifier is a type of ensemble learning technique which aggregates the results of multiple de-correlated decision trees collected in a "forest" to output its classification result. In concept, it is very similar to a Random Forest Classifier and only differs from it in the manner in which the decision trees in the forest are constructed.
Each decision tree in the Extra Trees forest is constructed from the original training sample. Then, at each test node, each tree is provided with a random sample of k features from the feature set, from which it must select the best feature to split the data based on some mathematical criterion (typically the Gini index). This random sampling of features leads to the creation of multiple de-correlated decision trees.
To perform feature selection using the above forest structure, during the
construction of the forest, for each feature, the normalized total reduction in the
mathematical criteria used in the decision of feature of split (Gini Index if the Gini
Index is used in the construction of the forest) is computed. This value is called
the Gini Importance of the feature. To perform feature selection, each feature is
ordered in descending order according to the Gini Importance of each feature
and the user selects the top k features according to his/her choice.
What Is Feature Importance in Machine
Learning?
Feature (variable) importance indicates how much each feature
contributes to the model prediction. Basically, it determines the
degree of usefulness of a specific variable for a current model and
prediction. For example, if we want to predict the weight of a person
based on height, age, and name, it’s obvious that the variable height will
have the strongest influence, while the variable name is not even
relevant to the person’s weight.
Overall, we represent feature importance using a numeric value that we call the score: the higher the score, the more important the feature. There are many benefits of having a feature importance score. For
instance, it’s possible to determine the relationship between
independent variables (features) and dependent variables (targets). By
analyzing variable importance scores, we would be able to find out
irrelevant features and exclude them. Reducing the number of not
meaningful variables in the model may speed up the model or even
improve its performance.
Also, feature importance is commonly used as a tool for ML model
interpretability. From the scores, it’s possible to explain why the ML
model makes particular predictions and how we can manipulate
features to change its predictions.
There are many ways of calculating feature importance, but generally,
we can divide them into two groups:

• Model agnostic
• Model dependent
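As a sketch of a model-dependent score, the impurity-based ("Gini") importances of an extra-trees ensemble can be read off directly in scikit-learn; the Iris data is an illustrative assumption.

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

data = load_iris()
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

# Higher score = the feature contributes more to the ensemble's decisions.
for name, score in sorted(zip(data.feature_names, model.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")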

Boosting in Machine Learning | Boosting and AdaBoost


Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building a model from a series of weak models. First, a model is built from the training data. Then a second model is built which tries to correct the errors present in the first model. This procedure continues, and models are added, until either the complete training dataset is predicted correctly or the maximum number of models has been added.
AdaBoost was the first really successful boosting algorithm developed for the purpose of binary classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that combines multiple "weak classifiers" into a single "strong classifier".

Algorithm:

1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified
data points.
3. Increase the weight of the wrongly classified data points.
4. if (got required results)
Goto step 5
else
Goto step 2

5. End
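In practice (a minimal sketch with scikit-learn; the dataset and hyperparameters are assumptions for illustration), AdaBoost with decision-stump weak classifiers can be used like this:

# Sketch: AdaBoost combining decision-stump "weak classifiers" into a "strong classifier".
# Dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    # (older scikit-learn versions call this parameter base_estimator)
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X_train, y_train)   # misclassified points receive higher weight each round
print("Test accuracy:", ada.score(X_test, y_test))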
ML – Gradient Boosting

Gradient Boosting is a popular boosting algorithm. In gradient boosting, each


predictor corrects its predecessor's error. In contrast to AdaBoost, the weights of the training instances are not tweaked; instead, each predictor is trained using the residual errors of its predecessor as labels.
There is a technique called the Gradient Boosted Trees whose base learner is
CART (Classification and Regression Trees).
The below diagram explains how gradient boosted trees are trained for
regression problems.

Gradient Boosted Trees for Regression

The ensemble consists of N trees. Tree1 is trained using the feature matrix X and
the labels y. The predictions labelled y1(hat) are used to determine the training
set residual errors r1. Tree2 is then trained using the feature matrix X and the
residual errors r1 of Tree1 as labels. The predicted results r1(hat) are then used
to determine the residual r2. The process is repeated until all the N trees forming
the ensemble are trained.
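The same training loop can be sketched directly (a simplified illustration, not a production implementation; the data, tree depth, and step size are assumptions):

# Sketch: gradient boosting for regression with CART base learners,
# where each new tree is fit to the residual errors of the ensemble so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

n_trees, learning_rate = 50, 0.1
prediction = np.zeros_like(y)              # start from a zero prediction
trees = []
for _ in range(n_trees):
    residuals = y - prediction             # r_k: errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Final training MSE:", np.mean((y - prediction) ** 2))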

Stacking in Machine Learning


Stacking is a way to ensemble multiple classification or regression models. There are many ways to ensemble models; the widely known ones are Bagging and Boosting. In bagging, multiple similar models with high variance are averaged to decrease variance. Boosting builds multiple incremental models to decrease the bias, while keeping variance small.
Stacking (sometimes called Stacked Generalization) is a different paradigm. The
point of stacking is to explore a space of different models for the same problem.
The idea is that you can attack a learning problem with different types of models which are capable of learning some part of the problem, but not the whole space of the problem. So, you build multiple different learners and use them to produce an intermediate prediction, one prediction for each learned model. Then you add a new model which learns the same target from those intermediate predictions.
This final model is said to be stacked on the top of the others, hence the name.
Thus, you might improve your overall performance, and often you end up with a
model which is better than any individual intermediate model. Notice however,
that it does not give you any guarantee, as is often the case with any machine
learning technique.
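A brief sketch with scikit-learn's StackingClassifier (the base learners, meta-learner, and data are illustrative assumptions):

# Sketch: stacking different model types and learning a final model
# on top of their intermediate predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # the model "stacked on top"
    cv=5,                                  # intermediate predictions come from cross-validation
)
stack.fit(X, y)
print("Training accuracy:", stack.score(X, y))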

UNIT 3

What is Artificial Neural Network?


The term "Artificial Neural Network" is derived from Biological neural networks that
develop the structure of a human brain. Similar to the human brain that has neurons
interconnected to one another, artificial neural networks also have neurons that are
interconnected to one another in various layers of the networks. These neurons are
known as nodes.


The given figure illustrates the typical diagram of Biological Neural Network.

The typical Artificial Neural Network looks something like the given figure.

Dendrites from Biological Neural Network represent inputs in Artificial Neural


Networks, cell nucleus represents Nodes, synapse represents Weights, and Axon
represents Output.

Relationship between Biological neural network and artificial neural network:


Biological Neural Network Artificial Neural Network

Dendrites Inputs

Cell nucleus Nodes

Synapse Weights

Axon Output

An Artificial Neural Network, in the field of Artificial Intelligence, attempts to mimic the network of neurons that makes up a human brain so that computers can understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave simply like interconnected brain cells.

There are around 1,000 billion neurons in the human brain. Each neuron has an association point somewhere in the range of 1,000 to 100,000. In the human brain,
data is stored in such a manner as to be distributed, and we can extract more than one
piece of this data when necessary from our memory parallelly. We can say that the
human brain is made up of incredibly amazing parallel processors.

We can understand the artificial neural network with an example. Consider a digital logic gate that takes an input and gives an output, such as an "OR" gate, which takes two inputs. If one or both of the inputs are "On," then we get "On" as output. If both inputs are "Off," then we get "Off" as output. Here the output depends only on the input. Our brain does not perform the same task: the relationship between outputs and inputs keeps changing because the neurons in our brain are "learning."

The architecture of an artificial neural network:


To understand the architecture of an artificial neural network, we have to understand what a neural network consists of. A neural network consists of a large number of artificial neurons, termed units, arranged in a sequence of layers. Let us look at the various types of layers available in an artificial neural network.

Artificial Neural Network primarily consists of three layers:


Input Layer:

As the name suggests, it accepts inputs in several different formats provided by the
programmer.

Hidden Layer:

The hidden layer sits between the input and output layers. It performs all the calculations to find hidden features and patterns.

Output Layer:

The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.

BUILDING INTELLIGENT MACHINES

Imagine a future where you are not stuck in traffic because machines
are driving the vehicles, not humans. A future where everyone has a
personal assistant to do mundane tasks. A future where industry is
more productive and per capita income is rising.
That future is already here, just not evenly distributed across the
world. Some parts of the world are experiencing it earlier than the
others. A big driving force behind this development is advancement
of AI technologies.

The impact of AI on economy and industry productivity is growing


exponentially. The chart referenced below shows that AI is doubling the baseline growth in most sectors. This happens rarely. It means new markets, improved GDP, improved per capita income and improved quality of life.

Ref. Accenture AI Report

Industrial revolution took people out of field jobs and moved them
to offices to manage processes and troubleshoot problems. Humans
switched to managing the job while the real job was done by robotic
assembly lines or machines.
Perceptron in Machine Learning
In Machine Learning and Artificial Intelligence, Perceptron is one of the most commonly used terms. It is a primary step in learning Machine Learning and Deep Learning technologies, and it consists of a set of weights, input values or scores, and a threshold. Perceptron is a building block of an Artificial Neural Network. Frank Rosenblatt invented the Perceptron in 1957 (the mid-20th century) for performing certain calculations to detect input data capabilities or business intelligence. Perceptron is a linear Machine Learning algorithm used for supervised
learning for various binary classifiers. This algorithm enables neurons to learn elements
and processes them one by one during preparation. In this tutorial, "Perceptron in
Machine Learning," we will discuss in-depth knowledge of Perceptron and its basic
functions in brief. Let's start with the basic introduction of Perceptron.

What is the Perceptron model in Machine Learning?


Perceptron is Machine Learning algorithm for supervised learning of various binary
classification tasks. Further, Perceptron is also understood as an Artificial Neuron
or neural network unit that helps to detect certain input data computations in
business intelligence.

Perceptron model is also treated as one of the best and simplest types of Artificial
Neural networks. However, it is a supervised learning algorithm of binary classifiers.
Hence, we can consider it as a single-layer neural network with four main parameters,
i.e., input values, weights and Bias, net sum, and an activation function.

Basic Components of Perceptron


Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which
contains three main components. These are as follows:

o Input Nodes or Input Layer:

This is the primary component of Perceptron which accepts the initial data into the
system for further processing. Each input node contains a real numerical value.

o Weight and Bias:

Weight parameter represents the strength of the connection between units. This is
another most important parameter of Perceptron components. Weight is directly
proportional to the strength of the associated input neuron in deciding the output.
Further, Bias can be considered as the line of intercept in a linear equation.

o Activation Function:

These are the final and important components that help to determine whether the
neuron will fire or not. Activation Function can be considered primarily as a step
function.

Types of Activation functions:

o Sign function
o Step function, and
o Sigmoid function
How does Perceptron work?
In Machine Learning, Perceptron is considered as a single-layer neural network that
consists of four main parameters named input values (Input nodes), weights and Bias,
net sum, and an activation function. The perceptron model begins with the
multiplication of all input values and their weights, then adds these values together to
create the weighted sum. Then this weighted sum is applied to the activation function
'f' to obtain the desired output. This activation function is also known as the step
function and is represented by 'f'.

This step function or Activation function plays a vital role in ensuring that output is
mapped between required values (0,1) or (-1,1). It is important to note that the weight
of input is indicative of the strength of a node. Similarly, an input's bias value gives the
ability to shift the activation function curve up or down.
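A minimal numeric sketch of this computation (the weights, bias, and inputs below are made-up values for illustration):

# Sketch: a single perceptron computing a weighted sum plus bias,
# then applying a step activation function. All values are illustrative.
import numpy as np

def step(z):
    # unit step activation: fires (1) if the weighted sum is >= 0, else 0
    return np.where(z >= 0, 1, 0)

x = np.array([1.0, 0.0])        # input values
w = np.array([0.6, 0.4])        # weights (connection strengths)
b = -0.5                        # bias shifts the activation threshold

weighted_sum = np.dot(w, x) + b
output = step(weighted_sum)
print(output)                   # 1, since 0.6 - 0.5 >= 0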

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These are as
follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model

Single Layer Perceptron Model:


This is one of the simplest types of Artificial Neural Networks (ANN). A single-layered perceptron model consists of a feed-forward network and also includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.

"Single-layer perceptron can learn only linearly separable patterns."

Multi-Layered Perceptron Model:


Like a single-layer perceptron model, a multi-layer perceptron model also has the
same model structure but has a greater number of hidden layers.

The multi-layer perceptron model is typically trained with the Backpropagation algorithm, which executes in two stages as follows:

o Forward Stage: Activations propagate from the input layer to the output layer in the forward stage.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. The error between the actual and desired output is propagated backward, starting at the output layer and ending at the input layer.

Feed Forward Process in Deep Neural


Network
Now we know how combining lines with different weights and biases can result in non-linear models. How does a neural network know what weight and bias values to have in each layer? It is no different from how we did it for the single perceptron model.

We are still making use of a gradient descent optimization algorithm which acts to
minimize the error of our model by iteratively moving in the direction with the steepest
descent, the direction which updates the parameters of our model while ensuring the
minimal error. It updates the weight of every model in every single layer. We will talk
more about optimization algorithms and backpropagation later.

The subsequent training of our neural network is what allows it to separate our data samples with some decision boundary.
"The process of receiving an input to produce some kind of output to make some kind
of prediction is known as Feed Forward." Feed Forward neural network is the core of
many other important neural networks such as convolution neural network.


In the feed-forward neural network, there are no feedback loops or connections in the network. There is simply an input layer, a hidden layer, and an output layer.

There can be multiple hidden layers, depending on what kind of data you are dealing with. The number of hidden layers is known as the depth of the neural network. A deeper neural network can learn more complex functions. The input layer first provides the neural network with data, and the output layer then makes predictions on that data based on a series of functions. The ReLU function is the most commonly used activation function in deep neural networks.

To gain a solid understanding of the feed-forward process, let's see this


mathematically.

1) The first input is fed to the network, represented as x1, x2, and 1, where 1 is the bias value.
2) Each input is multiplied by a weight with respect to the first and second model to obtain its probability of being in the positive region in each model.

So, we will multiply our inputs by a matrix of weights using matrix multiplication.

3) After that, we will take the sigmoid of our scores, which gives us the probability of the point being in the positive region in both models.

4) We multiply the probability which we have obtained from the previous step with the
second set of weights. We always include a bias of one whenever taking a combination
of inputs.

And as we know to obtain the probability of the point being in the positive region of
this model, we take the sigmoid and thus producing our final output in a feed-forward
process.

Let's take the neural network which we had previously, with the linear models in the hidden layer combining to form the non-linear model in the output layer. We will use this non-linear model to produce an output that describes the probability of the point being in the positive region. The point is represented by (2, 2). Along with the bias, we represent the input as the vector (2, 2, 1).

Recall the first linear model in the hidden layer and the equation that defined it: z1 = -4*x1 - 1*x2 + 12. In other words, in the first layer, to obtain the linear combination, the inputs are multiplied by -4 and -1 and the bias value is multiplied by twelve.

In our second model, the weights of the inputs are multiplied by -1/5 and 1, and the bias is multiplied by three, to obtain the linear combination of that same point: z2 = -(1/5)*x1 + 1*x2 + 3.

Now, to obtain the probability that the point is in the positive region relative to both models, we apply the sigmoid to both scores.

The second layer contains the weights which dictate the combination of the linear models in the first layer to obtain the non-linear model in the second layer. The weights are 1.5 and 1, with a bias value of 0.5.

Now, we multiply our probabilities from the first layer by the second set of weights: score = 1.5*sigmoid(z1) + 1*sigmoid(z2) + 0.5.

Finally, we take the sigmoid of this final score.

This is the complete math behind the feed-forward process, where the inputs from the input layer traverse the entire depth of the neural network. In this example, there is only one hidden layer. Whether there is one hidden layer or twenty, the computational processes are the same for all hidden layers.
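The same computation written out numerically (a small numpy sketch that simply re-uses the weights and the point (2, 2) from the worked example above):

# Sketch: feed-forward pass for the worked example above, with
# hidden-layer weights (-4, -1, bias 12) and (-1/5, 1, bias 3),
# and output-layer weights (1.5, 1, bias 0.5) for the point (2, 2).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 2.0, 1.0])                 # inputs plus a bias input of 1

W_hidden = np.array([[-4.0, -1.0, 12.0],      # first linear model
                     [-0.2,  1.0,  3.0]])     # second linear model
hidden = sigmoid(W_hidden @ x)                # probabilities from the first layer

W_output = np.array([1.5, 1.0, 0.5])          # second-layer weights and bias
score = W_output @ np.append(hidden, 1.0)     # combine with a bias of one
print(sigmoid(score))                         # final feed-forward output (about 0.94)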

Linear Neural Networks


The linear networks discussed in this section are similar to the perceptron, but their transfer function
is linear rather than hard-limiting. This allows their outputs to take on any value, whereas the
perceptron output is limited to either 0 or 1. Linear networks, like the perceptron, can only solve
linearly separable problems.
Here you design a linear network that, when presented with a set of given input vectors, produces
outputs of corresponding target vectors. For each input vector, you can calculate the network's output
vector. The difference between an output vector and its target vector is the error. You would like to
find values for the network weights and biases such that the sum of the squares of the errors is
minimized or below a specific value. This problem is manageable because linear systems have a
single error minimum. In most cases, you can calculate a linear network directly, such that its error is
a minimum for the given input vectors and target vectors. In other cases, numerical problems prohibit
direct calculation. Fortunately, you can always train the network to have a minimum error by using the
least mean squares (Widrow-Hoff) algorithm.
This section introduces linearlayer, a function that creates a linear layer, and newlind, a function
that designs a linear layer for a specific purpose.
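Outside of MATLAB, the same direct design can be sketched with ordinary least squares (a minimal numpy illustration, analogous in spirit to what newlind does; the input and target values are made up):

# Sketch: designing a linear layer a = W p + b directly, by solving the
# least-squares problem for the given input vectors and target vectors.
import numpy as np

P = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 1.0]])   # input vectors (rows)
T = np.array([4.0, 5.0, 9.0, 6.0])                                # target outputs

# Append a column of ones so the bias b is solved for along with the weights W.
A = np.hstack([P, np.ones((P.shape[0], 1))])
solution, *_ = np.linalg.lstsq(A, T, rcond=None)
W, b = solution[:-1], solution[-1]

errors = T - (P @ W + b)
print("Weights:", W, "bias:", b, "sum of squared errors:", np.sum(errors ** 2))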

Neuron Model
A linear neuron with R inputs is shown below.

This network has the same basic structure as the perceptron. The only difference is that the
linear neuron uses a linear transfer function purelin.

The linear transfer function calculates the neuron's output by simply returning the value
passed to it.
a = purelin(n) = purelin(Wp + b) = Wp + b
This neuron can be trained to learn an affine function of its inputs, or to find a linear
approximation to a nonlinear function. A linear network cannot, of course, be made to
perform a nonlinear computation.
Network Architecture
The linear network shown below has one layer of S neurons connected to R inputs through a
matrix of weights W.
Note that the figure on the right defines an S-length output vector a.
A single-layer linear network is shown. However, this network is just as capable as multilayer
linear networks. For every multilayer linear network, there is an equivalent single-layer linear
network.
What are the limitations of linear neural networks?
• Linear neurons are easy to compute with, but they run into
serious limitations. In fact, it can be shown that any feed-
forward neural network consisting of only linear neurons can be expressed as
a network with no hidden layers.

What is an activation function and why use them?


The activation function decides whether a neuron should be activated or not
by calculating the weighted sum and further adding bias to it. The purpose of
the activation function is to introduce non-linearity into the output of a
neuron.
Sigmoid Function

• It is a function which is plotted as ‘S’ shaped graph.


• Equation : A = 1/(1 + e^(-x))
• Nature : Non-linear. Notice that for x values between -2 and 2, the curve is very steep. This means that small changes in x bring about large changes in the value of Y.
• Value Range : 0 to 1
• Uses : Usually used in the output layer of a binary classifier, where the result is either 0 or 1. Since the value of the sigmoid function lies between 0 and 1 only, the result can easily be predicted to be 1 if the value is greater than 0.5 and 0 otherwise.
Tanh Function

• The activation that almost always works better than the sigmoid function is the Tanh function, also known as the Tangent Hyperbolic function. It is actually a mathematically shifted version of the sigmoid function; both are similar and can be derived from each other.
• Equation :- tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) = 2*sigmoid(2x) - 1
• Value Range :- -1 to +1
• Nature :- non-linear
• Uses :- Usually used in hidden layers of a neural network, as its values lie between -1 and 1; hence the mean of the hidden layer outputs comes out to be 0 or very close to it, which helps in centering the data by bringing the mean close to 0. This makes learning for the next layer much easier.
RELU Function

• It stands for Rectified Linear Unit. It is the most widely used activation function, chiefly implemented in the hidden layers of neural networks.
• Equation :- A(x) = max(0, x). It gives an output of x if x is positive and 0 otherwise.
• Value Range :- [0, inf)
• Nature :- non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.
• Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At a time, only a few neurons are activated, making the network sparse and efficient for computation.
In simple words, RELU learns much faster than sigmoid and Tanh function.
Softmax Function

The softmax function is also a type of sigmoid function but is handy when we are trying to handle multi-class classification problems.
• Nature :- non-linear
• Uses :- Usually used when trying to handle multiple classes. The softmax function is commonly found in the output layer of image classification problems. It squeezes the output for each class between 0 and 1 and divides by the sum of the outputs.
• Output :- The softmax function is ideally used in the output layer of the classifier, where we are actually trying to attain the probabilities that define the class of each input.
• The basic rule of thumb is that if you really don't know which activation function to use, then simply use ReLU, as it is a general-purpose activation function for hidden layers and is used in most cases these days.
• If your output is for binary classification, then the sigmoid function is a very natural choice for the output layer.
• If your output is for multi-class classification, then Softmax is very useful for predicting the probabilities of each class.
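The four functions above can be written out directly (a small numpy sketch, for reference):

# Sketch: the activation functions discussed above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                         # squashes values into (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)                 # max(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))                 # subtract the max for numerical stability
    return e / e.sum()                        # outputs sum to 1 (class probabilities)

z = np.array([1.0, 2.0, -1.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), sep="\n")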


The Fast-Food Problem

We’re beginning to understand how we can tackle some interesting problems


using deep learning, but one big question still remains: how exactly do we
figure out what the parameter vectors (the weights for all of the connections in
our neural network) should be? This is accomplished by a process commonly
referred to as training (see Figure 2-1). During training, we show the neural net
a large number of training examples and iteratively modify the weights to
minimize the errors we make on the training examples. After enough
examples, we expect that our neural network will be quite effective at solving
the task it’s been trained to do.

Figure 2-1. This is the neuron we want to train for the fast-food problem
Let’s continue with the example we mentioned in the previous chapter
involving a linear neuron. As a brief review, every single day, we purchase a
restaurant meal consisting of burgers, fries, and sodas. We buy some number
of servings for each item. We want to be able to predict how much a meal is
going to cost us, but the items don’t have price tags. The only thing the cashier
will tell us is the total price of the meal. We want to train a single linear
neuron to solve this problem. How do we do it?

One idea is to be intelligent about picking our training cases. For one meal we
could buy only a single serving of burgers, for another we could only buy a
single serving of fries, and then for our last meal we could buy a single serving
of soda. In general, intelligently selecting training examples is a very good
idea. There’s lots of research that shows that by engineering a clever training
set, you can make your neural network a lot more effective. The issue with
using this approach alone is that in real situations, it rarely ever gets you close
to the solution. For example, there’s no clear analog of this strategy in image
recognition. It’s just not a practical solution.

Instead, we try to motivate a solution that works well in general. Let’s say we
have a large set of training examples. Then we can calculate what the neural
network will output on the i-th training example using the simple formula in
the diagram. We want to train the neuron so that we pick the optimal weights
possible—the weights that minimize the errors we make on the training
examples. In this case, let’s say we want to minimize the square error over all
of the training examples that we encounter. More formally, if we know
that t(i) is the true answer for the i-th training example and y(i) is the value
computed by the neural network, we want to minimize the value of the error
function E:

E = (1/2) ∑_i (t(i) − y(i))^2

The squared error is zero when our model makes a perfectly correct
prediction on every training example. Moreover, the closer E is to 0, the better
our model is. As a result, our goal will be to select our parameter
vector θ (the values for all the weights in our model) such that E is as close to 0 as possible.

Now at this point you might be wondering why we need to bother ourselves
with error functions when we can treat this problem as a system of
equations. After all, we have a bunch of unknowns (weights) and we have a
set of equations (one for each training example). That would automatically
give us an error of 0, assuming that we have a consistent set of training
examples.
That’s a smart observation, but the insight unfortunately doesn’t generalize
well. Remember that although we’re using a linear neuron here, linear
neurons aren’t used very much in practice because they’re constrained in
what they can learn. And the moment we start using nonlinear neurons like
the sigmoidal, tanh, or ReLU neurons we talked about at the end of the
previous chapter, we can no longer set up a system of equations! Clearly we
need a better strategy to tackle the training process.

Gradient Descent in Machine Learning


Gradient Descent is known as one of the most commonly used optimization algorithms
to train machine learning models by means of minimizing errors between actual and
expected results. Further, gradient descent is also used to train Neural Networks.

In mathematical terminology, Optimization algorithm refers to the task of


minimizing/maximizing an objective function f(x) parameterized by x. Similarly, in
machine learning, optimization is the task of minimizing the cost function
parameterized by the model's parameters. The main objective of gradient descent is
to minimize the convex function using iteration of parameter updates. Once these
machine learning models are optimized, these models can be used as powerful tools
for Artificial Intelligence and various computer science applications.

In this tutorial on Gradient Descent in Machine Learning, we will learn in detail about
gradient descent, the role of cost functions specifically as a barometer within Machine
Learning, types of gradient descents, learning rates, etc.

What is Gradient Descent or Steepest Descent?


Gradient descent was initially discovered by Augustin-Louis Cauchy in 1847, in the mid-19th century. Gradient Descent is defined as one of the most commonly used iterative
optimization algorithms of machine learning to train the machine learning and
deep learning models. It helps in finding the local minimum of a function.

The best way to define the local minimum or local maximum of a function using
gradient descent is as follows:

o If we move towards a negative gradient or away from the gradient of the function at
the current point, it will give the local minimum of that function.
o Whenever we move towards a positive gradient or towards the gradient of the function
at the current point, we will get the local maximum of that function.

Moving towards the negative gradient to find a local minimum is known as Gradient Descent (also called steepest descent), while moving towards the positive gradient to find a local maximum is known as Gradient Ascent. The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. To achieve this goal, it performs two steps iteratively:

o Calculate the first-order derivative of the function to compute the gradient or slope of that function.
o Move in the direction opposite to the gradient from the current point by alpha times the gradient, where Alpha is defined as the Learning Rate. It is a tuning parameter in the optimization process which helps to decide the length of the steps.

What is Cost-function?
The cost function is defined as the measurement of difference or error between
actual values and expected values at the current position and present in the form
of a single real number. It helps to increase and improve machine learning efficiency
by providing feedback to this model so that it can minimize error and find the local or
global minimum. Further, it continuously iterates along the direction of the negative
gradient until the cost function approaches zero. At this steepest descent point, the
model will stop learning further. Although cost function and loss function are
considered synonymous, also there is a minor difference between them. The slight
difference between the loss function and the cost function is about the error within
the training of machine learning models, as loss function refers to the error of one
training example, while a cost function calculates the average error across an entire
training set.

The cost function is calculated after making a hypothesis with initial parameters and
modifying these parameters using gradient descent algorithms over known data to
reduce the cost function.

Hypothesis: h_θ(x) = θ_0 + θ_1 x

Parameters: θ_0, θ_1

Cost function: J(θ_0, θ_1) = (1/2m) ∑_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2

Goal: minimize J(θ_0, θ_1) over θ_0, θ_1

How does Gradient Descent work?


Before starting the working principle of gradient descent, we should know some basic
concepts to find out the slope of a line from linear regression. The equation for simple
linear regression is given as:

1. Y=mX+c

Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-
axis.
The starting point (shown in the figure above) is used to evaluate the performance, as it is considered just an arbitrary point. At this starting point, we compute the first derivative, or slope, and then use a tangent line to calculate the steepness of this slope. This slope informs the updates to the parameters (weights and bias).

The slope is steeper at the starting (arbitrary) point, but whenever new parameters are generated, the steepness gradually reduces until the lowest point is reached, which is called the point of convergence.

The main objective of gradient descent is to minimize the cost function or the error
between expected and actual. To minimize the cost function, two data points are
required:

o Direction & Learning Rate

These two factors are used to determine the partial derivative calculation of future
iteration and allow it to the point of convergence or local minimum or global minimum.
Let's discuss learning rate factors in brief;

Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is typically a small value that is evaluated and updated based on the behavior of the cost function. If the learning rate is high, it results in larger steps but also risks overshooting the minimum. A low learning rate results in small step sizes, which compromises overall efficiency but gives the advantage of more precision.
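Putting the pieces together (a short sketch with made-up data, showing the learning rate controlling the step size when fitting Y = mX + c):

# Sketch: gradient descent fitting Y = mX + c by minimizing the mean squared error.
# The data and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
Y = 3.0 * X + 2.0 + rng.normal(scale=1.0, size=100)   # true slope 3, intercept 2

m, c = 0.0, 0.0
learning_rate, epochs = 0.01, 1000
n = len(X)

for _ in range(epochs):
    Y_pred = m * X + c
    # Partial derivatives of the mean squared error with respect to m and c.
    dm = (1 / n) * np.sum((Y_pred - Y) * X)
    dc = (1 / n) * np.sum(Y_pred - Y)
    m -= learning_rate * dm        # move opposite to the gradient
    c -= learning_rate * dc

print("Estimated slope m:", m, "intercept c:", c)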

Types of Gradient Descent


Based on the error in various training models, the Gradient Descent learning algorithm
can be divided into Batch gradient descent, stochastic gradient descent, and mini-
batch gradient descent. Let's understand these different types of gradient descent:

1. Batch Gradient Descent:


Batch gradient descent (BGD) is used to find the error for each point in the training set
and update the model after evaluating all training examples. This procedure is known
as the training epoch. In simple words, it is a greedy approach where we have to sum
over all examples for each update.

Advantages of Batch gradient descent:

o It produces less noise in comparison to other gradient descent.


o It produces stable gradient descent convergence.
o It is Computationally efficient as all resources are used for all training samples.

2. Stochastic gradient descent


Stochastic gradient descent (SGD) is a type of gradient descent that runs one training example per iteration. In other words, it processes an update for each example within the dataset, adjusting the parameters one training example at a time. As it requires only one training example at a time, it is easier to store in allocated memory. However, it loses some computational efficiency in comparison to batch gradient descent because its frequent updates are noisier and require more total computation. Due to these frequent updates, the gradient is also treated as a noisy gradient. However, sometimes this noise can be helpful in finding the global minimum and escaping local minima.

Advantages of Stochastic gradient descent:

In Stochastic gradient descent (SGD), learning happens on every example, and it


consists of a few advantages over other gradient descent.

o It is easier to allocate in desired memory.


o It is relatively faster to compute than batch gradient descent.
o It is more efficient for large datasets.

3. MiniBatch Gradient Descent:


Mini-batch gradient descent is the combination of both batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and then performs the updates on those batches separately. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we can achieve a special type of gradient descent with higher computational efficiency and a less noisy gradient.

Advantages of Mini Batch gradient descent:

o It is easier to fit in allocated memory.


o It is computationally efficient.
o It produces stable gradient descent convergence.
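The three variants differ only in how many examples each parameter update is computed from, as in this rough sketch (the linear model, gradient function, and data are assumptions for illustration):

# Sketch: batch, stochastic, and mini-batch gradient descent differ only in
# how many training examples each update is computed from.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w = np.zeros(5)
lr = 0.01

def gradient(w, X_part, y_part):
    # gradient of the mean squared error of a linear model on the given examples
    return X_part.T @ (X_part @ w - y_part) / len(y_part)

# Batch gradient descent: one update per epoch, using the entire training set.
w -= lr * gradient(w, X, y)

# Stochastic gradient descent: one update per training example.
for i in range(len(y)):
    w -= lr * gradient(w, X[i:i + 1], y[i:i + 1])

# Mini-batch gradient descent: one update per small batch of examples.
batch_size = 32
for start in range(0, len(y), batch_size):
    X_b, y_b = X[start:start + batch_size], y[start:start + batch_size]
    w -= lr * gradient(w, X_b, y_b)

print(w)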

o The Delta Rule and Learning Rates

o Before we derive the exact algorithm for training our fast-food neuron,
a quick note on hyperparameters. In addition to the weight parameters
defined in our neural network, learning algorithms also require a
couple of additional parameters to carry out the training process. One of
these so-called hyperparameters is the learning rate.
o In practice, at each step of moving perpendicular to the contour, we
need to determine how far we want to walk before recalculating our
new direction. This distance needs to depend on the steepness of the
surface. Why? The closer we are to the minimum, the shorter we want
to step forward. We know we are close to the minimum, because the
surface is a lot flatter, so we can use the steepness as an indicator of
how close we are to the minimum. However, if our error surface is
rather mellow, training can potentially take a large amount of time. As a result, we often multiply the gradient by a factor ε (epsilon), the learning rate. Picking the learning rate is a hard problem (Figure 2-4). As we just discussed, if we pick a learning rate that's too small, we risk taking too long during the training process. But if we pick a learning rate that's too big, we'll most likely start diverging away from the minimum.
In Chapter 3, we’ll learn about various optimization techniques that
utilize adaptive learning rates to automate the process of selecting
learning rates.

o Figure 2-4. Convergence is difficult when our learning rate is too large
o Now, we are finally ready to derive the delta rule for training our linear neuron. In order to calculate how to change each weight, we evaluate the gradient, which is essentially the partial derivative of the error function with respect to each of the weights. In other words, we want:
o Δw_k = −ε ∂E/∂w_k = ∑_i ε x_k(i) (t(i) − y(i))
o Applying this method of changing the weights at every iteration, we are


finally able to utilize gradient descent.
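For the fast-food neuron, the delta rule works out to a few lines (a sketch; the hidden "true" item prices and the observed meals below are invented for illustration):

# Sketch: training the fast-food linear neuron with the delta rule.
# Hidden "true" prices (burger, fries, soda) and the observed meals are invented.
import numpy as np

rng = np.random.default_rng(0)
true_prices = np.array([4.0, 2.5, 1.5])
servings = rng.integers(0, 5, size=(100, 3)).astype(float)   # servings per meal
totals = servings @ true_prices                               # only the total is observed

w = np.zeros(3)                 # the weights we want to learn (the item prices)
epsilon = 0.01                  # learning rate

for _ in range(500):
    y = servings @ w                              # neuron outputs for all meals
    # delta rule: delta_w_k = sum_i epsilon * x_k(i) * (t(i) - y(i)), averaged here
    w += epsilon * servings.T @ (totals - y) / len(totals)

print("Learned prices:", w)     # should approach [4.0, 2.5, 1.5]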
Backpropagation Process in Deep Neural
Network
Backpropagation is one of the important concepts of a neural network. Our task is to
classify our data best. For this, we have to update the weights of parameter and bias,
but how can we do that in a deep neural network? In the linear regression model, we
use gradient descent to optimize the parameter. Similarly here we also use gradient
descent algorithm using Backpropagation.

For a single training example, Backpropagation algorithm calculates the gradient of


the error function. Backpropagation can be written as a function of the neural
network. Backpropagation algorithms are a set of methods used to efficiently train
artificial neural networks following a gradient descent approach which exploits the
chain rule.

The main features of Backpropagation are that it is an iterative, recursive and efficient method for calculating the updated weights that improve the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.

Now, how is the error function used in Backpropagation, and how does Backpropagation work? Let's start with an example and work through it mathematically to understand exactly how the weights are updated using Backpropagation.

Input values
X1=0.05
X2=0.10

Initial weight
W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55

Bias Values
b1=0.35 b2=0.60

Target Values
T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.
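As a sketch of this forward pass (using exactly the values listed above, and assuming the usual wiring of this standard worked example: w1, w2 feed H1, w3, w4 feed H2, and both hidden units feed both outputs):

# Sketch: forward pass for the worked example above (2 inputs, 2 hidden units,
# 2 outputs, sigmoid activations, bias b1 for the hidden layer, b2 for the output layer).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

h1 = sigmoid(w1 * x1 + w2 * x2 + b1)          # hidden unit H1
h2 = sigmoid(w3 * x1 + w4 * x2 + b1)          # hidden unit H2
o1 = sigmoid(w5 * h1 + w6 * h2 + b2)          # output O1
o2 = sigmoid(w7 * h1 + w8 * h2 + b2)          # output O2

total_error = 0.5 * (t1 - o1) ** 2 + 0.5 * (t2 - o2) ** 2
print(h1, h2, o1, o2, total_error)
# Backpropagation would now push this error backward to update w1..w8, b1, b2.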


Test Sets, Validation Sets, and
Overfitting

One of the major issues with artificial neural networks is that the models are
quite complicated. For example, let’s consider a neural network that’s pulling
data from an image from the MNIST database (28 x 28 pixels), feeds into two
hidden layers with 30 neurons, and finally reaches a softmax layer of 10
neurons. The total number of parameters in the network is nearly 25,000. This
can be quite problematic, and to understand why, let’s consider a new toy
example, illustrated in Figure 2-8.

Figure 2-8. Two potential models that might describe our dataset: a linear model versus a degree 12
polynomial
We are given a bunch of data points on a flat plane, and our goal is to find a
curve that best describes this dataset (i.e., will allow us to predict the y-
coordinate of a new point given its x-coordinate). Using the data, we train two
different models: a linear model and a degree 12 polynomial. Which curve
should we trust? The line which gets almost no training example correctly? Or
the complicated curve that hits every single point in the dataset? At this point
we might trust the linear fit because it seems much less contrived. But just to
be sure, let’s add more data to our dataset! The result is shown in Figure 2-9.

Figure 2-9. Evaluating our model on new data indicates that the linear fit is a much better model
than the degree 12 polynomial
Now the verdict is clear: the linear model is not only better subjectively but
also quantitatively (measured using the squared error metric). But this leads to
a very interesting point about training and evaluating machine learning
models. By building a very complex model, it’s quite easy to perfectly fit our
training dataset because we give our model enough degrees of freedom to
contort itself to fit the observations in the training set. But when we evaluate
such a complex model on new data, it performs very poorly. In other words,
the model does not generalize well. This is a phenomenon called overfitting, and
it is one of the biggest challenges that a machine learning engineer must
combat. This becomes an even more significant issue in deep learning, where
our neural networks have large numbers of layers containing many neurons.
The number of connections in these models is astronomical, reaching the
millions. As a result, overfitting is commonplace.
Let’s see how this looks in the context of a neural network. Let’s say we have a
neural network with two inputs, a softmax output of size two, and a hidden
layer with 3, 6, or 20 neurons. We train these networks using mini-batch
gradient descent (batch size 10), and the results, visualized using
ConvNetJS, are shown in Figure 2-10.3

Figure 2-10. A visualization of neural networks with 3, 6, and 20 neurons (in that order) in their
hidden layer
It’s already quite apparent from these images that as the number of
connections in our network increases, so does our propensity to overfit to the
data. We can similarly see the phenomenon of overfitting as we make our
neural networks deep. These results are shown in Figure 2-11, where we use
networks that have one, two, or four hidden layers of three neurons each.

Figure 2-11. A visualization of neural networks with one, two, and four hidden layers (in that
order) of three neurons each
This leads to three major observations. First, the machine learning engineer is
always working with a direct trade-off between overfitting and model
complexity. If the model isn’t complex enough, it may not be powerful enough
to capture all of the useful information necessary to solve a problem.
However, if our model is very complex (especially if we have a limited amount
of data at our disposal), we run the risk of overfitting. Deep learning takes the
approach of solving very complex problems with complex models and taking
additional countermeasures to prevent overfitting. We’ll see a lot of these
measures in this chapter as well as in later chapters.
Second, it is very misleading to evaluate a model using the data we used to
train it. Using the example in Figure 2-8, this would falsely suggest that the
degree 12 polynomial model is preferable to a linear fit. As a result, we almost
never train our model on the entire dataset. Instead, as shown in Figure 2-12,
we split up our data into a training set and a test set.

Figure 2-12. We often split our data into nonoverlapping training and test sets in order to fairly
evaluate our model
This enables us to make a fair evaluation of our model by directly measuring
how well it generalizes on new data it has not yet seen. In the real world, large
datasets are hard to come by, so it might seem like a waste to not use all of the
data at our disposal during the training process. Consequently, it may be very
tempting to reuse training data for testing or cut corners while compiling test
data. Be forewarned: if the test set isn't well constructed, we won't be able to draw any meaningful conclusions about our model.

Third, it’s quite likely that while we’re training our data, there’s a point in time
where instead of learning useful features, we start overfitting to the training
set. To avoid that, we want to be able to stop the training process as soon as we
start overfitting, to prevent poor generalization. To do this, we divide our
training process into epochs. An epoch is a single iteration over the entire
training set. In other words, if we have a training set of size d and we are doing mini-batch gradient descent with batch size b, then an epoch would be equivalent to d/b model updates. At the end of each epoch, we want to
measure how well our model is generalizing. To do this, we use an
additional validation set, which is shown in Figure 2-13. At the end of an epoch,
the validation set will tell us how the model does on data it has yet to see. If the
accuracy on the training set continues to increase while the accuracy on the
validation set stays the same (or decreases), it’s a good sign that it’s time to
stop training because we’re overfitting.
The validation set is also helpful as a proxy measure of accuracy during the
process of hyperparameter optimization. We’ve covered several hyperparameters
so far in our discussion (learning rate, minibatch size, etc.), but we have yet to
develop a framework for how to find the optimal values for these
hyperparameters. One potential way to find the optimal setting of
hyperparameters is by applying a grid search, where we pick a value for each
hyperparameter from a finite set of options
(e.g., ε ∈ {0.001, 0.01, 0.1}, batch size ∈ {16, 64, 128}, ...), and train the model with
every possible permutation of hyperparameter choices. We elect the
combination of hyperparameters with the best performance on the validation
set, and report the accuracy of the model trained with best combination on the
test set.4
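Such a grid search is conceptually just a set of nested loops (a sketch; train_model, evaluate, train_set, validation_set, and test_set are hypothetical placeholders for whatever training and evaluation routines are actually being used):

# Sketch: grid search over learning rate and batch size.
# `train_model` and `evaluate` are hypothetical placeholders for the training
# and validation-set evaluation routines; the data sets are assumed to exist.
import itertools

learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [16, 64, 128]

best_score, best_config = float("-inf"), None
for lr, batch_size in itertools.product(learning_rates, batch_sizes):
    model = train_model(train_set, lr=lr, batch_size=batch_size)   # hypothetical
    score = evaluate(model, validation_set)                        # hypothetical
    if score > best_score:
        best_score, best_config = score, (lr, batch_size)

print("Best hyperparameters:", best_config)
# The model trained with best_config is then evaluated once on the test set.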

Figure 2-13. In deep learning we often include a validation set to prevent overfitting during the
training process
With this in mind, before we jump into describing the various ways to directly
combat overfitting, let’s outline the workflow we use when building and
training deep learning models. The workflow is described in detail in Figure 2-
14. It is a tad intricate, but it’s critical to understand the pipeline in order to
ensure that we’re properly training our neural networks.
First we define our problem rigorously. This involves determining our inputs,
the potential outputs, and the vectorized representations of both. For instance,
let’s say our goal was to train a deep learning model to identify cancer. Our
input would be an RGB image, which can be represented as a vector of pixel
values. Our output would be a probability distribution over three mutually
exclusive possibilities: 1) normal, 2) benign tumor (a cancer that has yet to
metastasize), or 3) malignant tumor (a cancer that has already metastasized to
other organs).

After we define our problem, we need to build a neural network architecture


to solve it. Our input layer would have to be of appropriate size to accept the
raw data from the image, and our output layer would have to be a softmax of
size 3. We will also have to define the internal architecture of the network
(number of hidden layers, the connectivities, etc.). We’ll further discuss the
architecture of image recognition models when we talk about convolutional
neural networks in Chapter 4. At this point, we also want to collect a
significant amount of data for training or modeling. This data would probably
be in the form of uniformly sized pathological images that have been labeled
by a medical expert. We shuffle and divide this data up into separate training,
validation, and test sets.
Figure 2-14. Detailed workflow for training and evaluating a deep learning model
Finally, we’re ready to begin gradient descent. We train the model on our
training set for an epoch at a time. At the end of each epoch, we ensure that
our error on the training set and validation set is decreasing. When one of
these stops improving, we terminate and make sure we're happy with the
model’s performance on the test data. If we’re unsatisfied, we need to rethink
our architecture or reconsider whether the data we collect has the
information required to make the prediction we’re interested in making. If our
training set error stopped improving, we probably need to do a better job of
capturing the important features in our data. If our validation set error
stopped improving, we probably need to take measures to prevent overfitting.

If, however, we are happy with the performance of our model on the training
data, then we can measure its performance on the test data, which the model
has never seen before this point. If it is unsatisfactory, we need more data in
our dataset because the test set seems to consist of example types that weren’t
well represented in the training set. Otherwise, we are finished!

Preventing Overfitting in Deep Neural


Networks

There are several techniques that have been proposed to prevent overfitting
during the training process. In this section, we’ll discuss these techniques in
detail.

One method of combatting overfitting is called regularization. Regularization


modifies the objective function that we minimize by adding additional terms
that penalize large weights. In other words, we change the objective function
so that it becomes Error + λf(θ), where f(θ) grows larger as the components of θ grow larger, and λ is the regularization strength (another hyperparameter). The value we choose for λ determines how much we want to protect against overfitting. A λ = 0 implies that we do not take any measures against the possibility of overfitting. If λ is too large, then our model will prioritize keeping θ as small as possible over trying to find the parameter values that perform well on our training set. As a result, choosing λ is a very important task and can require some trial and error.
The most common type of regularization in machine learning
is L2 regularization.5 It can be implemented by augmenting the error function
with the squared magnitude of all weights in the neural network. In other
words, for every weight w in the neural network, we add (1/2)λw^2 to the error
function. The L2 regularization has the intuitive interpretation of heavily
penalizing peaky weight vectors and preferring diffuse weight vectors. This
has the appealing property of encouraging the network to use all of its inputs a
little rather than using only some of its inputs a lot. Of particular note is that
during the gradient descent update, using the L2 regularization ultimately
means that every weight is decayed linearly to zero. Because of this
phenomenon, L2 regularization is also commonly referred to as weight decay.
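In code, L2 regularization amounts to one extra term in the loss and its gradient (a generic sketch, not tied to any particular framework; the weight vector, gradient values, λ, and learning rate below are made up):

# Sketch: L2 regularization (weight decay) added to a gradient descent update.
# `data_gradient` stands in for the gradient of the unregularized error.
import numpy as np

w = np.array([0.5, -1.2, 2.0])                 # some weight vector
lam = 0.1                                      # regularization strength (lambda)
lr = 0.01                                      # learning rate

data_gradient = np.array([0.2, -0.4, 0.1])     # hypothetical gradient of the data error

# The regularized loss adds (lambda/2) * ||w||^2, whose gradient is lambda * w,
# so every step also decays each weight toward zero.
w = w - lr * (data_gradient + lam * w)
print(w)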
We can visualize the effects of L2 regularization using ConvNetJS. Similar to
Figures 2-10 and 2-11, we use a neural network with 2 inputs, a softmax output
of size 2, and a hidden layer with 20 neurons. We train the networks using
mini-batch gradient descent (batch size 10) and regularization strengths of
0.01, 0.1, and 1. The results can be seen in Figure 2-15.

Figure 2-15. A visualization of neural networks trained with regularization strengths of 0.01, 0.1,
and 1 (in that order)
Another common type of regularization is L1 regularization. Here, we add the
term λ|w| for every weight w in the neural network. The L1 regularization has
the intriguing property that it leads the weight vectors to become sparse
during optimization (i.e., very close to exactly zero). In other words, neurons
with L1 regularization end up using only a small subset of their most
important inputs and become quite resistant to noise in the inputs. In
comparison, weight vectors from L2 regularization are usually diffuse, small
numbers. L1 regularization is very useful when you want to understand
exactly which features are contributing to a decision. If this level of feature
analysis isn’t necessary, we prefer to use L2 regularization because it
empirically performs better.
Max norm constraints have a similar goal of attempting to restrict θ from becoming too large, but they do this more directly. Max norm constraints enforce an absolute upper bound on the magnitude of the incoming weight vector for every neuron and use projected gradient descent to enforce the constraint. In other words, any time a gradient descent step moves the incoming weight vector such that ||w||_2 > c, we project the vector back onto the ball (centered at the origin) with radius c. Typical values of c are 3 and 4. One of the nice properties is that the parameter vector cannot grow out of
control (even if the learning rates are too high) because the updates to the
weights are always bounded.
Dropout is a very different kind of method for preventing overfitting that has
become one of the most favored methods of preventing overfitting in deep
neural networks.7 While training, dropout is implemented by only keeping a
neuron active with some probability p (a hyperparameter), or setting it to
zero otherwise. Intuitively, this forces the network to be accurate even in the
absence of certain information. It prevents the network from becoming too
dependent on any one (or any small combination) of neurons. Expressed more
mathematically, it prevents overfitting by providing a way of approximately
combining exponentially many different neural network architectures
efficiently. The process of dropout is expressed pictorially in Figure 2-16.

Figure 2-16. Dropout sets each neuron in the network as inactive with some random probability
during each minibatch of training
Dropout is pretty intuitive to understand, but there are some important
intricacies to consider. First, we’d like the outputs of neurons during test time
to be equivalent to their expected outputs at training time. We could fix this
naïvely by scaling the output at test time. For example, if p = 0.5, neurons must halve their outputs at test time in order to have the same (expected) output they would have during training. This is easy to see because a neuron's output is set to 0 with probability 1 − p. This means that if a neuron's output prior to dropout was x, then after dropout, the expected output would be E[output] = px + (1 − p)·0 = px. This naïve implementation of dropout is undesirable, however, because it requires scaling of neuron outputs at test time. Test-time performance is extremely critical to model evaluation, so it's always preferable to use inverted dropout, where the scaling occurs at training time instead of at test time. In inverted dropout, any neuron whose activation hasn't been silenced has its output divided by p before the value is propagated to the next layer. With this fix, E[output] = p·(x/p) + (1 − p)·0 = x, and we can avoid arbitrarily scaling neuronal output at test time.
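Inverted dropout is only a few lines (a small numpy sketch of the idea described above; the activation values are made up):

# Sketch: inverted dropout applied to a layer's activations during training.
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(activations, p=0.5, training=True):
    # p is the probability of keeping a neuron active.
    if not training:
        return activations                     # no scaling needed at test time
    mask = rng.random(activations.shape) < p   # keep each neuron with probability p
    return activations * mask / p              # divide surviving outputs by p

a = np.array([0.2, 0.9, 0.5, 0.7])
print(inverted_dropout(a, p=0.5))              # expected value matches the original outputs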

Unit -4
What is TensorFlow?
TensorFlow is a popular framework of machine learning and deep learning. It is a
free and open-source library which is released on 9 November 2015 and developed
by Google Brain Team. It is entirely based on Python programming language and use
for numerical computation and data flow, which makes machine learning faster and
easier.

TensorFlow can train and run the deep neural networks for image recognition,
handwritten digit classification, recurrent neural network, word embedding, natural
language processing, video detection, and many more. TensorFlow is run on
multiple CPUs or GPUs and also mobile operating systems.

The word TensorFlow is made by two words, i.e., Tensor and Flow

1. Tensor is a multidimensional array


2. Flow is used to define the flow of data in operation.

TensorFlow is used to define the flow of data in operation on a multidimensional array


or Tensor.

Play Video
History of TensorFlow
Many years ago, deep learning started to exceed all other machine learning algorithms
when giving extensive data. Google has seen it could use these deep neural networks
to upgrade its services:

o Google search engine


o Gmail
o Photo

They build a framework called TensorFlow to permit researchers and developers to


work together in an AI model. Once it approved and scaled, it allows lots of people to
use it.

It was first released in 2015, while the first stable version came in 2017. It is an open-source platform under the Apache Open Source License. We can use it, modify it, and redistribute the revised version for free without paying anything to Google.

Components of TensorFlow
Tensor
The name TensorFlow is derived from its core framework, "Tensor." A tensor is an n-dimensional vector or matrix that can represent any type of data. All values in a tensor hold the same data type with a known shape. The shape of the data is the dimensionality of the matrix or array.
A tensor can be generated from the input data or from the result of a computation. In TensorFlow, all operations are conducted inside a graph. The graph is a set of computations that take place successively. Each operation is called an op node, and op nodes are connected to one another.

Graphs
TensorFlow makes use of a graph framework. The graph gathers and describes all the computations done during the training.

Advantages
o It is built to run on multiple CPUs or GPUs and on mobile operating systems.
o The portability of the graph allows the computations to be preserved for current or later use; the graph can be saved and executed in the future.
o All the computation in the graph is done by connecting tensors together.

Consider the following expression a= (b+c)*(c+2)

We can break this expression into the components given below:

d=b+c
e=c+2
a=d*e

Now, we can represent these operations as nodes in a graph.


Session
A session can execute operations from the graph. To feed the graph with the value
of a tensor, we need to open a session. Inside a session, we must run an operation to
create an output.
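
To make the graph and session concrete, here is a minimal sketch, assuming the TensorFlow 1.x API used throughout this unit, that builds the expression a = (b + c) * (c + 2) and evaluates it inside a session (the constant values 3 and 1 are arbitrary examples):

import tensorflow as tf

# Inputs of the graph, defined as constant tensors.
b = tf.constant(3.0, name="b")
c = tf.constant(1.0, name="c")

# Intermediate op nodes d = b + c and e = c + 2.
d = tf.add(b, c, name="d")
e = tf.add(c, 2.0, name="e")

# Final op node a = d * e.
a = tf.multiply(d, e, name="a")

# Nothing has been computed yet; the session executes the graph.
sess = tf.Session()
print(sess.run(a))   # (3 + 1) * (1 + 2) = 12.0
sess.close()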

Why is TensorFlow popular?


TensorFlow is a good library for everyone because it is accessible to all. The TensorFlow
library integrates different APIs to create deep learning architectures at scale, such as a CNN
(Convolutional Neural Network) or an RNN (Recurrent Neural Network).

TensorFlow is based on graph computation; it allows the developer to visualize the
construction of the neural network with TensorBoard. This tool helps debug our
programs. TensorFlow runs on CPUs (Central Processing Units) and GPUs (Graphical
Processing Units).

TensorFlow attracts the largest following on GitHub compared to the other deep
learning frameworks.

Use Cases/Applications of TensorFlow


TensorFlow provides strong functionality and services compared to other
popular deep learning frameworks. TensorFlow is used to create large-scale neural
networks with many layers.
It is mainly used for deep learning or machine learning problems such
as classification, perception, understanding, discovery, prediction,
and creation.

1. Voice/Sound Recognition
Voice and sound recognition applications are the best-known use cases of deep
learning. If the neural networks are fed proper input data, they are
capable of understanding audio signals.

For example:

Voice recognition is used in the Internet of Things, automotive, security, and UX/UI.

Sentiment analysis is mostly used in customer relationship management (CRM).

Flaw detection (engine noise) is mostly used in automotive and aviation.

Voice search is mostly used in customer relationship management (CRM).

2. Image Recognition
Image recognition was the first application to make deep learning and machine
learning popular. Telecom companies, social media platforms, and handset manufacturers
mostly use image recognition. It is also used for face recognition, image search, motion
detection, machine vision, and photo clustering.

For example, image recognition is used to recognize and identify people and objects
in images, and to understand the context and content of any image.

For object recognition, TensorFlow helps to classify and identify arbitrary objects within
larger images.

It is also used in engineering applications to identify shapes for modeling purposes
(3D reconstruction from 2D images) and by Facebook for photo tagging.

For example, a deep learning model can use TensorFlow to analyze thousands of photos of
cats; the algorithm learns to identify a cat by finding the general features of objects,
animals, or people.
3. Time Series
Deep learning uses time series algorithms to examine time series data and
extract meaningful statistics. For example, it has been used to predict the
stock market.

Recommendation is the most common use case for time series.
Amazon, Google, Facebook, and Netflix use deep learning for
suggestions. The deep learning algorithm analyzes customer activity and
compares it to that of millions of other users to determine what the customer may like to
purchase or watch.

For example, it can be used to recommend TV shows or movies that people like
based on the TV shows or movies we have already watched.

4. Video Detection
Deep learning algorithms are used for video detection: motion detection,
real-time threat detection in gaming, security, airports, and the UI/UX field.

For example, NASA is developing a deep learning network for object clustering of
asteroids and orbit classification, so that it can classify and predict NEOs (Near Earth
Objects).

5. Text-Based Applications
Text-based applications are another popular use of deep learning. Sentiment analysis,
social media, threat detection, and fraud detection are examples of text-based
applications.

For example, Google Translate supports over 100 languages.

Some companies currently using TensorFlow are Google, Airbnb, eBay, Intel,
Dropbox, DeepMind, Airbus, CEVA, Snapchat, SAP, Uber, Twitter, Coca-Cola, and IBM.

Features of TensorFlow
TensorFlow has an interactive, multiplatform programming interface which is scalable
and reliable compared to the other deep learning libraries that are available.

These features help explain the popularity of TensorFlow.


1. Responsive Construct
We can visualize each part of the graph, which is not an option while
using NumPy or Scikit-learn. Developing a deep learning application requires only two
or three components and a programming language.

2. Flexible
It is one of the most important TensorFlow features in terms of operability. TensorFlow is
modular, and the parts of it that we want can be made standalone.

3. Easily Trainable
It is easily trainable on CPU and on GPU in distributed computing.

4. Parallel Neural Network Training


TensorFlow offers pipelining, in the sense that we can train multiple neural
networks on various GPUs, which makes the models very efficient on large-scale
systems.
5. Large Community
Google developed it, and there is already a large team of software engineers who
work on stability improvements continuously.

6. Open Source
The best thing about this machine learning library is that it is open source, so anyone
with internet connectivity can use it. People can manipulate the
library and come up with a fantastic variety of useful products. It has become
a DIY community with a massive forum, both for people getting started with it
and for those who find it hard to use.

7. Feature Columns
TensorFlow has feature columns, which can be thought of as intermediaries between
raw data and estimators, bridging the input data with our model.

The figure below describes how a feature column is implemented.


8. Availability of Statistical Distributions
This library provides distribution functions, including Bernoulli, Beta, Chi2, Uniform,
and Gamma, which are essential when considering probabilistic approaches
such as Bayesian models.

9. Layered Components
TensorFlow produces layered operations of weights and biases through functions such
as tf.contrib.layers, and it also provides batch normalization, convolution layers, and
dropout layers. tf.contrib.layers.optimizers includes optimizers such
as Adagrad, SGD, and Momentum, which are often used to solve optimization problems
in numerical analysis.
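
As a rough illustration, the following hedged sketch stacks such layered components using the TensorFlow 1.x contrib API (the layer sizes and keep_prob value are arbitrary choices for the example):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])

# A fully connected layer; its weight and bias variables are created for us.
hidden = tf.contrib.layers.fully_connected(x, 256,
                                           activation_fn=tf.nn.relu)

# A dropout layer from the same module; is_training switches it off at test time.
hidden = tf.contrib.layers.dropout(hidden, keep_prob=0.5,
                                   is_training=True)

# A linear output layer producing 10 logits.
logits = tf.contrib.layers.fully_connected(hidden, 10,
                                           activation_fn=None)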

10. Visualizer (With TensorBoard)


We can inspect a different representation of a model and make the changes necessary
while debugging it with the help of TensorBoard.

11. Event Logger (With TensorBoard)


It is just like UNIX, where we use tail -f to monitor the output of tasks at the command
line. It logs events and summaries from the graph and displays them with
TensorBoard.
How Does TensorFlow Compare to
Alternatives?

In addition to TensorFlow, there are a number of libraries that have popped


up over the years for building deep neural networks. These include Theano,
Torch, Caffe, Neon, and Keras.2 Based on two simple criteria (expressiveness
and presence of an active developer community), we ultimately narrowed the
field of options to TensorFlow, Theano (built by the LISA Lab out of the
University of Montreal), and Torch (largely maintained by Facebook AI
Research).
All three of these options boast a hefty developer community, enable users to
manipulate tensors with few restrictions, and feature automatic
differentiation (which enables users to train deep models without having to
crank out the backpropagation algorithms for arbitrary architectures, as we
had to do in the previous chapter). One of the drawbacks of Torch, however, is
that the framework is written in Lua. Lua is a scripting language much like
Python, but is less commonly used outside the deep learning community. We
wanted to avoid forcing newcomers to learn a whole new language to build
deep learning models, so we further narrowed our options to TensorFlow and
Theano.

Between these two options, the decision was difficult (and in fact, an early
version of this chapter was first written using Theano), but we chose
TensorFlow in the end for several subtle reasons. First, Theano has an
additional “graph compilation” step that took significant amounts of time
while setting up certain kinds of deep learning architectures. While small in
comparison to train time, this compilation phase proved frustrating while
writing and debugging new code. Second, TensorFlow has a much cleaner
interface as compared to Theano. Many classes of models can be expressed in
significantly fewer lines without sacrificing the expressiveness of the
framework. Finally, TensorFlow was built with production use in mind,
whereas Theano was designed by researchers almost purely for research
purposes. As a result, TensorFlow has many features out of the box and in the
works that make it a better choice for real systems (the ability to run in mobile
environments, easily build models that span multiple GPUs on a single
machine, and train large-scale networks in a distributed fashion). Although
familiarity with Theano and Torch can be extremely helpful while navigating
open source examples, overviews of these frameworks are beyond the scope
of this book.
Installing TensorFlow
Installing TensorFlow in your local development environment is straightforward if you aren’t
planning on modifying the TensorFlow source code. We use a Python package installation
manager called Pip. If you don’t already have Pip installed on your computer, use the
following commands in your terminal:

# Ubuntu/Linux 64-bit
$ sudo apt-get install python-pip python-dev

# Mac OS X
$ sudo easy_install pip
Once we have Pip (version 8.1 or later) installed on our computers, we can use the following
commands to install TensorFlow. Note the difference in Pip package naming if we would like
to install a GPU-enabled version of TensorFlow (which we strongly recommend):

$ pip install --upgrade tensorflow # for Python 2.7


$ pip3 install --upgrade tensorflow # for Python 3.n
$ pip install --upgrade tensorflow-gpu # for Python 2.7
# and GPU
$ pip3 install --upgrade tensorflow-gpu # for Python 3.n
# and GPU
If you installed the GPU-enabled version of TensorFlow, you’ll also have to take a couple of
additional steps. Specifically, you’ll have to download the CUDA Toolkit 8.03 and the latest
CUDNN Toolkit.4 Install the CUDA Toolkit 8.0 into /usr/local/cuda. Then uncompress and
copy the CUDNN files into the toolkit directory. Assuming the toolkit is installed
in /usr/local/cuda, you can follow these instructions to accomplish this:

$ tar xvzf cudnn-version-os.tgz


$ sudo cp cudnn-version-os/cudnn.h /usr/local/cuda/include
$ sudo cp cudnn-version-os/libcudnn* /usr/local/cuda/lib64
You will also need to set the LD_LIBRARY_PATH and CUDA_HOME environment variables to
give TensorFlow access to your CUDA installation. Consider adding the commands below to
your ~/.bash_profile. These assume your CUDA installation is in /usr/local/cuda:
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda
Note that to see these changes appropriately reflected in your current terminal session,
you’ll have to run:

$ source ~/.bash_profile
You should now be able to run TensorFlow from your Python shell of choice. In this tutorial,
we choose to use IPython. Using Pip, installing IPython only requires the following
command:

$ pip install ipython


Then we can test that our installation of TensorFlow functions as expected:

$ ipython

...

In [1]: import tensorflow as tf

In [2]: deep_learning = tf.constant('Deep Learning')

In [3]: session = tf.Session()

In [4]: session.run(deep_learning)
Out[4]: 'Deep Learning'

In [5]: a = tf.constant(2)

In [6]: b = tf.constant(3)
In [7]: multiply = tf.mul(a, b)

In [8]: session.run(multiply)
Out[8]: 6
Additional, up-to-date instructions and details about installation can be found on the
TensorFlow website.5

Creating and Manipulating TensorFlow Variables


When we build a deep learning model in TensorFlow, we use variables to represent the
parameters of the model. TensorFlow variables are in-memory buffers that contain tensors;
but unlike normal tensors that are only instantiated when a graph is run and that are
immediately wiped clean afterward, variables survive across multiple executions of a graph.
As a result, TensorFlow variables have the following three properties:

Variables must be explicitly initialized before a graph is used for the first time.
We can use gradient methods to modify variables after each iteration as we search for a
model’s optimal parameter settings.
We can save the values stored in variables to disk and restore them for later use.
These three properties are what make TensorFlow especially useful for building machine
learning models.

Creating a variable is simple, and TensorFlow provides mechanics that allow us to initialize
variables in several ways. Let’s start off by initializing a variable that describes the weights
connecting neurons between two layers of a feed-forward neural network:

weights = tf.Variable(tf.random_normal([300, 200], stddev=0.5),


name="weights")
Here we pass two arguments to tf.Variable.6 The first, tf.random_normal,7 is an operation
that produces a tensor initialized using a normal distribution with standard deviation 0.5.
We’ve specified that this tensor is of size 300 x 200, implying that the weights connect a
layer with 300 neurons to a layer with 200 neurons. We’ve also passed a name to our call to
tf.Variable. The name is a unique identifier that allows us to refer to the appropriate node
in the computation graph. In this case, weights is meant to be trainable; or in other words,
we will automatically compute and apply gradients to weights. If weights is not meant to be
trainable, we may pass an optional flag when we call tf.Variable:

weights = tf.Variable(tf.random_normal([300, 200], stddev=0.5),


name="weights", trainable=False)
In addition to using tf.random_normal, there are several other methods to initialize a
TensorFlow variable:

# Common tensors from the TensorFlow API docs

tf.zeros(shape, dtype=tf.float32, name=None)


tf.ones(shape, dtype=tf.float32, name=None)
tf.random_normal(shape, mean=0.0, stddev=1.0,
dtype=tf.float32, seed=None,
name=None)
tf.truncated_normal(shape, mean=0.0, stddev=1.0,
dtype=tf.float32, seed=None,
name=None)
tf.random_uniform(shape, minval=0, maxval=None,
dtype=tf.float32, seed=None,
name=None)
When we call tf.Variable, three operations are added to the computation graph:

The operation producing the tensor we use to initialize our variable


The tf.assign operation, which is responsible for filling the variable with the initializing
tensor prior to the variable’s use
The variable operation, which holds the current value of the variable
This can be visualized as shown in Figure 3-1.
Figure 3-1. Three operations are added to the graph when instantiating a TensorFlow
variable. In this example, we instantiate the variable weights using a random normal
initializer.
As we mentioned previously in the three operations, before we use any TensorFlow
variable, the tf.assign8 operation must be run so that the variable is appropriately initialized
with the desired value. We can do this by running tf.initialize_all_variables(),9 which will
trigger all of the tf.assign operations in our graph. We can also selectively initialize only
certain variables in our computational graph using the tf.initialize_variables(var1, var2,
...).10 We’ll describe this in more detail when we discuss sessions in TensorFlow.
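
A minimal sketch of both forms of initialization, assuming the TensorFlow 1.x API described in this unit (the variable names are our own):

import tensorflow as tf

weights = tf.Variable(tf.random_normal([300, 200], stddev=0.5),
                      name="weights")
biases = tf.Variable(tf.zeros([200]), name="biases")

sess = tf.Session()

# Trigger every tf.assign initializer in the graph at once ...
sess.run(tf.initialize_all_variables())

# ... or selectively initialize only some of the variables.
sess.run(tf.initialize_variables([biases]))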

TensorFlow Operations
We’ve already talked a little bit about operations in the context of variable
initialization, but these only make up a small subset of the universe of
operations available in TensorFlow. On a high level, TensorFlow operations
represent abstract transformations that are applied to tensors in the
computation graph. Operations may have attributes that may be supplied a
priori or are inferred at runtime. For example, an attribute may serve to
describe the expected types of input (adding tensors of type float32 versus
int32). Just as variables are named, operations may also be supplied with an
optional name attribute for easy reference into the computation graph.
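
For instance, here is a small sketch of a named operation whose dtype attribute is inferred from its inputs (the tensors and the name are purely illustrative):

import tensorflow as tf

a = tf.constant([1.0, 2.0])
b = tf.constant([3.0, 4.0])

# The float32 dtype attribute is inferred from the inputs, and the optional
# name lets us locate this op node in the computation graph later.
total = tf.add(a, b, name="my_addition")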

An operation consists of one or more kernels, which represent device-specific


implementations. For example, an operation may have separate CPU and GPU
kernels because it can be more efficiently expressed on a GPU. This is the case
for many TensorFlow operations on matrices.

To provide an overview of the types of operations available, we include


Table 3-1 from the original TensorFlow white paper detailing the various
categories of operations in TensorFlow.11

Table 3-1. A summary table of TensorFlow operations


Category: Examples
Element-wise mathematical operations: Add, Sub, Mul, Div, Exp, Log, Greater, Less, Equal, ...
Array operations: Concat, Slice, Split, Constant, Rank, Shape, Shuffle, ...
Matrix operations: MatMul, MatrixInverse, MatrixDeterminant, ...
Stateful operations: Variable, Assign, AssignAdd, ...
Neural network building blocks: SoftMax, Sigmoid, ReLU, Convolution2D, MaxPool, ...
Checkpointing operations: Save, Restore
Queue and synchronization operations: Enqueue, Dequeue, MutexAcquire, MutexRelease, ...
Control flow operations: Merge, Switch, Enter, Leave, NextIteration

Placeholder Tensors
A placeholder is a variable in TensorFlow to which data will be assigned
at some later point. It enables us to create operations without
requiring the data up front. Data is fed into the placeholder when the session
starts and is run. We can feed data into TensorFlow graphs using
placeholders.
Syntax: tf.compat.v1.placeholder(dtype, shape=None, name=None)
Parameters:
• dtype: the datatype of the elements in the tensor that will be fed.
• shape: None by default. The shape of the tensor that will be fed; an
optional parameter. A tensor of any shape can be fed if the shape isn't
specified.
• name: None by default. The operation's name; an optional parameter.
Returns:
A Tensor that can be used as a handle for feeding a value.
Explanation (see the sketch below):
• Eager execution is disabled so that the graph-style code does not raise errors.
• A placeholder is created using the tf.placeholder() method with
dtype tf.float32; None means we didn't specify any size.
• The operation is created before feeding in data.
• The operation adds 10 to the tensor.
• A session is created and started using tf.Session().
• Session.run takes the operation we created and the data to be fed as
parameters, and it returns the result.
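
A minimal sketch matching the explanation above, using the tf.compat.v1 API named in the syntax (the placeholder name and the values fed in are our own):

import tensorflow as tf

# Disable eager execution so the graph-and-session style below works.
tf.compat.v1.disable_eager_execution()

# A float32 placeholder; shape=None means we did not specify any size.
x = tf.compat.v1.placeholder(tf.float32, shape=None, name="x")

# The operation is created before any data is fed: it adds 10 to the tensor.
y = x + 10

with tf.compat.v1.Session() as sess:
    # The data is supplied only when the session runs the operation.
    result = sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]})
    print(result)   # [11. 12. 13.]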

Sessions in TensorFlow
A TensorFlow program interacts with a computation graph using a session.13
The TensorFlow session is responsible for building the initial graph, and can be
used to initialize all variables appropriately and to run the computational
graph. To explore each of these pieces, let’s consider the following simple
Python script:

import tensorflow as tf
from read_data import get_minibatch

x = tf.placeholder(tf.float32, name="x", shape=[None, 784])


W = tf.Variable(tf.random_uniform([784, 10], -1, 1), name="W")
b = tf.Variable(tf.zeros([10]), name="biases")
output = tf.matmul(x, W) + b

init_op = tf.initialize_all_variables()

sess = tf.Session()
sess.run(init_op)
feed_dict = {"x" : get_minibatch()}
sess.run(output, feed_dict=feed_dict)
The first four lines after the import statement describe the computational
graph that is built by the session when it is finally instantiated. The graph (sans
variable initialization operations) is depicted in Figure 3-2. We then initialize
the variables as required by using the session variable to run the initialization
operation in sess.run(init_op). Finally, we can run the subgraph by calling
sess.run again, but this time we pass in the tensors (or list of tensors) we want
to compute along with a feed_dict that fills the placeholders with the
necessary input data.
Finally, the sess.run interface can also be used to train networks. We will
explore this in further detail when we use TensorFlow to train our first
machine learning model on MNIST. But how exactly does a single line of code
(sess.run) accomplish such a wide variety of functions? The answer lies in the
powerful expressivity of the underlying computational graph. All of these
functionalities are represented as TensorFlow operations that can be passed as
arguments to sess.run. All sess.run needs to do is traverse down the
computational graph to identify all of the dependencies that compose the
relevant subgraph, ensure that all of the placeholder variables that belong to
the identified subgraph are filled using the feed_dict, and then traverse back
up the subgraph (executing all of the intermediate operations) to evaluate the
original arguments.

Now that we have a comprehensive understanding of sessions and how to run


them, we’ll explore two more major concepts in building and maintaining
computational graphs.

Navigating Variable Scopes and Sharing Variables


Although we won’t run into this problem just yet, building complex models
often requires reusing and sharing large sets of variables that we’ll want to
instantiate together in one place. Unfortunately, trying to enforce modularity
and readability can result in unintended results if we aren’t careful. Let’s
consider the following example:

def my_network(input):
W_1 = tf.Variable(tf.random_uniform([784, 100], -1, 1),
name="W_1")
b_1 = tf.Variable(tf.zeros([100]), name="biases_1")
output_1 = tf.matmul(input, W_1) + b_1

W_2 = tf.Variable(tf.random_uniform([100, 50], -1, 1),


name="W_2")
b_2 = tf.Variable(tf.zeros([50]), name="biases_2")
output_2 = tf.matmul(output_1, W_2) + b_2

W_3 = tf.Variable(tf.random_uniform([50, 10], -1, 1),


name="W_3")
b_3 = tf.Variable(tf.zeros([10]), name="biases_3")
output_3 = tf.matmul(output_2, W_3) + b_3

# printing names
print "Printing names of weight parameters"
print W_1.name, W_2.name, W_3.name
print "Printing names of bias parameters"
print b_1.name, b_2.name, b_3.name

return output_3
This network setup consists of six variables describing three layers. As a result,
if we wanted to use this network multiple times, we’d prefer to encapsulate it
into a compact function like my_network, which we can call multiple
times. However, when we try to use this network on two different inputs, we
get something unexpected:

In [1]: i_1 = tf.placeholder(tf.float32, [1000, 784],


name="i_1")

In [2]: my_network(i_1)
Printing names of weight parameters
W_1:0 W_2:0 W_3:0
Printing names of bias parameters
biases_1:0 biases_2:0 biases_3:0
Out[2]: <tensorflow.python.framework.ops.Tensor ...>

In [1]: i_2 = tf.placeholder(tf.float32, [1000, 784],


name="i_2")

In [2]: my_network(i_2)
Printing names of weight parameters
W_1_1:0 W_2_1:0 W_3_1:0
Printing names of bias parameters
biases_1_1:0 biases_2_1:0 biases_3_1:0
Out[2]: <tensorflow.python.framework.ops.Tensor ...>
If we observe closely, our second call to my_network doesn’t use the same
variables as the first call (in fact, the names are different!). Instead, we’ve
created a second set of variables! In many cases, we don’t want to create a
copy, but rather reuse the model and its variables. It turns out, that in this
case, we shouldn’t be using tf.Variable. Instead, we should be using a more
advanced naming scheme that takes advantage of TensorFlow’s variable
scoping.

TensorFlow’s variable scoping mechanisms are largely controlled by two


functions:

tf.get_variable(<name>, <shape>, <initializer>)


Checks if a variable with this name exists, retrieves the variable if it does, or
creates it using the shape and initializer if it doesn’t.14
tf.variable_scope(<scope_name>)
Manages the namespace and determines the scope in which tf.get_variable
operates.15

Let’s try to rewrite my_network in a cleaner fashion using TensorFlow variable


scoping. The new names of our variables are namespaced as "layer1/W",
"layer2/b", "layer2/W", and so forth:

def layer(input, weight_shape, bias_shape):


weight_init = tf.random_uniform_initializer(minval=-1,
maxval=1)
bias_init = tf.constant_initializer(value=0)
W = tf.get_variable("W", weight_shape,
initializer=weight_init)
b = tf.get_variable("b", bias_shape,
initializer=bias_init)
return tf.matmul(input, W) + b

def my_network(input):
with tf.variable_scope("layer_1"):
output_1 = layer(input, [784, 100], [100])

with tf.variable_scope("layer_2"):
output_2 = layer(output_1, [100, 50], [50])

with tf.variable_scope("layer_3"):
output_3 = layer(output_2, [50, 10], [10])
return output_3

Now let’s try to call my_network twice, just like we did in the preceding code
block:

In [1]: i_1 = tf.placeholder(tf.float32, [1000, 784],


name="i_1")

In [2]: my_network(i_1)
Out[2]: <tensorflow.python.framework.ops.Tensor ...>

In [1]: i_2 = tf.placeholder(tf.float32, [1000, 784],


name="i_2")

In [2]: my_network(i_2)
ValueError: Over-sharing: Variable layer_1/W already exists...
Unlike tf.Variable, the tf.get_variable command checks that a variable of the
given name hasn’t already been instantiated. By default, sharing is not allowed
(just to be safe!), but if we want to enable sharing within a variable scope, we
can say so explicitly:

with tf.variable_scope("shared_variables") as scope:


i_1 = tf.placeholder(tf.float32, [1000, 784], name="i_1")
my_network(i_1)
scope.reuse_variables()
i_2 = tf.placeholder(tf.float32, [1000, 784], name="i_2")
my_network(i_2)
This allows us to retain modularity while still allowing variable sharing. And as
a nice byproduct, our naming scheme is cleaner as well.

Managing Models over the CPU and GPU


TensorFlow allows us to utilize multiple computing devices, if we so desire, to
build and train our models. Supported devices are represented by string IDs
and normally consist of the following:

"/cpu:0"
The CPU of our machine.

"/gpu:0"
The first GPU of our machine, if it has one.

"/gpu:1"
The second GPU of our machine, if it has one.

When a TensorFlow operation has both CPU and GPU kernels, and GPU use is
enabled, TensorFlow will automatically opt to use the GPU implementation. To
inspect which devices are used by the computational graph, we can initialize
our TensorFlow session with the log_device_placement set to True:

sess = tf.Session(config=tf.ConfigProto(
log_device_placement=True))
If we desire to use a specific device, we may do so by using with tf.device16 to
select the appropriate device. If the chosen device is not available, however,
an error will be thrown. If we would like TensorFlow to find another available
device if the chosen device does not exist, we can pass
the allow_soft_placement flag to the session configuration.17
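
A minimal sketch of device selection with soft placement, assuming the TensorFlow 1.x API (the choice of "/gpu:1" is arbitrary):

import tensorflow as tf

# Pin the construction of these ops to the second GPU, if one exists.
with tf.device("/gpu:1"):
    a = tf.constant([1.0, 2.0], name="a")
    b = tf.constant([3.0, 4.0], name="b")
    c = tf.add(a, b, name="c")

# allow_soft_placement lets TensorFlow fall back to another device (such as
# the CPU) if "/gpu:1" is not available; log_device_placement prints which
# device each operation was actually assigned to.
sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=True))
print(sess.run(c))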

Specifying the Logistic Regression Model in


TensorFlow
Now that we’ve developed all of the basic concepts of TensorFlow, let’s build a
simple model to tackle the MNIST dataset. As you may recall, our goal is to
identify handwritten digits from 28 x 28 black-and-white images. The first
network that we’ll build implements a simple machine learning algorithm
known as logistic regression.18

On a high level, logistic regression is a method by which we can calculate the


probability that an input belongs to one of the target classes. In our case, we’ll
compute the probability that a given input image is a 0, 1, ..., or 9. Our model
uses a matrix of weights W and a vector of biases b to map each input image to such a
probability distribution.
We’ll build the logistic regression model in four phases:

inference: produces a probability distribution over the output classes given a minibatch
loss: computes the value of the error function (in this case, the cross-entropy loss)
training: responsible for computing the gradients of the model’s parameters and updating the model
evaluate: determines the effectiveness of a model
Given a minibatch, which consists of 784-dimensional vectors representing
MNIST images, we can represent logistic regression by taking the softmax of
the input multiplied with a matrix representing the weights connecting the
input and output layer. Each row of the output tensor represents the
probability distribution over output classes for each corresponding data
sample in the minibatch.
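The functions named in these four phases are used by the training script later in this section; the bodies below are a hedged reconstruction of what they might look like for logistic regression, not necessarily the original implementations:

import tensorflow as tf

def inference(x):
    # Softmax of x*W + b yields a probability distribution over the ten
    # digit classes for every example in the minibatch.
    W = tf.Variable(tf.zeros([784, 10]), name="W")
    b = tf.Variable(tf.zeros([10]), name="b")
    return tf.nn.softmax(tf.matmul(x, W) + b)

def loss(output, y):
    # Cross-entropy between the predicted distribution and the one-hot labels.
    xentropy = -tf.reduce_sum(y * tf.log(output), reduction_indices=1)
    return tf.reduce_mean(xentropy)

def evaluate(output, y):
    # Fraction of examples whose most probable class matches the label.
    correct = tf.equal(tf.argmax(output, 1), tf.argmax(y, 1))
    return tf.reduce_mean(tf.cast(correct, tf.float32))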
Logging and Training the Logistic Regression Model
Now that we have all of the major pieces, we begin to stitch them together. In
order to log important information as we train the model, we log several
summary statistics. For example, we use the tf.scalar_summary19
and tf.histogram_summary20 commands to log the cost for each minibatch,
validation error, and the distribution of parameters. For reference, we’ll
demonstrate the scalar summary statistic for the cost function:

def training(cost, global_step):


tf.scalar_summary("cost", cost)
optimizer = tf.train.GradientDescentOptimizer(
learning_rate)
train_op = optimizer.minimize(cost,
global_step=global_step)
return train_op
Every epoch, we run the tf.merge_all_summaries21 in order to collect all
summary statistics we’ve logged and use a tf.train.SummaryWriter to write the
log to disk. In the next section, we’ll describe how we can visualize these
logs with the built-in TensorBoard tool.

In addition to saving summary statistics, we also save the model parameters


using the tf.train.Saver model saver. By default, the saver maintains the latest
five checkpoints, and we can restore them for future use.

Putting it all together, we obtain the following Python script:

# Parameters
learning_rate = 0.01
training_epochs = 1000
batch_size = 100
display_step = 1

with tf.Graph().as_default():

# mnist data image of shape 28*28=784


x = tf.placeholder("float", [None, 784])

# 0-9 digits recognition => 10 classes


y = tf.placeholder("float", [None, 10])

output = inference(x)

cost = loss(output, y)

global_step = tf.Variable(0, name='global_step',


trainable=False)

train_op = training(cost, global_step)

eval_op = evaluate(output, y)

summary_op = tf.merge_all_summaries()

saver = tf.train.Saver()
sess = tf.Session()

summary_writer = tf.train.SummaryWriter("logistic_logs/",
graph_def=sess.graph_def)

init_op = tf.initialize_all_variables()

sess.run(init_op)

# Training cycle
for epoch in range(training_epochs):

avg_cost = 0.
total_batch = int(mnist.train.num_examples/batch_size)
# Loop over all batches
for i in range(total_batch):
mbatch_x, mbatch_y = mnist.train.next_batch(
batch_size)
# Fit training using batch data
feed_dict = {x : mbatch_x, y : mbatch_y}
sess.run(train_op, feed_dict=feed_dict)
# Compute average loss
minibatch_cost = sess.run(cost,
feed_dict=feed_dict)
avg_cost += minibatch_cost/total_batch
# Display logs per epoch step
if epoch % display_step == 0:
val_feed_dict = {
x : mnist.validation.images,
y : mnist.validation.labels
}
accuracy = sess.run(eval_op,
feed_dict=val_feed_dict)

print "Validation Error:", (1 - accuracy)

summary_str = sess.run(summary_op,
feed_dict=feed_dict)
summary_writer.add_summary(summary_str,
sess.run(global_step))

saver.save(sess, "logistic_logs/model-checkpoint",
global_step=global_step)

print "Optimization Finished!"

test_feed_dict = {
x : mnist.test.images,
y : mnist.test.labels
}

accuracy = sess.run(eval_op, feed_dict=test_feed_dict)

print "Test Accuracy:", accuracy


Running the script gives us a final accuracy of 91.9% on the test set within 100
epochs of training. This isn’t bad, but we’ll try to do better in the final section
of this chapter, when we approach the problem with a feed-forward neural
network.

Leveraging TensorBoard to Visualize Computation Graphs and


Learning
Once we set up the logging of summary statistics as described in the previous
section, we are ready to visualize the data we’ve collected. TensorFlow comes
with a visualization tool called TensorBoard, which provides an easy-to-use
interface for navigating through our summary statistics.22 Launching
TensorBoard is as easy as running:

tensorboard --logdir=<absolute_path_to_log_dir>
The logdir flag should be set to the directory where our
tf.train.SummaryWriter was configured to serialize our summary statistics. Be
sure to pass an absolute path (and not a relative path), because otherwise
TensorBoard may not be able to find our logs. If we successfully launch
TensorBoard, it should be serving our data at http://localhost:6006/, which we
can navigate to in our browser.
Building a Multilayer Model for MNIST
in TensorFlow

Using a logistic regression model, we were able to achieve an 8.1% error rate
on the MNIST dataset. This may seem impressive, but it isn’t particularly
useful for high-value practical applications. For example, if we were using our
system to read personal checks written out for 4-digit amounts ($1,000 to
$9,999), we would make errors on nearly 30% of checks! To create an MNIST
digit reader that’s more practical, let’s try to build a feed-forward network to
tackle the MNIST challenge.

We construct a feed-forward model with two hidden layers, each with 256
ReLU neurons, as shown in Figure 3-7.
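
Figure 3-7 itself is not reproduced here; the following hedged sketch shows one way to express that architecture by reusing the layer helper defined earlier in the variable-scoping section (the placement of the ReLU nonlinearity is our reading of the description):

import tensorflow as tf

def layer(input, weight_shape, bias_shape):
    # Same helper as in the variable-scoping section above.
    weight_init = tf.random_uniform_initializer(minval=-1, maxval=1)
    bias_init = tf.constant_initializer(value=0)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    return tf.matmul(input, W) + b

def inference(x):
    # Two hidden layers of 256 ReLU neurons, followed by a 10-way output layer.
    with tf.variable_scope("hidden_1"):
        hidden_1 = tf.nn.relu(layer(x, [784, 256], [256]))
    with tf.variable_scope("hidden_2"):
        hidden_2 = tf.nn.relu(layer(hidden_1, [256, 256], [256]))
    with tf.variable_scope("output"):
        output = layer(hidden_2, [256, 10], [10])
    return output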

MNIST Dataset in CNN


The MNIST (Modified National Institute of Standards and Technology) database is
a large database of handwritten digits that is used for training various
image processing systems. The dataset is also widely used for training and testing in the
field of machine learning. The set of images in the MNIST database is a combination
of two of NIST's databases: Special Database 1 and Special Database 3.

The MNIST dataset has 60,000 training images and 10,000 testing images.

The MNIST dataset is available online, and it is essentially a database of various
handwritten digits. The MNIST dataset has a large amount of data and is commonly
used to demonstrate the real power of deep neural networks. Our brain and eyes work
together to recognize any numbered image. Our mind is a potent tool, and it's capable
of categorizing any image quickly. There are many shapes of a number, and our
mind can easily recognize these shapes and determine what number it is, but the same
task is not simple for a computer to complete. The practical way to do this is
to use a deep neural network, which allows us to train a computer to classify
handwritten digits effectively.
So far, we have only dealt with data that contains simple data points on a Cartesian
coordinate system, and until now we have worked with binary class
datasets. The sigmoid activation function is quite useful for classifying binary datasets,
as it is effective at arranging values between 0 and 1. However, the sigmoid function is not
effective for multiclass datasets, and for this purpose we use the softmax activation
function, which is capable of dealing with them.


The MNIST dataset is a multiclass dataset consisting of 10 classes, in which we
classify the digits from 0 to 9. The major difference between the datasets we have
used before and the MNIST dataset is the way in which MNIST data is fed into
a neural network.

In the perceptron and linear regression models, each data point was
defined by a simple x and y coordinate. This means that the input layer needs two
nodes to input a single data point.

In the MNIST dataset, a single data point comes in the form of an image. The images
in the MNIST dataset are 28x28 pixels: 28 pixels along the horizontal axis and 28 pixels
along the vertical axis. This means that a single image from the MNIST database has a
total of 784 pixels to be analyzed, so the input layer of our neural network has 784 nodes,
one for each pixel of an image.
Here, we will see how to create a model that recognizes handwritten digits by looking at
each pixel in the image. We then use TensorFlow to train the model to predict the digit by
showing it thousands of examples that are already labeled. Finally, we will check the
model's accuracy with a test dataset.

The MNIST dataset in TensorFlow, containing information on handwritten digits, is
split into three parts:

o Training data (mnist.train) - 55,000 data points
o Validation data (mnist.validation) - 5,000 data points
o Test data (mnist.test) - 10,000 data points

Before we start, it is important to note that every data point has two parts: an
image (x) and a corresponding label (y) describing the actual image. Each image is
a 28x28 array, i.e., 784 numbers, and the label of the image is a number between 0 and 9
corresponding to the digit it depicts. To download and use the MNIST dataset,
use the following commands:

1. from tensorflow.examples.tutorials.mnist import input_data


2. mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
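
Once the dataset is downloaded, a quick sanity check can be performed with the same input_data helper (the batch size of 100 below is an arbitrary choice):

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Pull one minibatch of 100 images and their one-hot labels.
batch_xs, batch_ys = mnist.train.next_batch(100)

print(batch_xs.shape)                  # (100, 784): each image flattened to 784 pixels
print(batch_ys.shape)                  # (100, 10): one-hot labels for digits 0-9
print(mnist.validation.num_examples)   # 5000
print(mnist.test.num_examples)         # 10000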
