MACHINE LEARNING & DEEP LEARNING
BY:
SUMA KOMALI
2nd MCA
What is Machine Learning
In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which
work on our instructions. But can a machine also learn from experiences or past data
like a human does? So here comes the role of Machine Learning.
A machine has the ability to learn if it can improve its performance by gaining
more data.
We can train machine learning algorithms by providing them with huge amounts of data
and letting them explore the data, construct models, and predict the required output
automatically. The performance of a machine learning algorithm depends on the
amount of data, and it can be measured by the cost function. With the help of
machine learning, we can save both time and money.
Machine learning can be broadly divided into the following three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
Supervised learning is a type of machine learning method in which we provide sample
labeled data to the machine learning system in order to train it, and on that basis, it
predicts the output.
The system creates a model using labeled data to understand the datasets and learn
about each example. Once training and processing are done, we test the model by
providing sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised
learning is based on supervision, just as a student learns under the supervision of a
teacher. An example of supervised learning is spam filtering. Supervised learning can
be further divided into two categories of problems, with a short code sketch after the list:
o Classification
o Regression
o Classification
o Regression
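A minimal sketch of the supervised workflow just described, assuming scikit-learn is available (the Iris dataset and logistic regression are illustrative choices, not prescribed by the text): labeled data is split, a model is trained, and held-out samples test whether it predicts the correct output.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                 # sample labeled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=200)          # a simple classification model
model.fit(X_train, y_train)                       # learn the input-to-output mapping
print("test accuracy:", model.score(X_test, y_test))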
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The machine is trained with a set of data that has not been labeled, classified, or
categorized, and the algorithm must act on that data without any supervision. The goal
of unsupervised learning is to restructure the input data into new features or to group
objects with similar patterns. It is further divided into two categories of algorithms; a
short clustering sketch follows the list:
o Clustering
o Association
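A minimal clustering sketch, assuming scikit-learn (the synthetic blob data and k = 3 are illustrative assumptions): no labels are given, and KMeans groups the points into clusters of similar patterns.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # unlabeled points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)                    # cluster assignment for each point
print(labels[:10], kmeans.cluster_centers_.shape)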
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning
agent gets a reward for each right action and gets a penalty for each wrong action.
The agent learns automatically from this feedback and improves its performance. In
reinforcement learning, the agent interacts with the environment and explores it. The
goal of the agent is to collect the maximum reward, and so it keeps improving its
performance.
A robotic dog that automatically learns the movement of its arms is an example of
reinforcement learning.
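A toy sketch of the reward/penalty idea, written as a single tabular Q-learning update (the state and action counts, learning rate, and rewards are illustrative assumptions, not from the text):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))               # the agent's value estimates
alpha, gamma = 0.1, 0.9                           # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # reward is positive for a right action and negative (a penalty) for a wrong one
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

q_update(state=0, action=1, reward=+1, next_state=2)   # rewarded action
q_update(state=2, action=0, reward=-1, next_state=3)   # penalized action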
Batch Learning
In batch learning, the system is incapable of learning incrementally: it must be
trained using all the available data. This generally takes a lot of time and
computing resources, so it is typically done offline. First the system is trained, and
then it is launched into production and runs without learning anymore; it just applies
what it has learned. This is called offline learning.
If you want a batch learning system to know about new data (such as a new type of
spam), you have to train a new version of the system from scratch on the full
dataset (both new data and old data), then stop the old system and replace it with the
new one.
This solution is simple and often works fine, but training on the full set of data can
take many hours, so you would typically train a new system only every 24 hours or
even just weekly. If your system needs to adapt to rapidly changing data, then you
need a more reactive solution.
Also, training on the full set of data requires a lot of computing resources (CPU,
memory space, disk space, disk I/O, network I/O, etc.). If you have a lot of data and
you automate your systems to train from scratch every day, it will end up costing you
a lot of money. If the amount of data is huge, it may even be impossible to use a
batch learning algorithm.
Finally, if the system needs to be able to learn autonomously and has limited
resources (e.g., a smartphone application or a rover on Mars), then carrying around
large amounts of training data and spending a lot of resources to train for hours every
day is a showstopper.
So, a better option in all these cases is to use algorithms that are capable of learning
incrementally.
Online Learning
In online learning, we train the system incrementally by feeding it data instances
sequentially, either individually or in small groups called mini-batches. Each learning
step is cheap and fast, so the system can learn about new data as it arrives.
Online learning is great for systems that receive data as a continuous flow (e.g.,
stock prices) and need to adapt to change rapidly or autonomously. It is also a good
option if you have limited computing resources: once an online learning system has
learned about new data instances, it does not need them anymore, so you can
discard them (unless you want to be able to roll back to a previous state and "replay"
the data). This can save a huge amount of space.
Online learning algorithms may also be used to train systems on huge datasets that
cannot fit in one machine’s main memory which is called out-of-core learning. This
algorithm loads part of the data, runs a training step on that data, and repeats the
process until it has run on all of the data.
One important parameter of online learning systems is how fast they should adapt to
changing data: this is called the learning rate. If you set a high learning rate, the
system will rapidly adapt to new data, but it will also tend to quickly forget the old
data (and you don't want a spam filter to flag only the latest kinds of spam it was
shown). Conversely, if you set a low learning rate, the system will have more
inertia; that is, it will learn slowly, but it will also be less sensitive to noise in the new
data or to sequences of nonrepresentative data points.
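A minimal sketch of incremental (online / out-of-core) training, assuming scikit-learn's SGDClassifier (the loss name "log_loss" assumes a recent scikit-learn version); the stream of mini-batches is simulated with random data, and each batch can be discarded after the partial_fit call.

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01)  # eta0 is the learning rate
classes = np.array([0, 1])                        # all classes must be declared on the first call

for _ in range(100):                              # stream of mini-batches
    X_batch = np.random.randn(32, 10)             # stand-in for incoming data
    y_batch = (X_batch[:, 0] > 0).astype(int)     # stand-in for incoming labels
    model.partial_fit(X_batch, y_batch, classes=classes)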
Machine learning is a field of artificial intelligence that deals with giving machines the
ability to learn without being explicitly programmed. In this context, instance-based
learning and model-based learning are two different approaches used to create machine
learning models. While both approaches can be effective, they also have distinct
differences that must be taken into account when building a machine learning system.
In instance-based learning, the system learns the training examples by heart. At
prediction time, it uses a similarity measure to compare new cases with the learned
examples. However, this can be computationally expensive, since all of the training
data needs to be stored in memory before making a prediction. Additionally, the
approach does not always generalize well to unseen data, because its predictions are
based on memorized examples rather than on a learned model.
K-nearest neighbors (KNN) is an algorithm that belongs to the instance-based learning
class of algorithms. KNN is a non-parametric algorithm because it does not assume any
specific form or underlying structure in the data.
Instead, it relies on a measure of similarity between each pair of data points. Generally
speaking, this measure is based on either Euclidean distance or cosine similarity;
however, other forms of metric can be used depending on the type of data being
analyzed.
Once the similarity between points is calculated, KNN looks at the k nearest neighbors
of the new point and uses them as examples to make its prediction.
This means that instead of creating a generalizable model from all of the data, KNN
looks for similarities among individual data points and makes predictions
accordingly. The picture below demonstrates how a new instance would be predicted as
a triangle, based on the greater number of triangles in its proximity.
Because KNN is an instance-based learning algorithm, it is not suitable for very large
datasets. This is because the model has to store all of the training examples in memory,
and making predictions on new data points involves comparing the new point to all of
the stored training examples. However, for small or medium-sized datasets, KNN can be
a very effective learning algorithm.
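A short instance-based (KNN) sketch, assuming scikit-learn; the Iris data and k = 5 are illustrative choices. The model effectively memorizes the training set and predicts by majority vote of the nearest neighbors.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)                         # stores the training examples
print("test accuracy:", knn.score(X_test, y_test))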
Conclusion
In conclusion, instance-based and model-based learning are two distinct approaches used
in machine learning systems. Instance-based methods require less effort but don't
generalize well, while model-based methods require more effort but produce better
generalization. It is important for anyone working with machine learning
systems to understand how these two approaches differ so they can choose the best one
systems to understand how these two approaches differ so they can choose the best one
for their specific applications. With a proper understanding of both types of machine
learning techniques, you will be able to create powerful systems that achieve your
desired goals with minimal effort and maximum accuracy!
Technological Singularity:
Although this topic attracts a lot of public attention, many scientists are not
concerned with the notion of AI exceeding human intelligence anytime in the immediate
future. This is often referred to as superintelligence, which Nick
Bostrom defines as "any intelligence that far surpasses the top human brains in virtually
every field, including general wisdom, scientific creativity and social abilities." Even
though superintelligence and strong AI are not yet a reality, the concept poses some
interesting questions when we contemplate the potential use of autonomous systems,
such as self-driving vehicles. It is unrealistic to expect that a driverless car would never
be involved in an accident, but who is responsible and liable in those situations? Should
we continue to explore autonomous vehicles, or should we restrict this technology to
semi-autonomous cars that promote driver safety? The jury is still out on this issue, but
these kinds of ethical debates are being fought as new and genuine AI technology is
developed.
AI Impact on Jobs:
While much of the public perception of artificial intelligence centres around job
loss, this concern should probably be reframed. With each new and disruptive technology, we
can see shifts in demand for certain job positions. For instance, when we consider the
automotive industry, a lot of manufacturers like GM are focusing their efforts on
electric vehicles to be in line with green policies. The energy sector isn't going away,
but the primary source that fuels it is changing from an energy economy based on fuel
to an electrical one. Artificial intelligence should be viewed in a similar way: it is
expected to shift the demand for jobs to other areas. There will need to be people who
manage these systems as data grows and changes every day. People will still be needed
to address more complex problems within the sectors most likely to be affected by
demand shifts, such as customer service. The most important aspect of artificial
intelligence and its impact on the employment market will be helping individuals adapt
to the new areas of demand created by the market.
Privacy:
Privacy is frequently discussed in relation to data privacy, data protection, and data
security, and these concerns have helped policymakers advance their efforts in recent
years. For instance, in 2016 GDPR legislation was introduced to safeguard the personal
information of individuals in the European Union and European Economic Area, giving
individuals more control over their data. In the United States, individual states are
creating policies, including the California Consumer Privacy Act (CCPA), that require
companies to inform their customers about the processing of their data. This legislation
is forcing companies to rethink how they handle and store personally identifiable
information (PII). As a result, security investments have become a business priority to
eliminate any vulnerabilities and opportunities for hacking, surveillance, and
cyber-attacks.
Bias and Discrimination:
Discrimination and bias in different intelligent machines have brought up several
ethical issues about using artificial intelligence. How can we protect ourselves from
bias and discrimination when training data could be biased? While most companies
have well-meaning intentions with regard to their automation initiatives, Reuters
highlights the unintended consequences of incorporating AI into hiring practices. In
trying to automate and simplify the process, Amazon unintentionally discriminated
against job candidates by gender for technical roles, which led the company to end the
project. As events like these surface, the Harvard Business Review has raised pertinent
questions about the application of AI in hiring practices. For example, what kind of
data should you be able to analyse when evaluating a candidate for a particular job?
Discrimination and bias aren't just limited to the human resource function. They are
present in a variety of applications ranging from software for facial recognition to
algorithms for social media.
Accountability:
There is no significant legislation to regulate AI practices, and no real enforcement
mechanism to ensure that ethical AI is practiced. Companies' primary motivation to
adhere to these standards is the negative effect an untrustworthy AI system can have on
their bottom line. To address the issue, ethical frameworks have been developed in a
partnership between researchers and ethicists to regulate the creation and use of AI
models. For the time being, however, these only serve to guide the development of AI
models, and research has shown that shared responsibility and insufficient awareness of
potential effects are not ideal for protecting society from harm.
Terminology: an observation (example) is a single row of data, e.g., Sanidhya, 62, 5.11, 18.
3. Poor-Quality Data
If your training data is full of errors, outliers, and noise, it will make
it harder for the system to detect the underlying patterns.
Outliers -
• Discard or Fix
4. Irrelevant Features
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points,
or more data points than necessary, in the given dataset. Because of this, the model
starts capturing the noise and inaccurate values present in the data, and these factors
reduce the efficiency and accuracy of the model. An overfitted model has low bias and
high variance.
As we can see from the above graph, the model tries to cover all the data points
present in the scatter plot. It may look efficient, but in reality it is not. The goal of a
regression model is to find the best-fit line, and since we have not got a good general
fit here, the model will generate prediction errors on new data.
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting, the feeding of training data can be
stopped at an early stage, but then the model may not learn enough from the training
data. As a result, it may fail to find the best fit of the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training
data, and hence it reduces the accuracy and produces unreliable predictions.
Example: We can understand the underfitting using below output of the linear
regression model:
As we can see from the above diagram, the model is unable to capture the data points
present in the plot.
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine
learning models to achieve the goodness of fit. In statistics modeling, it defines how
closely the result or predicted values match the true values of the dataset.
The model with a good fit is between the underfitted and overfitted model, and ideally,
it makes predictions with 0 errors, but in practice, it is difficult to achieve it.
As when we train our model for a time, the errors in the training data go down, and
the same happens with test data. But if we train the model for a long duration, then
the performance of the model may decrease due to the overfitting, as the model also
learn the noise present in the dataset. The errors in the test dataset start increasing, so
the point, just before the raising of errors, is the good point, and we can stop here for
achieving a good model.
• Unit tests. The program is broken down into blocks, and each
element (unit) is tested separately.
• Regression tests. They cover already tested software to see if it
doesn’t suddenly break.
• Integration tests. This type of testing observes how multiple
components of the program work together.
First of all, you split the database into three non-overlapping sets. You use
a training set to train the model. Then, to evaluate the performance of the
model, you use two sets of data:
• Validation set. Having only a training set and a testing set is not
enough if you do many rounds of hyperparameter tuning (which is
almost always the case), because that can result in overfitting. To avoid
that, you can set aside a small validation data set to evaluate the model.
Only after you get the best accuracy you can on the validation set do you
bring the testing set into the game.
• Test set (or holdout set). Your model might fit the training dataset
perfectly well, but where are the guarantees that it will do equally well
in real life? To assess that, you set aside samples for a test set
from your data — examples that the machine hasn't seen during
training. It is important to remain unbiased during selection and draw
samples at random. Also, you should not use the same set many
times, to avoid effectively training on your test data. Your test set should be large
enough to provide statistically meaningful results and be
representative of the data set as a whole.
But just like test sets, validation sets "wear out" when used repeatedly. The
more times you use the same data to make decisions about
hyperparameter settings or other model improvements, the less confident
you can be that the model will generalize well on new, unseen data. So it is a
good idea to collect more data to "freshen up" the test set and validation set.
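A minimal sketch of the three-way split described above, assuming scikit-learn; the 60/20/20 proportions and the Iris data are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))      # non-overlapping train / validation / test sets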
UNIT 2
What is Classification?
Binary Classification
(Table omitted: typical binary classification applications, each listing an observation and its two possible classes, labelled 0 and 1.)
Quick example
In a medical diagnosis, a binary classifier for a specific disease could
take a patient's symptoms as input features and predict whether the
patient is healthy or has the disease. The possible outcomes of the
diagnosis are positive and negative.
• True Positive (TP): The patient is diseased and the model predicts "diseased"
• False Positive (FP): The patient is healthy but the model predicts "diseased"
• True Negative (TN): The patient is healthy and the model predicts "healthy"
• False Negative (FN): The patient is diseased and the model predicts "healthy"
• Naive Bayes
• Nearest Neighbor
• Decision Trees
• Logistic Regression
• Neural Networks
Confusion Matrix
A confusion matrix summarizes a classifier's results as counts of TP, FP, TN and FN.
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Error Rate
Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 − Accuracy
Example: on a highly imbalanced dataset, a model that always predicts the majority
(negative) class can still achieve high accuracy. This is misleading because the model
doesn't detect any POS class instances. Detecting the rare class is usually more
interesting (examples: fraud, spam, cancer detection, etc.).
Precision and Recall
Precision = TP / (TP + FP) and Recall = TP / (TP + FN). For example, with TP = 12, FP = 3 and FN = 8:
precision = 12 / 15 = 80%
recall = 12 / 20 = 60%
F-measure
The two metrics above precision and recall can be combined into
a single metric called F-measure. It is a harmonic mean of
precision and recall. The harmonic mean of two numbers x and y is
close to the smaller of the two numbers. Hence, a high value of F-
measure ensures both precision and recall are reasonably high.
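Reproducing the numbers above (TP = 12, FP = 3 and FN = 8 are implied by precision = 12/15 and recall = 12/20), the harmonic mean works out as follows:

tp, fp, fn = 12, 3, 8
precision = tp / (tp + fp)                        # 0.80
recall = tp / (tp + fn)                           # 0.60
f_measure = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f_measure, 3))     # 0.8 0.6 0.686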
3.1. Cross-validation: evaluating estimator performance
Learning the parameters of a prediction function and testing it on the same data is a
methodological mistake: a model that would just repeat the labels of the samples
that it has just seen would have a perfect score but would fail to predict anything
useful on yet-unseen data. This situation is called overfitting. To avoid it, it is
common practice when performing a (supervised) machine learning experiment to
hold out part of the available data as a test set X_test, y_test. Note that the word
“experiment” is not intended to denote academic use only, because even in
commercial settings machine learning usually starts out experimentally. Here is a
flowchart of typical cross validation workflow in model training. The best parameters
can be determined by grid search techniques.
In machine learning, we cannot simply fit the model to the training data and claim
that it will work accurately on real data. We must make sure that the model has
learned the correct patterns from the data and is not picking up too much noise. For
this purpose, we use the cross-validation technique.
Cross-validation is a technique used in machine learning to evaluate the
performance of a model on unseen data. It involves dividing the available data into
multiple folds or subsets, using one of these folds as a validation set while the model is
trained on the remaining folds, and repeating the process so that each fold serves as the
validation set once.
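A minimal k-fold cross-validation sketch, assuming scikit-learn; the Iris data, logistic regression and 5 folds are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)
print(scores, scores.mean())                      # one accuracy per fold, then the average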
Confusion Matrix:
AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC
stands for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o To visualize the performance of the multi-class classification model, we use the
AUC-ROC Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on
Y-axis and FPR(False Positive Rate) on X-axis.
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve
plots two parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
o TPR (recall) = TP / (TP + FN), and FPR = FP / (FP + TN).
o The above picture is taken from the Iris dataset which depicts that
the target variable has three categories i.e., Virginica, setosa, and
Versicolor, which are three species of Iris plant. We might use this
dataset later, as an example of a conceptual understanding of
multiclass classification.
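A short sketch of computing the ROC curve and AUC just described for a binary classifier, assuming scikit-learn; the synthetic data and logistic regression are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probs = LogisticRegression(max_iter=500).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)     # TPR and FPR at every threshold
print("AUC:", roc_auc_score(y_te, probs))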
This helps know how errors are distributed across key hypotheses or key
features/classes/cohorts of the dataset. For example, in a loan approval
model used by a bank, it might be that the model is giving more errors on the
individuals who are younger and have a low monthly average balance with
the bank.
A. How to do this (manually) especially in case your data is image, voice, or
text where you might not have apparent features.
Multi-Label Classification with Deep Learning
Multi-label classification involves predicting zero or more class labels.
Unlike normal classification tasks where class labels are mutually exclusive, multi-label
classification requires specialized machine learning algorithms that support predicting
multiple mutually non-exclusive classes or “labels.”
Deep learning neural networks are an example of an algorithm that natively supports
multi-label classification problems. Neural network models for multi-label classification
tasks can be easily defined and evaluated using the Keras deep learning library.
Multi-Label Classification
For multi-label classification, the data has more than one target (output)
label per sample, and the cardinality of each label should be 2
(binary). The Stack Overflow tag prediction dataset is an example of a
multi-label classification problem. In this type of classification
problem, there is more than one output prediction per sample.
Most classification machine learning algorithms are not able
to handle multi-label classification natively. One needs to use a wrapper
around the algorithm to train on multi-label
classification data. Scikit-learn provides two such wrapper
implementations: MultiOutputClassifier and ClassifierChain (both in sklearn.multioutput).
Some classification tasks require predicting more than one class label. This means that
class labels or class membership are not mutually exclusive. These tasks are referred to
as multiple label classification, or multi-label classification for short.
In multi-label classification, zero or more labels are required as output for each input
sample, and the outputs are required simultaneously. The assumption is that the output
labels are a function of the inputs.
We can create a synthetic multi-label classification dataset using
the make_multilabel_classification() function in the scikit-learn library.
Our dataset will have 1,000 samples with 10 input features. The dataset will have three
class label outputs for each sample and each class will have one or two values (0 or 1,
e.g. present or not present).
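A sketch of the dataset-creation step described above (the exact code the text refers to is not included in this copy, so this is a reconstruction using the stated settings and assuming scikit-learn):

from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(n_samples=1000, n_features=10, n_classes=3,
                                      n_labels=2, random_state=1)
print(X.shape, y.shape)                           # (1000, 10) (1000, 3)
for i in range(10):
    print(i, X[i], y[i])                          # first 10 rows of inputs and output labels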
Running the example creates the dataset and summarizes the shape of the input and
output elements.
We can see that, as expected, there are 1,000 samples, each with 10 input features and
three output features.
The first 10 rows of inputs and outputs are summarized and we can see that all inputs
for this dataset are numeric and that output class labels have 0 or 1 values for each of
the three class labels.
4 [7. 6. 4. 4. 6. 8. 3. 4. 6. 4.] [0 0 0]
6 [1. 1. 5. 5. 7. 3. 4. 6. 4. 4.] [1 1 1]
8 [ 4. 3. 3. 2. 5. 2. 3. 7. 2. 10.] [0 0 0]
ENSEMBLE LEARNING
A greater number of trees in the forest generally leads to higher accuracy and helps
prevent the problem of overfitting.
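A minimal random forest sketch, assuming scikit-learn; the synthetic dataset and 100 trees are illustrative assumptions. Each tree votes, and the majority vote becomes the final prediction.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)   # 100 decision trees
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))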
The below diagram explains the working of the Random Forest algorithm:
Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into subsets and
given to each decision tree. During the training phase, each decision tree produces a
prediction result, and when a new data point occurs, then based on the majority of
results, the Random Forest classifier predicts the final decision. Consider the below
image:
ML | Voting Classifier
A Voting Classifier is a machine learning model that trains an ensemble of
numerous models and predicts an output (class) based on the class with the highest
probability of being chosen as the output.
It simply aggregates the findings of each classifier passed into the Voting Classifier
and predicts the output class based on the majority of votes. The idea is that
instead of creating separate dedicated models and finding the accuracy of each of
them, we create a single model which trains on these models and predicts output
based on their combined majority vote for each output class.
Voting Classifier supports two types of votings.
1. Hard Voting: In hard voting, the predicted output class is the class that
receives the majority of votes, i.e., the class that was predicted most often
by the individual classifiers. Suppose three classifiers predicted the output
classes (A, A, B); the majority predicted A, so A will be the final prediction.
2. Soft Voting: In soft voting, the output class is the prediction based on
the average of probability given to that class. Suppose given some input
to three models, the prediction probability for class A = (0.30, 0.47,
0.53) and B = (0.20, 0.32, 0.40). So the average for class A is 0.4333 and B
is 0.3067, the winner is clearly class A because it had the highest
probability averaged by each classifier.
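A short sketch of hard versus soft voting, assuming scikit-learn's VotingClassifier; the three base models and the synthetic data are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier

X, y = make_classification(n_samples=500, random_state=0)
estimators = [("lr", LogisticRegression(max_iter=500)),
              ("dt", DecisionTreeClassifier(random_state=0)),
              ("nb", GaussianNB())]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)   # majority of predicted classes
soft = VotingClassifier(estimators, voting="soft").fit(X, y)   # average of predicted probabilities
print(hard.score(X, y), soft.score(X, y))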
Out-Of-Bag Sample
In the example above, we can observe that some animals are repeated while
making a sample, and some animals do not occur even once in the sample.
Here, Sample 1 does not have Rat and Cow, whereas Sample 3 has all the
animals, the same as the main training set.
While making the samples, data points were chosen randomly and with
replacement, and the data points which fail to be a part of that particular
sample are known as OUT-OF-BAG points.
Here, we have a training set with 5 rows and a classification target variable
indicating whether each animal is a domestic pet.
Out of multiple decision trees built in the random forest, a bootstrapped
sample for one particular decision tree, say DT_1 is shown below
Here, Rat and Cat data have been left out. And since, Rat and Cat are OOB
for DT_1, we would predict the values for Rat and Cat using DT_1. (Note:
Data of Rat and Cat hasn’t been seen by DT_1 while training the tree.)
Similarly, every data point is passed for prediction to trees where it would be
behaving as OOB and an aggregated prediction is recorded for each row.
• Model agnostic
• Model dependent
Algorithm:
1. Initialise the dataset and assign equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified
data points.
3. Increase the weight of the wrongly classified data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
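The reweighting loop above is essentially what AdaBoost implements; a minimal sketch assuming scikit-learn (the synthetic data and 50 boosting rounds are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)                                   # each round upweights the previously misclassified points
print("training accuracy:", boost.score(X, y))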
ML – Gradient Boosting
The ensemble consists of N trees. Tree1 is trained using the feature matrix X and
the labels y. The predictions labelled y1(hat) are used to determine the training
set residual errors r1. Tree2 is then trained using the feature matrix X and the
residual errors r1 of Tree1 as labels. The predicted results r1(hat) are then used
to determine the residual r2. The process is repeated until all the N trees forming
the ensemble are trained.
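A sketch of this sequential residual-fitting idea, assuming scikit-learn's GradientBoostingRegressor; the synthetic data, 100 trees, depth 3 and learning rate 0.1 are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
gbr.fit(X, y)                                     # each new tree fits the residual errors of the ensemble so far
print("R^2 on training data:", gbr.score(X, y))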
UNIT 3
The given figure illustrates the typical diagram of Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.
Dendrites → Inputs
Synapse → Weights
Axon → Output
There are around 1,000 billion neurons in the human brain. Each neuron has
somewhere between 1,000 and 100,000 association points (connections). In the human brain,
data is stored in such a manner as to be distributed, and we can extract more than one
piece of this data when necessary from our memory parallelly. We can say that the
human brain is made up of incredibly amazing parallel processors.
We can understand the artificial neural network with an example, consider an example
of a digital logic gate that takes an input and gives an output. "OR" gate, which takes
two inputs. If one or both the inputs are "On," then we get "On" in output. If both the
inputs are "Off," then we get "Off" in output. Here the output depends upon input. Our
brain does not perform the same task. The outputs to inputs relationship keep
changing because of the neurons in our brain, which are "learning."
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the
calculations needed to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
Imagine a future where you are not stuck in traffic because machines
are driving the vehicles, not humans. A future where everyone has a
personal assistant to do mundane tasks. A future where industry is
more productive and per capita income is rising.
That future is already here, just not evenly distributed across the
world. Some parts of the world are experiencing it earlier than the
others. A big driving force behind this development is advancement
of AI technologies.
The industrial revolution took people out of field jobs and moved them
to offices to manage processes and troubleshoot problems. Humans
switched to managing the job while the real job was done by robotic
assembly lines or machines.
Perceptron in Machine Learning
In Machine Learning and Artificial Intelligence, the Perceptron is one of the most
commonly used terms. It is the primary step in learning Machine Learning and Deep
Learning technologies, and it consists of a set of weights, input values or scores, and a
threshold. The Perceptron is a building block of an Artificial Neural Network. It was
invented by Frank Rosenblatt in 1957 (the middle of the 20th century) to perform
certain calculations for detecting capabilities in input data. The Perceptron is a linear
Machine Learning algorithm used for supervised learning of various binary classifiers.
The algorithm enables neurons to learn elements and process them one by one during
training. In this tutorial, "Perceptron in Machine Learning," we will discuss Perceptron
and its basic functions in brief. Let's start with a basic introduction to the Perceptron.
Perceptron model is also treated as one of the best and simplest types of Artificial
Neural networks. However, it is a supervised learning algorithm of binary classifiers.
Hence, we can consider it as a single-layer neural network with four main parameters,
i.e., input values, weights and Bias, net sum, and an activation function.
o Input Nodes (Input Layer):
This is the primary component of the Perceptron, which accepts the initial data into the
system for further processing. Each input node contains a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units. It is
another important parameter of the Perceptron's components. Weight is directly
proportional to the strength of the associated input neuron in deciding the output.
Further, bias can be considered as the intercept term in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the
neuron will fire or not. Activation Function can be considered primarily as a step
function.
o Sign function
o Step function, and
o Sigmoid function
How does Perceptron work?
In Machine Learning, Perceptron is considered as a single-layer neural network that
consists of four main parameters named input values (Input nodes), weights and Bias,
net sum, and an activation function. The perceptron model begins with the
multiplication of all input values and their weights, then adds these values together to
create the weighted sum. Then this weighted sum is applied to the activation function
'f' to obtain the desired output. This activation function is also known as the step
function and is represented by 'f'.
This step function or Activation function plays a vital role in ensuring that output is
mapped between required values (0,1) or (-1,1). It is important to note that the weight
of input is indicative of the strength of a node. Similarly, an input's bias value gives the
ability to shift the activation function curve up or down.
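A tiny numerical sketch of the computation just described; the input values, weights and bias below are made-up numbers, and the step function is used as the activation f.

import numpy as np

x = np.array([1.0, 0.5])                          # input values
w = np.array([0.4, -0.7])                         # weights
b = 0.1                                           # bias

weighted_sum = np.dot(w, x) + b                   # w1*x1 + w2*x2 + b
output = 1 if weighted_sum >= 0 else 0            # step activation function f
print(weighted_sum, output)                       # 0.15 -> the neuron fires (outputs 1)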
o Forward Stage: Activation functions start from the input layer in the forward stage and
terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified according
to the model's requirements. In this stage, the error between the actual and the desired
output is propagated backwards, starting at the output layer and ending at the input layer.
We are still making use of a gradient descent optimization algorithm which acts to
minimize the error of our model by iteratively moving in the direction with the steepest
descent, the direction which updates the parameters of our model while ensuring the
minimal error. It updates the weight of every model in every single layer. We will talk
more about optimization algorithms and backpropagation later.
In the feed-forward neural network, there are not any feedback loops or connections
in the network. Here is simply an input layer, a hidden layer, and an output layer.
There can be multiple hidden layers which depend on what kind of data you are
dealing with. The number of hidden layers is known as the depth of the neural network.
A deeper neural network can learn more complex functions. The input layer first
provides the neural network with data, and the output layer then makes predictions on
that data based on a series of functions. ReLU is the most commonly used
activation function in deep neural networks.
1) The first input is fed to the network, represented as a vector of x1, x2, and 1,
where 1 is the bias value.
2) Each input is multiplied by a weight with respect to the first and second model to
obtain its probability of being in the positive region in each model.
So, we multiply our inputs by a matrix of weights using matrix multiplication.
3) After that, we take the sigmoid of our scores, which gives us the probability of the
point being in the positive region in both models.
4) We multiply the probability which we have obtained from the previous step with the
second set of weights. We always include a bias of one whenever taking a combination
of inputs.
As we know, to obtain the probability of the point being in the positive region of
this model, we take the sigmoid, thus producing our final output of the feed-forward
process.
Let's take the neural network we had previously, with the linear models in the hidden
layer combining to form the non-linear model in the output layer. We use this
non-linear model to produce an output that describes the probability of the point being
in the positive region. The point is represented by (2, 2); along with the bias, we
represent the input as (2, 2, 1).
Recall the equation that defined the first linear model in the hidden layer: to obtain its
linear combination, the inputs are multiplied by -4 and -1, and the bias value is
multiplied by twelve.
The weight of the inputs are multiplied by -1/5, 1, and the bias is multiplied by three
to obtain the linear combination of that same point in our second model.
Now, to obtain the probability of the point is in the positive region relative to both
models we apply sigmoid to both points as
The second layer contains the weights which dictated the combination of the linear
models in the first layer to obtain the non-linear model in the second layer. The weights
are 1.5, 1, and a bias value of 0.5.
Now, we have to multiply our probabilities from the first layer with the second set of
weights as
This is the complete math behind the feed-forward process, where the inputs from the
input layer traverse the entire depth of the neural network. In this example, there is
only one hidden layer; whether there is one hidden layer or twenty, the computational
process is the same for every hidden layer.
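The worked example above can be reproduced in a few lines of NumPy (a sketch; the weights and the input point (2, 2) come from the example itself):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 2.0, 1.0])                     # inputs plus a bias input of 1
W1 = np.array([[-4.0, -1.0, 12.0],                # first hidden linear model
               [-0.2,  1.0,  3.0]])               # second hidden linear model (-1/5, 1, 3)
h = sigmoid(W1 @ x)                               # probabilities from the first layer

W2 = np.array([1.5, 1.0, 0.5])                    # second-layer weights and bias
output = sigmoid(W2 @ np.append(h, 1.0))          # final feed-forward output
print(h, output)                                  # ~[0.88, 0.99] and ~0.94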
Neuron Model
A linear neuron with R inputs is shown below.
This network has the same basic structure as the perceptron. The only difference is that the
linear neuron uses a linear transfer function purelin.
The linear transfer function calculates the neuron's output by simply returning the value
passed to it.
a = purelin(n) = purelin(Wp + b) = Wp + b
This neuron can be trained to learn an affine function of its inputs, or to find a linear
approximation to a nonlinear function. A linear network cannot, of course, be made to
perform a nonlinear computation.
Network Architecture
The linear network shown below has one layer of S neurons connected to R inputs through a
matrix of weights W.
Note that the figure on the right defines an S-length output vector a.
A single-layer linear network is shown. However, this network is just as capable as multilayer
linear networks. For every multilayer linear network, there is an equivalent single-layer linear
network.
What are the limitations of linear neural networks?
• Linear neurons are easy to compute with, but they run into
serious limitations. In fact, it can be shown that any feed-
forward neural network consisting of only linear neurons can be expressed as
a network with no hidden layers.
The softmax function is also a type of sigmoid function, but it is handy when we
are trying to handle multi-class classification problems.
• Nature :- non-linear
• Uses :- Usually used when trying to handle multiple classes. The
softmax function is commonly found in the output layer of image
classification problems. It squeezes the outputs for each class
between 0 and 1 and also divides by the sum of the outputs.
• Output:- The softmax function is ideally used in the output layer of
the classifier where we are actually trying to attain the probabilities
to define the class of each input.
• The basic rule of thumb is if you really don’t know what activation
function to use, then simply use RELU as it is a general activation
function in hidden layers and is used in most cases these days.
• If your output is for binary classification then, sigmoid function is
very natural choice for output layer.
• If your output is for multi-class classification then, Softmax is very
useful to predict the probabilities of each classes.
Although it is a subclass of the sigmoid function, the softmax function comes in handy
when dealing with multiclass classification issues.
Used frequently when managing several classes. In the output nodes of image
classification issues, the softmax was typically present. The softmax function would
split by the sum of the outputs and squeeze all outputs for each category between 0
and 1.
The output unit of the classifier, where we are actually attempting to obtain the
probabilities to determine the class of each input, is where the softmax function is best
applied.
The usual rule of thumb is to utilise ReLU if we are really unsure of which activation
function to apply; it is a general-purpose activation for hidden layers and is employed
in the majority of cases these days.
A very logical choice for the output layer is the sigmoid function if your output is for
binary classification. If our output involves multiple classes, softmax can be quite
helpful in predicting the probabilities for each class.
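A minimal NumPy softmax, written as a sketch of the behaviour described above: raw scores are squeezed into values between 0 and 1 that sum to 1.

import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))        # subtract the max for numerical stability
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, 0.1])))         # ~[0.66, 0.24, 0.10]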
Figure 2-1. This is the neuron we want to train for the fast-food problem
Let’s continue with the example we mentioned in the previous chapter
involving a linear neuron. As a brief review, every single day, we purchase a
restaurant meal consisting of burgers, fries, and sodas. We buy some number
of servings for each item. We want to be able to predict how much a meal is
going to cost us, but the items don’t have price tags. The only thing the cashier
will tell us is the total price of the meal. We want to train a single linear
neuron to solve this problem. How do we do it?
One idea is to be intelligent about picking our training cases. For one meal we
could buy only a single serving of burgers, for another we could only buy a
single serving of fries, and then for our last meal we could buy a single serving
of soda. In general, intelligently selecting training examples is a very good
idea. There’s lots of research that shows that by engineering a clever training
set, you can make your neural network a lot more effective. The issue with
using this approach alone is that in real situations, it rarely ever gets you close
to the solution. For example, there’s no clear analog of this strategy in image
recognition. It’s just not a practical solution.
Instead, we try to motivate a solution that works well in general. Let’s say we
have a large set of training examples. Then we can calculate what the neural
network will output on the i-th training example using the simple formula in
the diagram. We want to train the neuron so that we pick the optimal weights
possible—the weights that minimize the errors we make on the training
examples. In this case, let’s say we want to minimize the square error over all
of the training examples that we encounter. More formally, if we know
that t(i) is the true answer for the i-th training example and y(i) is the value
computed by the neural network, we want to minimize the value of the error
function E:
E = ½ Σᵢ (t(i) − y(i))²
The squared error is zero when our model makes a perfectly correct
prediction on every training example. Moreover, the closer E is to 0, the better
our model is. As a result, our goal will be to select our parameter
vector θ (the values for all the weights in our model) such that E is as close
to 0 as possible.
Now at this point you might be wondering why we need to bother ourselves
with error functions when we can treat this problem as a system of
equations. After all, we have a bunch of unknowns (weights) and we have a
set of equations (one for each training example). That would automatically
give us an error of 0, assuming that we have a consistent set of training
examples.
That’s a smart observation, but the insight unfortunately doesn’t generalize
well. Remember that although we’re using a linear neuron here, linear
neurons aren’t used very much in practice because they’re constrained in
what they can learn. And the moment we start using nonlinear neurons like
the sigmoidal, tanh, or ReLU neurons we talked about at the end of the
previous chapter, we can no longer set up a system of equations! Clearly we
need a better strategy to tackle the training process.
In this tutorial on Gradient Descent in Machine Learning, we will learn in detail about
gradient descent, the role of cost functions specifically as a barometer within Machine
Learning, types of gradient descents, learning rates, etc.
The best way to define the local minimum or local maximum of a function using
gradient descent is as follows:
o If we move towards a negative gradient or away from the gradient of the function at
the current point, it will give the local minimum of that function.
o Whenever we move towards a positive gradient or towards the gradient of the function
at the current point, we will get the local maximum of that function.
Moving towards the positive gradient is known as gradient ascent, while moving towards
the negative gradient is known as gradient descent (or steepest descent). The main
objective of using a gradient descent algorithm is to minimize the cost function through
iteration. To achieve this goal, it performs two steps iteratively (a small numerical
sketch follows the list):
o Calculate the first-order derivative of the function to compute the gradient, or slope,
of that function.
o Move in the direction opposite to the gradient, i.e., step away from the current point by
alpha times the gradient, where alpha is the learning rate. The learning rate is a tuning
parameter in the optimization process which helps to decide the length of the steps.
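A bare-bones gradient descent sketch on a one-dimensional cost function J(w) = (w − 3)², whose minimum is at w = 3 (the function, starting point and learning rate are illustrative assumptions):

def gradient(w):
    return 2 * (w - 3)                            # first-order derivative of J(w) = (w - 3)^2

w = 0.0                                           # arbitrary starting point
alpha = 0.1                                       # learning rate
for _ in range(100):
    w = w - alpha * gradient(w)                   # step against the gradient
print(round(w, 4))                                # converges close to 3.0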
What is Cost-function?
The cost function is defined as the measurement of difference or error between
actual values and expected values at the current position and present in the form
of a single real number. It helps to increase and improve machine learning efficiency
by providing feedback to this model so that it can minimize error and find the local or
global minimum. Further, it continuously iterates along the direction of the negative
gradient until the cost function approaches zero. At this steepest descent point, the
model will stop learning further. Although the cost function and the loss function are
often treated as synonymous, there is a slight difference between them: the loss
function refers to the error of one training example, while the cost function is the
average error across the entire training set.
The cost function is calculated after making a hypothesis with initial parameters and
modifying these parameters using gradient descent algorithms over known data to
reduce the cost function.
Hypothesis: h(X) = mX + c
Parameters: m and c
Cost function: the mean squared error between the predicted values h(X) and the actual values Y
Goal: find the values of m and c that minimize the cost function
1. Y=mX+c
Where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.
The starting point(shown in above fig.) is used to evaluate the performance as it is
considered just as an arbitrary point. At this starting point, we will derive the first
derivative or slope and then use a tangent line to calculate the steepness of this slope.
Further, this slope will inform the updates to the parameters (weights and bias).
The slope is steeper at the starting (arbitrary) point, but as new parameters are
generated the steepness gradually reduces, until the curve reaches its lowest point,
which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, i.e., the error
between expected and actual values. To minimize the cost function, two data points are
required: the direction of the step and the learning rate.
These two factors determine the partial derivative calculations of future iterations,
allowing the algorithm to arrive at the point of convergence, a local minimum or the global minimum.
Let's discuss learning rate factors in brief;
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is
typically a small value that is evaluated and updated based on the behavior of the cost
function. If the learning rate is high, it results in larger steps but also leads to risks of
overshooting the minimum. At the same time, a low learning rate shows the small step
sizes, which compromises overall efficiency but gives the advantage of more precision.
o Before we derive the exact algorithm for training our fast-food neuron,
a quick note on hyperparameters. In addition to the weight parameters
defined in our neural network, learning algorithms also require a
couple of additional parameters to carry out the training process. One of
these so-called hyperparameters is the learning rate.
o In practice, at each step of moving perpendicular to the contour, we
need to determine how far we want to walk before recalculating our
new direction. This distance needs to depend on the steepness of the
surface. Why? The closer we are to the minimum, the shorter we want
to step forward. We know we are close to the minimum, because the
surface is a lot flatter, so we can use the steepness as an indicator of
how close we are to the minimum. However, if our error surface is
rather mellow, training can potentially take a large amount of time. As a
result, we often multiply the gradient by a factor ε, the learning
rate. Picking the learning rate is a hard problem (Figure 2-4). As we just
discussed, if we pick a learning rate that’s too small, we risk taking too
long during the training process. But if we pick a learning rate that’s too
big, we’ll mostly likely start diverging away from the minimum.
In Chapter 3, we’ll learn about various optimization techniques that
utilize adaptive learning rates to automate the process of selecting
learning rates.
o Figure 2-4. Convergence is difficult when our learning rate is too large
o Now, we are finally ready to derive the delta rule for training our linear
neuron. In order to calculate how to change each weight, we evaluate
the gradient, which is essentially the partial derivative of the error
function with respect to each of the weights. In other words, we want:
o Δw_k = −ε ∂E/∂w_k = ε Σᵢ x_k(i) (t(i) − y(i))
Backpropagation is an iterative, recursive and efficient method for calculating updated
weights to improve the network until it is able to perform the task for which it is being
trained. Backpropagation requires the derivatives of the activation functions to be
known at network design time.
Now, how is the error function used in backpropagation, and how does backpropagation work?
Let's start with an example and work through it mathematically to understand exactly how
backpropagation updates the weights.
Input values
X1=0.05
X2=0.10
Initial weight
w1=0.15 w5=0.40
w2=0.20 w6=0.45
w3=0.25 w7=0.50
w4=0.30 w8=0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
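A NumPy sketch of the forward pass for the values above (this reproduces the standard two-input, two-hidden, two-output worked example; only the forward step and the total error are computed here, before any weight update):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.05, 0.10
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

h1 = sigmoid(0.15 * x1 + 0.20 * x2 + b1)          # uses w1, w2
h2 = sigmoid(0.25 * x1 + 0.30 * x2 + b1)          # uses w3, w4
o1 = sigmoid(0.40 * h1 + 0.45 * h2 + b2)          # uses w5, w6
o2 = sigmoid(0.50 * h1 + 0.55 * h2 + b2)          # uses w7, w8

E_total = 0.5 * (t1 - o1) ** 2 + 0.5 * (t2 - o2) ** 2
print(round(o1, 5), round(o2, 5), round(E_total, 5))   # ~0.75137, ~0.77293, ~0.29837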
One of the major issues with artificial neural networks is that the models are
quite complicated. For example, let’s consider a neural network that’s pulling
data from an image from the MNIST database (28 x 28 pixels), feeds into two
hidden layers with 30 neurons, and finally reaches a softmax layer of 10
neurons. The total number of parameters in the network is nearly 25,000. This
can be quite problematic, and to understand why, let’s consider a new toy
example, illustrated in Figure 2-8.
Figure 2-8. Two potential models that might describe our dataset: a linear model versus a degree 12
polynomial
We are given a bunch of data points on a flat plane, and our goal is to find a
curve that best describes this dataset (i.e., will allow us to predict the y-
coordinate of a new point given its x-coordinate). Using the data, we train two
different models: a linear model and a degree 12 polynomial. Which curve
should we trust? The line which gets almost no training example correctly? Or
the complicated curve that hits every single point in the dataset? At this point
we might trust the linear fit because it seems much less contrived. But just to
be sure, let’s add more data to our dataset! The result is shown in Figure 2-9.
Figure 2-9. Evaluating our model on new data indicates that the linear fit is a much better model
than the degree 12 polynomial
Now the verdict is clear: the linear model is not only better subjectively but
also quantitatively (measured using the squared error metric). But this leads to
a very interesting point about training and evaluating machine learning
models. By building a very complex model, it’s quite easy to perfectly fit our
training dataset because we give our model enough degrees of freedom to
contort itself to fit the observations in the training set. But when we evaluate
such a complex model on new data, it performs very poorly. In other words,
the model does not generalize well. This is a phenomenon called overfitting, and
it is one of the biggest challenges that a machine learning engineer must
combat. This becomes an even more significant issue in deep learning, where
our neural networks have large numbers of layers containing many neurons.
The number of connections in these models is astronomical, reaching the
millions. As a result, overfitting is commonplace.
Let’s see how this looks in the context of a neural network. Let’s say we have a
neural network with two inputs, a softmax output of size two, and a hidden
layer with 3, 6, or 20 neurons. We train these networks using mini-batch
gradient descent (batch size 10), and the results, visualized using
ConvNetJS, are shown in Figure 2-10.
Figure 2-10. A visualization of neural networks with 3, 6, and 20 neurons (in that order) in their
hidden layer
It’s already quite apparent from these images that as the number of
connections in our network increases, so does our propensity to overfit to the
data. We can similarly see the phenomenon of overfitting as we make our
neural networks deep. These results are shown in Figure 2-11, where we use
networks that have one, two, or four hidden layers of three neurons each.
Figure 2-11. A visualization of neural networks with one, two, and four hidden layers (in that
order) of three neurons each
This leads to three major observations. First, the machine learning engineer is
always working with a direct trade-off between overfitting and model
complexity. If the model isn’t complex enough, it may not be powerful enough
to capture all of the useful information necessary to solve a problem.
However, if our model is very complex (especially if we have a limited amount
of data at our disposal), we run the risk of overfitting. Deep learning takes the
approach of solving very complex problems with complex models and taking
additional countermeasures to prevent overfitting. We’ll see a lot of these
measures in this chapter as well as in later chapters.
Second, it is very misleading to evaluate a model using the data we used to
train it. Using the example in Figure 2-8, this would falsely suggest that the
degree 12 polynomial model is preferable to a linear fit. As a result, we almost
never train our model on the entire dataset. Instead, as shown in Figure 2-12,
we split up our data into a training set and a test set.
Figure 2-12. We often split our data into nonoverlapping training and test sets in order to fairly
evaluate our model
This enables us to make a fair evaluation of our model by directly measuring
how well it generalizes on new data it has not yet seen. In the real world, large
datasets are hard to come by, so it might seem like a waste to not use all of the
data at our disposal during the training process. Consequently, it may be very
tempting to reuse training data for testing or cut corners while compiling test
data. Be forewarned: if the test set isn't well constructed, we won't be able to
draw any meaningful conclusions about our model.
Third, it’s quite likely that while we’re training our data, there’s a point in time
where instead of learning useful features, we start overfitting to the training
set. To avoid that, we want to be able to stop the training process as soon as we
start overfitting, to prevent poor generalization. To do this, we divide our
training process into epochs. An epoch is a single iteration over the entire
training set. In other words, if we have a training set of size d and we are
doing mini-batch gradient descent with batch size b, then an epoch would be
equivalent to d/b model updates. At the end of each epoch, we want to
measure how well our model is generalizing. To do this, we use an
additional validation set, which is shown in Figure 2-13. At the end of an epoch,
the validation set will tell us how the model does on data it has yet to see. If the
accuracy on the training set continues to increase while the accuracy on the
validation set stays the same (or decreases), it’s a good sign that it’s time to
stop training because we’re overfitting.
The validation set is also helpful as a proxy measure of accuracy during the
process of hyperparameter optimization. We’ve covered several hyperparameters
so far in our discussion (learning rate, minibatch size, etc.), but we have yet to
develop a framework for how to find the optimal values for these
hyperparameters. One potential way to find the optimal setting of
hyperparameters is by applying a grid search, where we pick a value for each
hyperparameter from a finite set of options
(e.g., learning rate ∈ {0.001, 0.01, 0.1}, batch size ∈ {16, 64, 128}, ...), and train the model with
every possible permutation of hyperparameter choices. We select the
combination of hyperparameters with the best performance on the validation
set, and report the accuracy of the model trained with the best combination on the
test set.
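As a rough sketch of grid search (not from this text), the procedure is nothing more than a nested loop over candidate values; train_model, evaluate, and the train_set/validation_set/test_set splits below are hypothetical stand-ins for the project's own routines:

import itertools

learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [16, 64, 128]

best_config = None
best_val_acc = 0.0
for lr, bs in itertools.product(learning_rates, batch_sizes):
    # train_model and evaluate are hypothetical helpers, not defined in this text.
    model = train_model(train_set, learning_rate=lr, batch_size=bs)
    val_acc = evaluate(model, validation_set)
    if val_acc > best_val_acc:
        best_config, best_val_acc = (lr, bs), val_acc

# Only the winning configuration is evaluated once on the held-out test set.
final_model = train_model(train_set, learning_rate=best_config[0],
                          batch_size=best_config[1])
test_acc = evaluate(final_model, test_set)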
Figure 2-13. In deep learning we often include a validation set to prevent overfitting during the
training process
With this in mind, before we jump into describing the various ways to directly
combat overfitting, let’s outline the workflow we use when building and
training deep learning models. The workflow is described in detail in Figure 2-
14. It is a tad intricate, but it’s critical to understand the pipeline in order to
ensure that we’re properly training our neural networks.
First we define our problem rigorously. This involves determining our inputs,
the potential outputs, and the vectorized representations of both. For instance,
let’s say our goal was to train a deep learning model to identify cancer. Our
input would be an RGB image, which can be represented as a vector of pixel
values. Our output would be a probability distribution over three mutually
exclusive possibilities: 1) normal, 2) benign tumor (a cancer that has yet to
metastasize), or 3) malignant tumor (a cancer that has already metastasized to
other organs).
If, however, we are happy with the performance of our model on the training
data, then we can measure its performance on the test data, which the model
has never seen before this point. If it is unsatisfactory, we need more data in
our dataset because the test set seems to consist of example types that weren’t
well represented in the training set. Otherwise, we are finished!
There are several techniques that have been proposed to prevent overfitting
during the training process. In this section, we’ll discuss these techniques in
detail.
The most common form of regularization is L2 regularization, in which we add a penalty term of the form (λ/2)·w² for every weight w in the network; larger regularization strengths push the weights toward smaller values.
Figure 2-15. A visualization of neural networks trained with regularization strengths of 0.01, 0.1,
and 1 (in that order)
Another common type of regularization is L1 regularization. Here, we add the
term λ|w| for every weight w in the neural network. The L1 regularization has
the intriguing property that it leads the weight vectors to become sparse
during optimization (i.e., very close to exactly zero). In other words, neurons
with L1 regularization end up using only a small subset of their most
important inputs and become quite resistant to noise in the inputs. In
comparison, weight vectors from L2 regularization are usually diffuse, small
numbers. L1 regularization is very useful when you want to understand
exactly which features are contributing to a decision. If this level of feature
analysis isn’t necessary, we prefer to use L2 regularization because it
empirically performs better.
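To make the two penalties concrete, here is a minimal numpy sketch (an illustration, not this text's own code) of how each penalty is computed and added to the loss:

import numpy as np

# Stand-in weight matrix; in a real network this would be a layer's weights.
W = np.random.randn(100, 50)
lam = 0.01  # regularization strength (a hyperparameter)

# L2 penalty: proportional to the sum of squared weights.
l2_penalty = 0.5 * lam * np.sum(W ** 2)

# L1 penalty: proportional to the sum of absolute weights, which drives
# many weights toward exactly zero (sparsity).
l1_penalty = lam * np.sum(np.abs(W))

# Either penalty is added to the data loss before computing gradients:
# total_loss = data_loss + l2_penalty   (or + l1_penalty)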
Max norm constraints have a similar goal of attempting to restrict w from
becoming too large, but they do this more directly. Max norm constraints
enforce an absolute upper bound on the magnitude of the incoming weight
vector for every neuron and use projected gradient descent to enforce the
constraint. In other words, any time a gradient descent step moves the
incoming weight vector such that ‖w‖₂ > c, we project the vector back onto the
ball (centered at the origin) with radius c. Typical values of c are 3 and
4. One of the nice properties is that the parameter vector cannot grow out of
control (even if the learning rates are too high) because the updates to the
weights are always bounded.
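A minimal numpy sketch of the projection step (an illustration under the definitions above): after each gradient update, any incoming weight vector whose L2 norm exceeds the radius c is rescaled back onto the ball.

import numpy as np

def max_norm_project(W, c=3.0):
    # Each column of W is the incoming weight vector of one neuron.
    norms = np.linalg.norm(W, axis=0, keepdims=True)       # per-neuron norms
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # shrink only if norm > c
    return W * scale

W = np.random.randn(784, 100) * 2.0  # stand-in weight matrix after an update
W = max_norm_project(W, c=3.0)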
Dropout is a very different kind of method for preventing overfitting that has
become one of the most favored methods of preventing overfitting in deep
neural networks.7 While training, dropout is implemented by only keeping a
neuron active with some probability p (a hyperparameter), or setting it to
zero otherwise. Intuitively, this forces the network to be accurate even in the
absence of certain information. It prevents the network from becoming too
dependent on any one (or any small combination) of neurons. Expressed more
mathematically, it prevents overfitting by providing a way of approximately
combining exponentially many different neural network architectures
efficiently. The process of dropout is expressed pictorially in Figure 2-16.
Figure 2-16. Dropout sets each neuron in the network as inactive with some random probability
during each minibatch of training
Dropout is pretty intuitive to understand, but there are some important
intricacies to consider. First, we’d like the outputs of neurons during test time
to be equivalent to their expected outputs at training time. We could fix this
naïvely by scaling the output at test time. For example, if p = 0.5, neurons must
halve their outputs at test time in order to have the same (expected) output
they would have during training. This is easy to see because a neuron’s output
is set to 0 with probability 1 − p. This means that if a neuron’s output prior to
dropout was x, then after dropout, the expected output would
be E[output] = p·x + (1 − p)·0 = p·x. This naïve implementation of dropout is
undesirable, however, because it requires scaling of neuron outputs at test
time. Test-time performance is extremely critical to model evaluation, so it’s
always preferable to use inverted dropout, where the scaling occurs at training
time instead of at test time. In inverted dropout, any neuron whose activation
hasn’t been silenced has its output divided by p before the value is
propagated to the next layer. With this fix, E[output] = p·(x/p) + (1 − p)·0 = x, and
we can avoid arbitrarily scaling neuronal output at test time.
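A minimal numpy sketch of inverted dropout at training time (an illustration, not the text's own implementation):

import numpy as np

def inverted_dropout(activations, p=0.5):
    # Keep each activation with probability p and rescale the survivors by 1/p,
    # so the expected output at training time matches the test-time output.
    mask = (np.random.rand(*activations.shape) < p) / p
    return activations * mask

h = np.random.randn(10, 256)          # stand-in hidden-layer activations
h_train = inverted_dropout(h, p=0.5)  # used during training
h_test = h                            # at test time, no masking or scaling is needed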
Unit -4
What is TensorFlow?
TensorFlow is a popular framework for machine learning and deep learning. It is a
free and open-source library, first released on 9 November 2015 and developed by
the Google Brain team. It is used primarily through its Python API for numerical
computation and dataflow programming, which makes machine learning faster and
easier.
TensorFlow can train and run deep neural networks for image recognition,
handwritten digit classification, recurrent neural networks, word embeddings, natural
language processing, video detection, and much more. TensorFlow runs on
multiple CPUs or GPUs, as well as on mobile operating systems.
The word TensorFlow is made up of two words: Tensor and Flow.
History of TensorFlow
Many years ago, deep learning started to exceed all other machine learning algorithms
when given extensive data. Google saw that it could use these deep neural networks
to upgrade its services.
TensorFlow was first released in 2015, while the first stable version came in 2017. It is an
open-source platform under the Apache open source license. We can use it, modify it,
and redistribute the modified version for free without paying anything to Google.
Components of TensorFlow
Tensor
The name TensorFlow is derived from its core component, the "tensor." A tensor is an
n-dimensional vector or matrix that can represent any type of data. All values in a tensor
hold the same data type with a known shape. The shape of the data is the dimensionality
of the matrix or array.
A tensor can be generated from the input data or as the result of a computation. In
TensorFlow, all operations are conducted inside a graph. The graph is a set of
computations that take place successively. Each operation is called an op node, and
op nodes are connected to one another.
Graphs
TensorFlow makes use of a graph framework. The graph gathers and describes all the
computations done during the training.
Advantages
o It was designed to run on multiple CPUs or GPUs and on mobile operating systems.
o The portability of the graph allows us to preserve the computations for current or later
use. The graph can be saved so that it can be executed in the future.
o All the computation in the graph is done by connecting tensors together, for example:
d=b+c
e=c+2
a=d*e
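As an illustrative sketch (using the TF 1.x-style API that this unit uses elsewhere, with the constants b = 3 and c = 4 chosen arbitrarily), these three lines can be expressed as a small TensorFlow graph and evaluated in a session:

import tensorflow as tf

# Build the graph: nothing is computed yet, we only connect tensors.
b = tf.constant(3.0, name="b")
c = tf.constant(4.0, name="c")
d = tf.add(b, c, name="d")        # d = b + c
e = tf.add(c, 2.0, name="e")      # e = c + 2
a = tf.multiply(d, e, name="a")   # a = d * e

# Run the graph in a session to get the numeric result.
with tf.Session() as sess:
    print(sess.run(a))  # 42.0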
TensorFlow is based on graph computation; it allows the developer to visualize the
construction of the neural network with TensorBoard, a tool that helps to debug our
programs. TensorFlow runs on the CPU (Central Processing Unit) and the GPU (Graphical
Processing Unit).
TensorFlow attracts the most popularity on GitHub compared to the other deep
learning frameworks.
1. Voice/Sound Recognition
Voice and sound recognition applications are among the best-known use cases of deep
learning. If neural networks are fed the proper input data, they are
capable of understanding audio signals.
For example:
Voice recognition is used in the Internet of Things, automotive, security, and UX/UI.
2. Image Recognition
Image recognition is the first application that made deep learning and machine
learning popular. Telecom, Social Media, and handset manufacturers mostly use image
recognition. It is also used for face recognition, image search, motion detection,
machine vision, and photo clustering.
For example, image recognition is used to recognize and identify people and objects
in images and to understand the content and context of an image.
For object recognition, TensorFlow helps to classify and identify arbitrary objects within
larger images.
This is also used in engineering applications to identify shapes for modeling purposes
(3D reconstruction from 2D images) and by Facebook for photo tagging.
For example, a deep learning model can use TensorFlow to analyze thousands of
photos of cats; the algorithm learns to identify a cat by finding the general features
of objects, animals, or people.
3. Time Series
Deep learning uses time series algorithms to examine time series data and
extract meaningful statistics. For example, time series models have been used to
predict the stock market.
They can also be used to recommend TV shows or movies that people might like,
based on the shows or movies we have already watched.
4. Video Detection
Deep learning algorithms are used for video detection: motion detection,
real-time threat detection in gaming, security, airports, and the UI/UX field.
For example, NASA is developing a deep learning network for object clustering of
asteroids and orbit classification. So, it can classify and predict NEOs (Near Earth
Objects).
5. Text-Based Applications
Text-based applications are another popular use of deep learning. Sentiment analysis,
social media, threat detection, and fraud detection are examples of text-based
applications.
Some companies that currently use TensorFlow are Google, Airbnb, eBay, Intel,
Dropbox, DeepMind, Airbus, CEVA, Snapchat, SAP, Uber, Twitter, Coca-Cola, and IBM.
Features of TensorFlow
TensorFlow has an interactive multiplatform programming interface which is scalable
and reliable compared to other deep learning libraries which are available.
2. Flexible
This is one of TensorFlow's essential features in terms of operability. It is
modular, and the parts of it that we want can be made standalone.
3. Easily Trainable
It is easily trainable on the CPU as well as the GPU for distributed computing.
6. Open Source
The best thing about this machine learning library is that it is open source, so anyone
can use it as long as they have internet connectivity. People can manipulate the
library and come up with a fantastic variety of useful products. It has become
a DIY community with a massive forum, both for people getting started with it
and for those who find it hard to use.
7. Feature Columns
TensorFlow has feature columns, which can be thought of as intermediaries between
raw data and estimators, bridging the input data with our model.
9. Layered Components
TensorFlow produces layered operations of weights and biases from functions such
as tf.contrib.layers and also provides batch normalization, convolution layers, and
dropout layers. tf.contrib.layers.optimizers includes optimizers such
as Adagrad, SGD, and Momentum, which are often used to solve optimization problems
for numerical analysis.
Between these two options, the decision was difficult (and in fact, an early
version of this chapter was first written using Theano), but we chose
TensorFlow in the end for several subtle reasons. First, Theano has an
additional “graph compilation” step that took significant amounts of time
while setting up certain kinds of deep learning architectures. While small in
comparison to training time, this compilation phase proved frustrating while
writing and debugging new code. Second, TensorFlow has a much cleaner
interface as compared to Theano. Many classes of models can be expressed in
significantly fewer lines without sacrificing the expressiveness of the
framework. Finally, TensorFlow was built with production use in mind,
whereas Theano was designed by researchers almost purely for research
purposes. As a result, TensorFlow has many features out of the box and in the
works that make it a better choice for real systems (the ability to run in mobile
environments, easily build models that span multiple GPUs on a single
machine, and train large-scale networks in a distributed fashion). Although
familiarity with Theano and Torch can be extremely helpful while navigating
open source examples, overviews of these frameworks are beyond the scope
of this book.
Installing TensorFlow
Installing TensorFlow in your local development environment is straightforward if you aren’t
planning on modifying the TensorFlow source code. We use a Python package installation
manager called Pip. If you don’t already have Pip installed on your computer, use the
following commands in your terminal:
# Ubuntu/Linux 64-bit
$ sudo apt-get install python-pip python-dev
# Mac OS X
$ sudo easy_install pip
Once we have Pip (version 8.1 or later) installed on our computers, we can use the following
commands to install TensorFlow. Note the difference in Pip package naming if we would like
to install a GPU-enabled version of TensorFlow (which we strongly recommend):
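The install commands themselves are not reproduced here; a typical invocation (an assumption, since package names have varied across TensorFlow releases) looks like the following:
# CPU-only version
$ pip install --upgrade tensorflow
# GPU-enabled version
$ pip install --upgrade tensorflow-gpu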
$ source ~/.bash_profile
You should now be able to run TensorFlow from your Python shell of choice. In this tutorial,
we choose to use IPython. Using Pip, installing IPython only requires the following
command:
$ pip install ipython
We can then launch an IPython shell:
$ ipython
...
In [4]: session.run(deep_learning)
Out[4]: 'Deep Learning'
In [5]: a = tf.constant(2)
In [6]: b = tf.constant(3)
In [7]: multiply = tf.mul(a, b)
In [8]: session.run(multiply)
Out[8]: 6
Additional, up-to-date instructions and details about installation can be found on the
TensorFlow website.5
TensorFlow Variables
TensorFlow variables have three key properties:
• Variables must be explicitly initialized before a graph is used for the first time.
• We can use gradient methods to modify variables after each iteration as we search for a
model’s optimal parameter settings.
• We can save the values stored in variables to disk and restore them for later use.
These three properties are what make TensorFlow especially useful for building machine
learning models.
Creating a variable is simple, and TensorFlow provides mechanics that allow us to initialize
variables in several ways. Let’s start off by initializing a variable that describes the weights
connecting neurons between two layers of a feed-forward neural network:
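A sketch of such an initialization (the 300×200 shape is illustrative, following the TF 1.x API used in this unit):

import tensorflow as tf

# Weights connecting a 300-neuron layer to a 200-neuron layer, drawn from a
# normal distribution with standard deviation 0.5.
weights = tf.Variable(tf.random_normal([300, 200], stddev=0.5),
                      name="weights")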
TensorFlow Operations
We’ve already talked a little bit about operations in the context of variable
initialization, but these only make up a small subset of the universe of
operations available in TensorFlow. On a high level, TensorFlow operations
represent abstract transformations that are applied to tensors in the
computation graph. Operations may have attributes that may be supplied a
priori or are inferred at runtime. For example, an attribute may serve to
describe the expected types of input (adding tensors of type float32 versus
int32). Just as variables are named, operations may also be supplied with an
optional name attribute for easy reference into the computation graph.
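For instance (an illustrative sketch, not from this text), an element-wise addition can be given an explicit name and float32 inputs:

import tensorflow as tf

x = tf.constant([1.0, 2.0], dtype=tf.float32)
y = tf.constant([3.0, 4.0], dtype=tf.float32)
# "total" becomes the operation's name in the computation graph.
total = tf.add(x, y, name="total")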
Placeholder Tensors
A placeholder is a variable in TensorFlow to which data will be assigned
at a later time. It enables us to create processes or operations without
requiring the data up front. Data is fed into the placeholder when the session
starts and is run. We feed data into TensorFlow graphs using
placeholders.
Syntax: tf.compat.v1.placeholder(dtype, shape=None, name=None)
Parameters:
• dtype: the datatype of the elements in the tensor that will be fed.
• shape: by default None. The shape of the tensor that will be fed; it is an
optional parameter. A tensor of any shape can be fed if the shape isn’t
specified.
• name: by default None. The operation’s name; an optional parameter.
Returns:
A Tensor that can be used as a handle for feeding a value into the graph.
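The explanation that follows refers to a small example along these lines (a reconstruction consistent with those bullet points, not code reproduced from this text):

import tensorflow as tf

# Placeholders are a TF 1.x construct, so eager execution must be disabled.
tf.compat.v1.disable_eager_execution()

# A float32 placeholder; shape=None means we did not specify any size.
x = tf.compat.v1.placeholder(dtype=tf.float32, shape=None)

# The operation is created before any data is fed in: it adds 10 to the tensor.
y = x + 10

# A session is created and run, feeding data for the placeholder.
sess = tf.compat.v1.Session()
result = sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]})
print(result)  # [11. 12. 13.]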
Explanation:
• Eager execution is disabled, because placeholders raise an error when eager
mode is enabled.
• A placeholder is created using the tf.placeholder() method with
dtype tf.float32; shape=None says we didn’t specify any size.
• Operation is created before feeding in data.
• The operation adds 10 to the tensor.
• A session is created and started using tf.Session().
• Session.run takes the operation we created and data to be fed as
parameters and it returns the result.
Sessions in TensorFlow
A TensorFlow program interacts with a computation graph using a session.13
The TensorFlow session is responsible for building the initial graph, and can be
used to initialize all variables appropriately and to run the computational
graph. To explore each of these pieces, let’s consider the following simple
Python script:
import tensorflow as tf
from read_data import get_minibatch  # hypothetical helper returning a batch of data

# The graph itself (shapes chosen to match the 784-dimensional inputs used later):
x = tf.placeholder(tf.float32, name="x", shape=[None, 784])
W = tf.Variable(tf.random_uniform([784, 10], -1, 1), name="W")
b = tf.Variable(tf.zeros([10]), name="biases")
output = tf.matmul(x, W) + b

init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
feed_dict = {x: get_minibatch()}
sess.run(output, feed_dict=feed_dict)
The first four lines after the import statement describe the computational
graph that is built by the session when it is finally instantiated. The graph (sans
variable initialization operations) is depicted in Figure 3-2. We then initialize
the variables as required by using the session variable to run the initialization
operation in sess.run(init_op). Finally, we can run the subgraph by calling
sess.run again, but this time we pass in the tensors (or list of tensors) we want
to compute along with a feed_dict that fills the placeholders with the
necessary input data.
Finally, the sess.run interface can also be used to train networks. We will
explore this in further detail when we use TensorFlow to train our first
machine learning model on MNIST. But how exactly does a single line of code
(sess.run) accomplish such a wide variety of functions? The answer lies in the
powerful expressivity of the underlying computational graph. All of these
functionalities are represented as TensorFlow operations that can be passed as
arguments to sess.run. All sess.run needs to do is traverse down the
computational graph to identify all of the dependencies that compose the
relevant subgraph, ensure that all of the placeholder variables that belong to
the identified subgraph are filled using the feed_dict, and then traverse back
up the subgraph (executing all of the intermediate operations) to evaluate the
original arguments.
def my_network(input):
    W_1 = tf.Variable(tf.random_uniform([784, 100], -1, 1), name="W_1")
    b_1 = tf.Variable(tf.zeros([100]), name="biases_1")
    output_1 = tf.matmul(input, W_1) + b_1

    W_2 = tf.Variable(tf.random_uniform([100, 50], -1, 1), name="W_2")
    b_2 = tf.Variable(tf.zeros([50]), name="biases_2")
    output_2 = tf.matmul(output_1, W_2) + b_2

    W_3 = tf.Variable(tf.random_uniform([50, 10], -1, 1), name="W_3")
    b_3 = tf.Variable(tf.zeros([10]), name="biases_3")
    output_3 = tf.matmul(output_2, W_3) + b_3

    # printing names
    print("Printing names of weight parameters")
    print(W_1.name, W_2.name, W_3.name)
    print("Printing names of bias parameters")
    print(b_1.name, b_2.name, b_3.name)
    return output_3
This network setup consists of six variables describing three layers. As a result,
if we wanted to use this network multiple times, we’d prefer to encapsulate it
into a compact function like my_network, which we can call multiple
times. However, when we try to use this network on two different inputs, we
get something unexpected:
In [2]: my_network(i_1)
Printing names of weight parameters
W_1:0 W_2:0 W_3:0
Printing names of bias parameters
biases_1:0 biases_2:0 biases_3:0
Out[2]: <tensorflow.python.framework.ops.Tensor ...>
In [2]: my_network(i_2)
Printing names of weight parameters
W_1_1:0 W_2_1:0 W_3_1:0
Printing names of bias parameters
biases_1_1:0 biases_2_1:0 biases_3_1:0
Out[2]: <tensorflow.python.framework.ops.Tensor ...>
If we observe closely, our second call to my_network doesn’t use the same
variables as the first call (in fact, the names are different!). Instead, we’ve
created a second set of variables! In many cases, we don’t want to create a
copy, but rather reuse the model and its variables. It turns out, that in this
case, we shouldn’t be using tf.Variable. Instead, we should be using a more
advanced naming scheme that takes advantage of TensorFlow’s variable
scoping.
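The scoped version of my_network below relies on a layer helper built around tf.get_variable; a sketch of such a helper (a reconstruction, not reproduced in this text) might look like this:

def layer(input, weight_shape, bias_shape):
    # tf.get_variable looks the variable up in the current variable scope,
    # creating it only if it does not already exist.
    weight_init = tf.random_uniform_initializer(minval=-1, maxval=1)
    bias_init = tf.constant_initializer(value=0)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    return tf.matmul(input, W) + b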
def my_network(input):
with tf.variable_scope("layer_1"):
output_1 = layer(input, [784, 100], [100])
with tf.variable_scope("layer_2"):
output_2 = layer(output_1, [100, 50], [50])
with tf.variable_scope("layer_3"):
output_3 = layer(output_2, [50, 10], [10])
return output_3
Now let’s try to call my_network twice, just like we did in the preceding code
block:
In [2]: my_network(i_1)
Out[2]: <tensorflow.python.framework.ops.Tensor ...>
In [2]: my_network(i_2)
ValueError: Over-sharing: Variable layer_1/W already exists...
Unlike tf.Variable, the tf.get_variable command checks that a variable of the
given name hasn’t already been instantiated. By default, sharing is not allowed
(just to be safe!), but if we want to enable sharing within a variable scope, we
can say so explicitly:
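A sketch of what enabling sharing looks like (a reconstruction consistent with the error message above): we call my_network twice inside one scope and mark the variables as reusable between the calls.

with tf.variable_scope("shared_variables") as scope:
    i_1 = tf.placeholder(tf.float32, [1000, 784], name="i_1")
    my_network(i_1)
    # Explicitly allow the same variables to be reused for the second call.
    scope.reuse_variables()
    i_2 = tf.placeholder(tf.float32, [1000, 784], name="i_2")
    my_network(i_2)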
"/cpu:0"
The CPU of our machine.
"/gpu:0"
The first GPU of our machine, if it has one.
"/gpu:1"
The second GPU of our machine, if it has one.
When a TensorFlow operation has both CPU and GPU kernels, and GPU use is
enabled, TensorFlow will automatically opt to use the GPU implementation. To
inspect which devices are used by the computational graph, we can initialize
our TensorFlow session with the log_device_placement set to True:
sess = tf.Session(config=tf.ConfigProto(
log_device_placement=True))
If we desire to use a specific device, we may do so by using tf.device to
select the appropriate device. If the chosen device is not available, however,
an error will be thrown. If we would like TensorFlow to fall back to another available
device when the chosen device does not exist, we can pass
the allow_soft_placement flag to the session configuration as follows:
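For example (a sketch combining the two ConfigProto options mentioned above):

sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=True))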
# Parameters
learning_rate = 0.01
training_epochs = 1000
batch_size = 100
display_step = 1
with tf.Graph().as_default():
    # NOTE: x and y (the input/label placeholders), train_op, global_step, the
    # mnist dataset object, and the inference/loss/evaluate helpers are assumed
    # to be defined elsewhere; they are not reproduced in this text.
    output = inference(x)
    cost = loss(output, y)
    eval_op = evaluate(output, y)
    summary_op = tf.merge_all_summaries()
    saver = tf.train.Saver()
    sess = tf.Session()
    summary_writer = tf.train.SummaryWriter("logistic_logs/",
                                            graph_def=sess.graph_def)
    init_op = tf.initialize_all_variables()
    sess.run(init_op)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            mbatch_x, mbatch_y = mnist.train.next_batch(batch_size)
            # Fit training using batch data
            feed_dict = {x: mbatch_x, y: mbatch_y}
            sess.run(train_op, feed_dict=feed_dict)
            # Compute average loss
            minibatch_cost = sess.run(cost, feed_dict=feed_dict)
            avg_cost += minibatch_cost/total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            val_feed_dict = {
                x: mnist.validation.images,
                y: mnist.validation.labels
            }
            accuracy = sess.run(eval_op, feed_dict=val_feed_dict)
            summary_str = sess.run(summary_op, feed_dict=feed_dict)
            summary_writer.add_summary(summary_str, sess.run(global_step))
            saver.save(sess, "logistic_logs/model-checkpoint",
                       global_step=global_step)

    # Evaluate on the held-out test set once training is finished.
    test_feed_dict = {
        x: mnist.test.images,
        y: mnist.test.labels
    }
    accuracy = sess.run(eval_op, feed_dict=test_feed_dict)
tensorboard --logdir=<absolute_path_to_log_dir>
The logdir flag should be set to the directory where our
tf.train.SummaryWriter was configured to serialize our summary statistics. Be
sure to pass an absolute path (and not a relative path), because otherwise
TensorBoard may not be able to find our logs. If we successfully launch
TensorBoard, it should be serving our data at http://localhost:6006/, which we
can navigate to in our browser.
Building a Multilayer Model for MNIST
in TensorFlow
Using a logistic regression model, we were able to achieve an 8.1% error rate
on the MNIST dataset. This may seem impressive, but it isn’t particularly
useful for high-value practical applications. For example, if we were using our
system to read personal checks written out for 4-digit amounts ($1,000 to
$9,999), we would make errors on nearly 30% of checks! To create an MNIST
digit reader that’s more practical, let’s try to build a feed-forward network to
tackle the MNIST challenge.
We construct a feed-forward model with two hidden layers, each with 256
ReLU neurons, as shown in Figure 3-7.
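A sketch of the inference portion of such a network (assuming the layer helper sketched earlier and shapes of 784 inputs, two hidden layers of 256 ReLU neurons each, and 10 output classes):

def inference(x):
    with tf.variable_scope("hidden_1"):
        hidden_1 = tf.nn.relu(layer(x, [784, 256], [256]))
    with tf.variable_scope("hidden_2"):
        hidden_2 = tf.nn.relu(layer(hidden_1, [256, 256], [256]))
    with tf.variable_scope("output"):
        output = layer(hidden_2, [256, 10], [10])
    return output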
The MNIST dataset has 60,000 training images and 10,000 testing images.
In the perceptron model and the linear regression model, each data point was
defined by a simple x and y coordinate, so the input layer needed only two
nodes to represent a single data point.
In the MNIST dataset, a single data point comes in the form of an image. The images
in the MNIST dataset are 28x28 pixels: 28 pixels along the horizontal axis and 28 pixels
along the vertical axis. This means that a single image from the MNIST database has a
total of 784 pixels that must be analyzed. The input layer of our neural network
therefore has 784 nodes, one for each pixel of an image.
Here, we will see how to create a model that recognizes handwritten digits by
looking at each pixel in the image, then use TensorFlow to train the model by
showing it thousands of examples that are already labeled. We will then check the
model's accuracy with a test dataset.
Before we start, it is important to note that every data point has two parts: an
image (x) and a corresponding label (y) describing the actual image. Each image is
a 28x28 array, i.e., 784 numbers, and the label of the image is a number between 0 and 9
corresponding to the digit drawn in the image. To download and use the MNIST dataset,
use the following commands:
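The download commands themselves are not reproduced in this text; the classic TF 1.x tutorial loader (an assumption about which helper the author intended) looks like this:

from tensorflow.examples.tutorials.mnist import input_data

# Downloads the dataset on first use and loads it with one-hot labels.
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)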