NNDL Notes Unit 3

Unit 3

Deep learning is a subfield of machine learning that deals with algorithms inspired by the structure
and function of the brain. Deep learning is a subset of machine learning, which is a part of artificial
intelligence (AI).

Artificial intelligence is the ability of a machine to imitate intelligent human behavior. Machine
learning allows a system to learn and improve from experience automatically. Deep learning is an
application of machine learning that uses complex algorithms and deep neural networks to train a model.

Importance of Deep Learning

 Machine learning works only with sets of structured and semi-structured data, while deep learning
works with both structured and unstructured data
 Deep learning algorithms can perform complex operations efficiently, while machine learning
algorithms cannot
 Machine learning algorithms use labeled sample data to extract patterns, while deep learning
accepts large volumes of data as input and analyzes the input data to extract features out of an
object
 The performance of traditional machine learning algorithms plateaus as the amount of data grows; to keep improving model performance on very large datasets, we need deep learning

Applications of Deep Learning

1. Virtual Assistants

Virtual Assistants are cloud-based applications that understand natural language voice commands
and complete tasks for the user. Amazon Alexa, Cortana, Siri, and Google Assistant are typical
examples of virtual assistants. They need internet-connected devices to work with their full
capabilities. Each time a command is fed to the assistant, they tend to provide a better user
experience based on past experiences using Deep Learning algorithms.

2. Chatbots

Chatbots can solve customer problems in seconds. A chatbot is an AI application to chat online via
text or text-to-speech. It is capable of communicating and performing actions similar to a human.
Chatbots are used a lot in customer interaction, marketing on social network sites, and instant
messaging with clients. They deliver automated responses to user inputs and use machine learning
and deep learning algorithms to generate different types of responses.

The next important deep learning application is related to Healthcare.

3. Healthcare

Deep Learning has found its application in the Healthcare sector. Computer-aided disease detection
and computer-aided diagnosis have been possible using Deep Learning. It is widely used for medical
research, drug discovery, and diagnosis of life-threatening diseases such as cancer and diabetic
retinopathy through the process of medical imaging.

4. Entertainment
Companies such as Netflix, Amazon, YouTube, and Spotify give relevant movies, songs, and video
recommendations to enhance their customer experience. This is all thanks to Deep Learning. Based
on a person’s browsing history, interest, and behavior, online streaming companies give suggestions
to help them make product and service choices. Deep learning techniques are also used to add sound
to silent movies and generate subtitles automatically.

Next in the list of deep learning applications, we have News Aggregation and Fake News Detection.

5. News Aggregation and Fake News Detection

Deep Learning allows you to customize news depending on the readers’ persona. You can aggregate
and filter out news information as per social, geographical, and economic parameters and the
individual preferences of a reader. Neural Networks help develop classifiers that can detect fake and
biased news and remove it from your feed. They also warn you of possible privacy breaches.

6. Composing Music

A machine can learn the notes, structures, and patterns of music and start producing music
independently. Deep Learning-based generative models such as WaveNet can be used to develop raw
audio. Long Short-Term Memory (LSTM) networks help to generate music automatically. The Music21
Python toolkit is used for computer-aided musicology. It allows us to train a system to develop music by
teaching music theory fundamentals, generating music samples, and studying music.

Next in the list of deep learning applications, we have Image Coloring.

7. Image Coloring
Image colorization has seen significant advancements using Deep Learning. Image colorization is
taking an input of a grayscale image and then producing an output of a colorized image. ChromaGAN
is an example of a picture colorization model. A generative network is framed in an adversarial model
that learns to colorize by incorporating a perceptual and semantic understanding of both class
distributions and color.

8. Robotics

Deep Learning is heavily used for building robots to perform human-like tasks. Robots powered by
Deep Learning use real-time updates to sense obstacles in their path and pre-plan their journey
instantly. Such robots can be used to carry goods in hospitals, factories, and warehouses, and to assist
with inventory management and manufacturing.

9. Image Captioning

Image Captioning is the method of generating a textual description of an image. It uses computer
vision to understand the image's content and a language model to turn the understanding of the
image into words in the right order. A recurrent neural network such as an LSTM is used to turn the
labels into a coherent sentence. Microsoft has built its caption bot where you can upload an image or
the URL of any image, and it will display the textual description of the image. Another such
application that suggests a perfect caption and best hashtags for a picture is Caption AI.
10. Advertising

In advertising, Deep Learning allows optimizing a user's experience. It helps publishers
and advertisers to increase the relevance of their ads and boost the effectiveness of advertising campaigns.

11. Self Driving Cars

Deep Learning is the driving force behind the notion of self-driving automobiles that are autonomous.
Deep Learning technologies are actually "learning machines" that learn how to act and respond using
millions of data sets and training. To diversify its business infrastructure, Uber Artificial
Intelligence laboratories are powering additional autonomous cars and developing self-driving cars
for on-demand food delivery. Amazon, on the other hand, has delivered their merchandise using
drones in select areas of the globe.

The perplexing problem that most designers of self-driving vehicles are addressing is subjecting the
cars to a wide variety of scenarios to assure safe driving. The vehicles have operational sensors for
detecting nearby objects, and they manoeuvre through traffic using data from their cameras, sensors,
geo-mapping, and sophisticated models. Tesla is one popular example.

12. Natural Language Processing

Another important field where Deep Learning is showing promising results is NLP, or Natural
Language Processing. It is the procedure for allowing robots to study and comprehend human
language.

However, keep in mind that human language is excruciatingly difficult for robots to understand.
Machines struggle to correctly comprehend or generate human language not only because of the
alphabet and words, but also because of context, accents, handwriting, and other factors.
Many of the challenges associated with comprehending human language are being addressed by
Deep Learning-based NLP by teaching computers (Autoencoders and Distributed Representation) to
provide suitable responses to linguistic inputs.

13. Visual Recognition

Just assume you're going through your old memories or photographs. You may choose to print some
of these. In the absence of metadata, the only way to do this was through manual labour. The
most you could do was order them by date, but downloaded photographs occasionally lack that
metadata. Deep Learning, on the other hand, has made the job easier. Images may be sorted using it
based on places recognised in pictures, faces, a mix of individuals, events, dates, and so on. Searching
for a certain photo in a large library requires state-of-the-art visual recognition algorithms that work
at several levels, from basic to advanced.

14. Fraud Detection

Another attractive application for deep learning is fraud protection and detection; major companies
in the payment system sector are already experimenting with it. PayPal, for example, uses predictive
analytics technology to detect and prevent fraudulent activity. The business claimed that examining
sequences of user behavior using neural networks' long short-term memory architecture increased
anomaly identification by up to 10%. Sustainable fraud detection techniques are essential for every
fintech firm, banking app, or insurance platform, as well as any organization that gathers and uses
sensitive data. Deep learning has the ability to make fraud more predictable and hence avoidable.

15. Personalisations

Every platform is now attempting to leverage chatbots to create tailored experiences with a human
touch for its users. Deep Learning is assisting e-commerce behemoths such as Amazon, E-Bay, and
Alibaba in providing smooth tailored experiences such as product suggestions, customised packaging
and discounts, and spotting huge income potential during the holiday season. Even in newer markets,
companies explore demand by offering goods, offers, or plans that are more likely to appeal to
human psychology and contribute to growth in micro markets.
the increase, and dependable procedures are bringing services to the internet that were previously
only physically available.
16. Detecting Developmental Delay in Children

Early diagnosis of developmental impairments in children is critical since early intervention improves
children's prognoses. Meanwhile, a growing body of research suggests a link between developmental
impairment and motor competence, therefore motor skill is taken into account in the early diagnosis
of developmental disability. However, because of the lack of professionals and time restrictions,
testing motor skills in the diagnosis of the developmental problem is typically done through informal
questionnaires or surveys to parents. This is progressively becoming achievable with deep learning
technologies. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory and the
Institute of Health Professions at Massachusetts General Hospital have created a computer system
that can detect language and speech impairments even before kindergarten.

17. Colourisation of Black and White images

The technique of taking grayscale photos in the form of input and creating colourized images for
output that represent the semantic colours and tones of the input is known as image colourization.
Given the intricacy of the work, this technique was traditionally done by hand using human labour.
Today's Deep Learning technology, however, can colour the image by recognising the objects and
their context within the shot, much as a human operator would. To reproduce the picture with colour
added, large convolutional neural networks trained in a supervised fashion are used.

18. Adding Sounds to Silent Movies

In order to make a picture feel more genuine, sound effects that were not captured during
production are frequently added. This is referred to as "Foley." Deep learning was used by
researchers at the University of Texas to automate this procedure. They trained a neural network on
12 well-known film incidents in which filmmakers commonly used Foley effects.

19. Automatic Machine Translation

Deep learning has changed several disciplines in recent years. In response to these advancements,
the field of Machine Translation has switched to the use of deep-learning neural-based methods,
which have supplanted older approaches such as rule-based systems or statistical phrase-based
methods. Thanks to massive quantities of training data and unparalleled processing power, Neural MT
(NMT) models can now access all of the information available anywhere in the source sentence and
automatically learn which piece is important at which step of synthesising the output text. The
elimination of previous independence assumptions is the primary cause of the remarkable
improvement in translation quality. This has allowed neural translation to close much of the quality
gap between machine and human translation.
20. Automatic Handwriting Generation

This Deep Learning application includes the creation of a new set of handwriting for a given corpus of
a word or phrase. The handwriting is effectively presented as a series of coordinates utilised by a pen
to make the samples. The link between pen movement and letter formation is discovered, and
additional instances are developed.

21. Automatic Text Generation

A corpus of text is learned here, and fresh text is generated word by word or character by character.
Using deep learning algorithms, it is possible to learn how to spell, punctuate, and even capture the
style of the text in the corpus. Large recurrent neural networks are typically employed to learn text
generation from sequences of input strings. However, LSTM recurrent neural networks have lately
shown remarkable success in this challenge by employing a character-based model that generates
one character at a time.

22. Language Translations

Machine translation is receiving a lot of attention from technology businesses. This investment, along
with recent advances in deep learning, has resulted in significant increases in translation quality.
According to Google, transitioning to deep learning resulted in a 60% boost in translation accuracy
over the prior phrase-based strategy employed in Google Translate. Google and Microsoft can now
translate over 100 different languages with near-human accuracy in several of them.

23. Pixel Restoration

It was impossible to zoom into movies beyond their actual resolution until Deep Learning came along.
Researchers at Google Brain created a Deep Learning network in 2017 to take very low-quality photos
of faces and guess the person's face from them. Known as Pixel Recursive Super Resolution, this
approach uses pixels to achieve super resolution. It dramatically improves photo resolution,
highlighting salient features just enough for the person to be identified.

24. Demographic and Election Predictions

Gebru et al. used 50 million Google Street View pictures to see what a Deep Learning network could
learn from them. The network learned to localise and recognise cars, as well as their make, model,
body style, and year. Inspired by the success of these Deep Learning capabilities, the explorations did
not end there: the algorithm was shown to be capable of estimating the demographics of each
location based just on the automobile makeup.

25. Deep Dreaming

DeepDream is an experiment that visualises the patterns learned by a neural network. DeepDream, like a
toddler watching clouds and attempting to decipher random forms, over-interprets and intensifies
the patterns it finds in a picture.

It accomplishes this by sending an image across the network and then calculating the gradient of the
picture in relation to the activations of a certain layer. The image is then altered to amplify these
activations, improving the patterns perceived by the network and producing a dream-like visual. This
method was named "Inceptionism" (a reference to InceptionNet, and the movie Inception).

Today, artificial intelligence (AI) is a thriving field with many practical applications
and active research topics. We look to intelligent software to automate routine labor,
understand speech or images, make diagnoses in medicine and support basic scientific
research.
In the early days of artificial intelligence, the field rapidly tackled and solved
problems that are intellectually difficult for human beings but relatively straightforward
for computers—problems that can be described by a list of formal, mathematical
rules. The true challenge to artificial intelligence proved to be solving the
tasks that are easy for people to perform but hard for people to describe formally—
problems that we solve intuitively, that feel automatic, like recognizing spoken words or
faces in images.
The solution is to allow computers to learn from experience and understand the
world in terms of a hierarchy of concepts, with each concept defined in terms of its
relation to simpler concepts. By gathering knowledge from experience, this approach
avoids the need for human operators to formally specify all of the knowledge that the
computer needs. The hierarchy of concepts allows the computer to learn complicated
concepts by building them out of simpler ones. If we draw a graph showing how
these concepts are built on top of each other, the graph is deep, with many layers. For
this reason, we call this approach to AI deep learning.
Many of the early successes of AI took place in relatively sterile and formal
environments and did not require computers to have much knowledge about the
world. For example, IBM’s Deep Blue chess-playing system defeated world champion
Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing
only sixty-four locations and thirty-two pieces that can move in only rigidly
circumscribed ways. Devising a successful chess strategy is a tremendous
accomplishment, but the challenge is not due to the difficulty of describing the set of
chess pieces and allowable moves to the computer. Chess can be completely described
by a very brief list of completely formal rules, easily provided ahead of time by the
programmer.
Ironically, abstract and formal tasks that are among the most difficult mental
undertakings for a human being are among the easiest for a computer. Computers
have long been able to defeat even the best human chess player, but are only recently
matching some of the abilities of average human beings to recognize objects or speech. A
person’s everyday life requires an immense amount of knowledge about the world.
Much of this knowledge is subjective and intuitive, and therefore difficult to articulate
in a formal way. Computers need to capture this same knowledge in order to behave
in an intelligent way. One of the key challenges in artificial intelligence is how to get
this informal knowledge into a computer.
Several artificial intelligence projects have sought to hard-code knowledge about the
world in formal languages. A computer can reason about statements in these formal
languages automatically using logical inference rules. This is known as the knowledge
base approach to artificial intelligence.
The difficulties faced by systems relying on hard-coded knowledge suggest that AI
systems need the ability to acquire their own knowledge, by extracting patterns from
raw data. This capability is known as machine learning. The introduction
of machine learning allowed computers to tackle problems involving knowledge of the real
world and make decisions that appear subjective. A simple machine learning algorithm called
logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef et al., 1990).
A simple machine learning algorithm called naive Bayes can separate legitimate e-mail from spam e-mail.

The performance of these simple machine learning algorithms depends heavily on the
representation of the data they are given. For example, when logistic regression is used to
recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the
doctor tells the system several pieces of relevant information, such as the presence or absence of a
uterine scar. Each piece of information included in the representation of the patient is known as a
feature. Logistic regression learns how each of these features of the patient correlates with various
outcomes. However, it cannot influence the way that the features are defined in any way. If logistic
regression was given an MRI scan of the patient, rather than the doctor's formalized report, it would
not be able to make useful predictions. Individual pixels in an MRI scan have negligible correlation
with any complications that might occur during delivery.

This dependence on representations is a general phenomenon that appears throughout
computer science and even daily life. In computer science, operations such as searching a
collection of data can proceed exponentially faster if the collection is structured and
indexed intelligently. People can easily perform arithmetic on Arabic numerals, but find
arithmetic on Roman numerals much more time-consuming. It is not surprising that the
choice of representation has an enormous effect on the performance of machine learning
algorithms.

One solution to this problem is to use machine learning to discover not only the
mapping from representation to output but also the representation itself. This
approach is known as representation learning. Learned representations often result in
much better performance than can be obtained with hand-designed representations.
They also allow AI systems to rapidly adapt to new tasks, with minimal human
intervention. A representation learning algorithm can discover a good set of features
for a simple task in minutes, or a complex task in hours to months. Manually designing
features for a complex task requires a great deal of human time and effort; it can take
decades for an entire community of researchers.

The quintessential example of a representation learning algorithm is the autoencoder.
An autoencoder is the combination of an encoder function that converts the
input data into a different representation, and a decoder function that converts the
new representation back into the original format. Autoencoders are trained to
preserve as much information as possible when an input is run through the encoder and
then the decoder, but are also trained to make the new representation have various
nice properties. Different kinds of autoencoders aim to achieve different kinds of
properties.

When analyzing a speech recording, the factors of variation include the speaker’s age,
their sex, their accent and the words that they are speaking. When analyzing an image
of a car, the factors of variation include the position of the car, its color, and the angle
and brightness of the sun.
A major source of difficulty in many real-world artificial intelligence applications is that many of
the factors of variation influence every single piece of data we are able to observe. The
individual pixels in an image of a red car might be very close to black at night

Deep learning solves this central problem in representation learning by introducing
representations that are expressed in terms of other, simpler representations. Deep
learning allows the computer to build complex concepts out of simpler concepts. Fig.
1.2 shows how a deep learning system can represent the concept of an image of a
person by combining simpler concepts, such as corners and contours, which are in turn
defined in terms of edges.
The quintessential example of a deep learning model is the feedforward deep
network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical
function mapping some set of input values to output values. The function is
formed by composing many simpler functions. We can think of each application of a
different mathematical function as providing a new representation of the input.
The idea of learning the right representation for the data provides one perspective
on deep learning. Another perspective on deep learning is that depth allows the computer
to learn a multi-step computer program. Each layer of the representation can be thought
of as the state of the computer’s memory after executing another set of instructions in
parallel. Networks with greater depth can execute more instructions in sequence.
Sequential instructions offer great power because later instructions can refer back to
the results of earlier instructions. According to this
Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to understand
the meaning of raw sensory input data, such as this image represented as a
collection of pixel values. The function mapping from a set of pixels to an object identity is
very complicated. Learning or evaluating this mapping seems insurmountable if tackled
directly. Deep learning resolves this difficulty by breaking the desired complicated mapping
into a series of nested simple mappings, each described by a different layer of the model.
The input is presented at the visible layer, so named because it contains the variables that
we are able to observe. Then a series of hidden layers extracts increasingly abstract
features from the image. These layers are called “hidden” because their values are not
given in the data; instead, the model must determine which concepts are useful for
explaining the relationships in the observed data. The images here are visualizations
of the kind of feature represented by each hidden unit. Given the pixels, the first layer can
easily identify edges, by comparing the brightness of neighboring pixels. Given the first
hidden layer’s description of the edges, the second hidden layer can easily search for
corners and extended contours, which are recognizable as collections of edges. Given the
second hidden layer’s description of the image in terms of corners and contours, the
third hidden layer can detect entire parts of specific objects, by finding specific collections
of contours and corners. Finally, this description of the image in terms of the object parts
it contains can be used to recognize the objects present in the image.

Figure 1.3: Illustration of computational graphs mapping an input to an output where


each node performs an operation. Depth is the length of the longest path from input to
output but depends on the definition of what constitutes a possible computational
step. The computation depicted in these graphs is the output of a logistic regression
model, σ(wᵀx), where σ is the logistic sigmoid function. If we use addition, multiplication
and logistic sigmoids as the elements of our computer language, then this model has
depth three. If we view logistic regression as an element itself, then this model has
depth one.
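
As a rough illustration of the two views of depth described in this figure, the following Python sketch (with made-up weights) evaluates the same logistic regression model σ(wᵀx) once as a composition of elementary operations (multiplication, addition and the sigmoid), giving depth three, and once as a single logistic-regression operation, giving depth one.

import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logistic_regression_elementary(w, x):
    # Viewed in the "elementary operation" language: three steps, depth three.
    products = [wi * xi for wi, xi in zip(w, x)]   # step 1: multiplications
    total = sum(products)                          # step 2: additions
    return sigmoid(total)                          # step 3: logistic sigmoid

def logistic_regression_single_op(w, x):
    # Viewed as one allowable "logistic regression" operation: depth one.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

w, x = [0.5, -1.0], [2.0, 1.0]   # made-up weights and input
print(logistic_regression_elementary(w, x))   # same value either way
print(logistic_regression_single_op(w, x))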

view of deep learning, not all of the information in a layer’s activations necessarily encodes
factors of variation that explain the input. The representation also stores state
information that helps to execute a program that can make sense of the input. This state
information could be analogous to a counter or pointer in a traditional computer
program. It has nothing to do with the content of the input specifically, but it helps the
model to organize its processing.
There are two main ways of measuring the depth of a model. The first view is
based on the number of sequential instructions that must be executed to evaluate the
architecture. We can think of this as the length of the longest path through a flow
chart that describes how to compute each of the model’s outputs given its inputs.
Just as two equivalent computer programs will have different lengths depending on
which language the program is written in, the same function may be drawn as a
flowchart with different depths depending on which functions we allow to be used as
individual steps in the flowchart. Fig. 1.3 illustrates how this choice of language can
give two different measurements for the same architecture.
Another approach, used by deep probabilistic models, regards the depth of a
model as being not the depth of the computational graph but the depth of the
graph describing how concepts are related to each other. In this case, the depth of
the flowchart of the computations needed to compute the representation of

each concept may be much deeper than the graph of the concepts themselves.
This is because the system’s understanding of the simpler concepts can be refined given
information about the more complex concepts. For example, an AI system observing an
image of a face with one eye in shadow may initially only see one eye.After detecting that
a face is present, it can then infer that a second eye is probably present as well. In this
case, the graph of concepts only includes two layers—a layer for eyes and a layer for
faces—but the graph of computations includes 2n layers if we refine our estimate of
each concept given the other n times.
Because it is not always clear which of these two views—the depth of the
computational graph, or the depth of the probabilistic modeling graph—is most
relevant, and because different people choose different sets of smallest elements
from which to construct their graphs, there is no single correct value for the
depth of an architecture, just as there is no single correct value for the length of a
computer program. Nor is there a consensus about how much depth a model requires
to qualify as “deep.” However, deep learning can safely be regarded as the study of
models that either involve a greater amount of composition of learned functions or
learned concepts than traditional machine learning does.

Many artificial intelligence tasks can be solved by designing the right set of
features to extract for that task, then providing these features to a simple machine
learning algorithm. For example, a useful feature for speaker identification from
sound is an estimate of the size of speaker’s vocal tract. It therefore gives a strong clue
as to whether the speaker is a man, woman, or child.
However, for many tasks, it is difficult to know what features should be extracted. For
example, suppose that we would like to write a program to detect cars in photographs.
We know that cars have wheels, so we might like to use the presence of a wheel as a
feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms
of pixel values. A wheel has a simple geometric shape but its image may be complicated
by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the
fender of the car or an object in the foreground obscuring part of the wheel, and so
on.

Historical Trends in Deep Learning

It is easiest to understand deep learning with some historical context. Rather than providing
a detailed history of deep learning, we identify a few key trends:

• Deep learning has had a long and rich history, but has gone by many names
reflecting different philosophical viewpoints, and has waxed and waned in
popularity.

• Deep learning has become more useful as the amount of available training data
has increased.

• Deep learning models have grown in size over time as computer hardware and
software infrastructure for deep learning has improved.

Deep learning has solved increasingly complicated applications with increasing accuracy over time

The Many Names and Changing Fortunes of Neural Networks

1 Deep learning dates back to the 1940s. Deep learning only appears to be
new, because it was relatively unpopular for several years preceding its
current popularity, and because it has gone through many different names,
and has only recently become called “deep learning.” The field has been
rebranded many times, reflecting the influence of different researchers and
different perspectives.
2 However, some basic context is useful for understanding deep learning. Broadly
speaking, there have been three waves of development of deep learning: deep
learn- ing known as cybernetics in the 1940s–1960s, deep learning known as
connectionism in the 1980s–1990s, and the current resurgence under the name
deep learning beginning in 2006. This is quantitatively illustrated in Fig. 1.7.
3 Some of the earliest learning algorithms we recognize today were intended
to be computational models of biological learning, i.e. models of how learning
happens or could happen in the brain. As a result, one of the names that deep
learning has gone by is artificial neural networks (ANNs). While neural networks
have sometimes been used to understand brain function (Hinton and Shallice,
1991), they are generally not designed to be realistic models of biological
function.
4 The neural perspective on deep learning is motivated by two main
ideas. One idea is that the brain provides a proof by example that intelligent
behavior is possible, and a conceptually straightforward path to building
intelligence is to reverse engineer the computational principles behind the
brain and duplicate its functionality. Another perspective is that it would be
deeply interesting to understand the brain and the principles that underlie
human intelligence, so machine learning models that shed light on these basic
scientific questions are useful apart from their ability to solve engineering
applications.
5 The modern term “deep learning” goes beyond the neuroscientific
perspective on the current breed of machine learning models. It appeals to a
more general principle of learning multiple levels of composition, which can be
applied in machine learning frameworks that are not necessarily neurally
inspired.

Figure 1.7: The figure shows two of the three historical waves of artificial neural nets
research, as measured by the frequency of the phrases “cybernetics” and “connectionism”
or “neural networks” according to Google Books (the third wave is too recent to appear).
The first wave started with cybernetics in the 1940s–1960s, with the development of
theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and
implementations of the first models such as the perceptron (Rosenblatt, 1958) allowing the
training of a single neuron. The second wave started with the connectionist approach of the
1980–1995 period, with back-propagation (Rumelhart et al., 1986a) to train a neural network with
one or two hidden layers.
The earliest predecessors of modern deep learning were simple linear models
motivated from a neuroscientific perspective. These models were designed to take a
set of n input values x1, . . . , xn and associate them with an output y. These models
would learn a set of weights w1, . . . , wn and compute their output f(x, w) = x1w1 + · · · +
xnwn . This first wave of neural networks research was known as cybernetics, as
illustrated in Fig. 1.7.
The McCulloch-Pitts Neuron (McCulloch and Pitts, 1943) was an early model of brain
function. This linear model could recognize two different categories of inputs by testing
whether f (x, w) is positive or negative. Of course, for the model to correspond to the
desired definition of the categories, the weights needed to be set correctly. These
weights could be set by the human operator. In the 1950s, the perceptron
(Rosenblatt, 1958, 1962) became the first model that could learn the weights defining
the categories given examples of inputs from each category. The adaptive linear
element (ADALINE), which dates from about the same time, simply returned the value
of f (x) itself to predict a real number (Widrow and Hoff, 1960), and could also learn to
predict these numbers from data.
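
As a rough illustration, the following Python sketch implements such a linear unit with hand-set weights (the weights and inputs are made up): it computes f(x, w) = x1w1 + ... + xnwn and classifies an input into one of two categories by testing whether the result is positive or negative, in the spirit of the McCulloch-Pitts model.

def f(x, w):
    # f(x, w) = x1*w1 + ... + xn*wn
    return sum(xi * wi for xi, wi in zip(x, w))

def classify(x, w):
    # Recognize two categories by testing whether f(x, w) is positive or negative.
    return 1 if f(x, w) > 0 else -1

w = [1.0, -2.0]                  # weights set by hand, as by a human operator
print(classify([3.0, 1.0], w))   # prints  1
print(classify([1.0, 2.0], w))   # prints -1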

It is worth noting that the effort to understand how the brain works on an
algorithmic level is alive and well. This endeavor is primarily known as “computational
neuroscience” and is a separate field of study from deep learning. It is common for
researchers to move back and forth between both fields. The field of deep learning is
primarily concerned with how to build computer systems that are able to successfully
solve tasks requiring intelligence, while the field of computational neuroscience is
primarily concerned with building more accurate models of how the brain actually
works.
In the 1980s, the second wave of neural network research emerged in great part via a
movement called connectionism or parallel distributed processing (Rumelhart
et al., 1986c; McClelland et al., 1995). Connectionism arose in the context of cognitive
science, in which symbolic models of reasoning were difficult to explain in terms of how the brain could actually
implement them using neurons. The connectionists began to study models of cognition
that could actually be grounded in neural implementations (Touretzky and Minton, 1985),
reviving many ideas dating back to the work of psychologist Donald Hebb in the 1940s
(Hebb, 1949).
The central idea in connectionism is that a large number of simple computational units
can achieve intelligent behavior when networked together. This insight applies equally
to neurons in biological nervous systems and to hidden units in computational
models.
Several key concepts arose during the connectionism movement of the 1980s that
remain central to today’s deep learning.
One of these concepts is that of distributed representation (Hinton et al., 1986). This is
the idea that each input to a system should be represented by many features, and each
feature should be involved in the representation of many possible inputs.
For example, suppose we have a vision system that can recognize cars, trucks, and
birds and these objects can each be red, green, or blue. One way of representing these
inputs would be to have a separate neuron or hidden unit that activates for each of the
nine possible combinations: red truck, red car, red bird, green truck, and so on. This
requires nine different neurons, and each neuron must independently learn the
concept of color and object identity. One way to improve on this situation is to use a
distributed representation, with three neurons describing the color and three neurons
describing the object identity. This requires only six neurons total instead of nine, and
the neuron describing redness is able to learn about redness from images of cars,
trucks and birds, not only from images of one specific category of objects.
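
The following Python sketch illustrates this counting argument with the same toy example: a local code needs one unit for each of the nine colour-object combinations, while a distributed code needs only three colour units plus three object units.

from itertools import product

colors = ["red", "green", "blue"]
objects = ["car", "truck", "bird"]

# Local code: one unit per combination -> 9 units in total.
local_units = [f"{c}_{o}" for c, o in product(colors, objects)]
print(len(local_units))  # 9

# Distributed code: 3 colour units + 3 object units -> 6 units in total.
def distributed_code(color, obj):
    color_part = [1 if c == color else 0 for c in colors]
    object_part = [1 if o == obj else 0 for o in objects]
    return color_part + object_part

print(distributed_code("red", "truck"))  # [1, 0, 0, 0, 1, 0]
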
The third wave of neural networks research, which began around 2006, is still ongoing,
though the focus of deep learning research has changed dramatically within the time of
this wave. The third wave began with a focus on new unsupervised learning techniques
and the ability of deep models to generalize well from small datasets, but today there is
more interest in much older supervised learning algorithms and the ability of deep
models to leverage large labeled datasets.
Deep Learning have been three waves of development: The first wave started with cybernetics in the
1940s-1960s, with the development of theories of biological learning and implementations of the first
models such as the perceptron allowing the training of a single neuron. The second wave started with
the connectionist approach of the 1980-1995 period, with back-propagation to train a neural network
with one or two hidden layers. The current and third wave, deep learning, started around 2006.

Increasing Dataset Sizes

One may wonder why deep learning has only recently become recognized as a crucial
technology though the first experiments with artificial neural networks were conducted in
the 1950s. Deep learning has been successfully used in commercial applications since
the 1990s, but was often regarded as being more of an art than a technology and
something that only an expert could use, until recently. It is true that some skill is
required to get good performance from a deep learning algorithm. Fortunately, the
amount of skill required reduces as the amount of training data increases. The learning
algorithms reaching human performance on complex tasks today are nearly identical to
the learning algorithms that struggled to solve toy problems in the 1980s, though the
models we train with these algorithms have undergone changes that simplify the
training of very deep architectures. The most important new development is that
today we can provide these algorithms with the resources they need to succeed.
The age of “Big Data” has made machine learning much easier because the key
burden of statistical estimation—generalizing well to new data after observing
only a small amount of data—has been considerably lightened. As of 2016, a
rough rule of thumb is that a supervised deep learning algorithm will generally
achieve acceptable performance with around 5,000 labeled examples per
category, and will match or exceed human performance when trained with a
dataset containing at least 10 million labeled examples. Working successfully
with datasets smaller than this is an important research area, focusing in
particular on how we can take advantage of large quantities of unlabeled
examples, with unsupervised or semi-supervised learning.

Figure 1.8: Dataset sizes have increased greatly over time. In the early 1900s, statisticians
studied datasets using hundreds or thousands of manually compiled measurements (Garson,
1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the pioneers
of biologically inspired machine learning often worked with small, synthetic datasets, such as
low-resolution bitmaps of letters, that were designed to incur low computational cost and
demonstrate that neural networks were able to learn specific kinds of functions (Widrow and
Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and 1990s, machine learning became more
statistical in nature and began to leverage larger datasets containing tens of thousands of
examples such as the MNIST dataset (shown in Fig. 1.9) of scans of handwritten numbers
(LeCun et al., 1998b). In the first decade of the 2000s, more sophisticated datasets of this
same size, such as the CIFAR-10 dataset (Krizhevsky and Hinton, 2009), continued to be produced.
Toward the end of that decade and into the 2010s, significantly larger datasets containing hundreds
of thousands to tens of millions of examples completely changed what was possible with deep
learning. These datasets included the public Street
View House Numbers dataset (Netzer et al., 2011), various versions of the ImageNet dataset
(Deng et al., 2009, 2010a; Russakovsky et al., 2014a), and the Sports-1M dataset (Karpathy et
al., 2014). At the top of the graph, we see that datasets of translated sentences, such as IBM’s
dataset constructed from the Canadian Hansard (Brown et al., 1990) and the WMT 2014
English to French dataset (Schwenk, 2014) are typically far ahead of other dataset sizes.

Feed Forward Network

In its most basic form, a Feed-Forward Neural Network is a single-layer perceptron. A sequence
of inputs enter the layer and are multiplied by the weights in this model. The weighted input
values are then summed together to form a total. If the sum of the values is more than a
predetermined threshold, which is normally set at zero, the output value is usually 1, and if the
sum is less than the threshold, the output value is usually -1.

The single-layer perceptron is a popular feed-forward neural network model that is frequently
used for classification. Machine learning can also be incorporated into single-layer perceptrons
using a property known as the delta rule, which allows the network to alter its weights
through training to create more accurate output values. This training and learning
procedure results in gradient descent. The technique of updating weights in multi-layered
perceptrons is virtually the same; however, the process is referred to as back-propagation.

Feed forward neural networks are artificial neural networks in which nodes do not form loops.
This type of neural network is also known as a multi-layer neural network as all information is
only passed forward.

During data flow, input nodes receive data, which travels through the hidden layers and exits through
the output nodes. No connections exist in the network that send information back from the output
nodes.

A feed forward neural network approximates functions in the following way:

 An algorithm calculates classifiers by using the formula y = f* (x).


 Input x is therefore assigned to category y.
 According to the feed forward model, y = f (x; θ). This value determines the closest
approximation of the function.
When simplified, a feed forward neural network can appear as a single-layer perceptron.

This model multiplies inputs with weights as they enter the layer. Afterward, the weighted
input values get added together to get the sum. As long as the sum of the values rises above a
certain threshold, set at zero, the output value is usually 1, while if it falls below the threshold,
it is usually -1.

As a feed forward neural network model, the single-layer perceptron often gets used for
classification. Machine learning can also get integrated into single-layer perceptrons. Through
training, neural networks can adjust their weights based on a property called the delta rule,
which helps them compare their outputs with the intended values.

As a result of training and learning, gradient descent occurs. Multi-layered perceptrons update their
weights in a similar way, but this process is known as back-propagation. In that case, the network's
hidden layers are adjusted according to the output values produced by the final layer.

Deep feedforward networks, also often called feedforward neural networks, or multi-
layer perceptrons (MLPs), are the quintessential deep learning models. The goal
of a feedforward network is to approximate some function f*. For example, for
a classifier, y = f*(x) maps an input x to a category y. A feedforward network

defines a mapping y = f (x; θ) and learns the value of the parameters θ that result
in the best function approximation.

These models are called feedforward because information flows through the function
being evaluated from x, through the intermediate computations used to define f , and
finally to the output y. There are no feedback connections in which outputs of the model
are fed back into itself. When feedforward neural networks are extended to include
feedback connections, they are called recurrent neural networks
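
As a minimal sketch of such a mapping y = f(x; θ), the following Python code (using NumPy, with made-up layer sizes and random parameters) passes an input forward through one hidden layer to the output, with no feedback connections.

import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(0.0, a)

def feedforward(x, params):
    W1, b1, W2, b2 = params
    h = relu(W1 @ x + b1)      # hidden representation (intermediate computation)
    y = W2 @ h + b2            # output layer
    return y

# Illustrative parameters for a 3-input, 4-hidden-unit, 2-output network.
params = (rng.normal(size=(4, 3)), np.zeros(4),   # W1, b1
          rng.normal(size=(2, 4)), np.zeros(2))   # W2, b2

x = np.array([0.5, -1.0, 2.0])
print(feedforward(x, params))   # a 2-dimensional output vector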

Layers of feed forward neural network

 input layer:

The neurons of this layer receive input and pass it on to the other layers of the network.
Feature or attribute numbers in the dataset must match the number of neurons in the
input layer.

 Output layer:

This layer produces the final prediction of the network. The number of neurons it contains
depends on the task, for example one neuron per class in a classification problem.

 Hidden layer:

Input and output layers get separated by hidden layers. Depending on the type of
model, there may be several hidden layers.

There are several neurons in hidden layers that transform the input before actually
transferring it to the next layer. This network gets constantly updated with weights in
order to make it easier to predict.

 Neuron weights:

Neurons get connected by a weight, which measures their strength or magnitude.


Input weights can be compared to the coefficients used in linear regression.

A weight normally takes a value between 0 and 1.

 Neurons:

Feed forward networks are built from artificial neurons, which are loosely adapted from
biological neurons. A neural network consists of artificial neurons.

Neurons function in two ways: first, they compute the weighted sum of their inputs, and second,
they pass that sum through an activation function.

Activation functions can either be linear or nonlinear. Neurons have weights based on
their inputs. During the learning phase, the network studies these weights.

 Activation Function:

The activation function is where each neuron makes its decision.

Depending on the activation function, a neuron's decision can be linear or nonlinear. Because the
signal passes through many layers, the activation function also keeps neuron outputs bounded and
prevents them from growing uncontrollably as they cascade through the network.

Activation functions fall into three major categories: sigmoid, tanh, and the Rectified Linear
Unit (ReLU).
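
The following Python sketch defines these three activation functions so their behaviour can be compared on a few sample inputs.

import math

def sigmoid(a):
    # Maps any real input to an output between 0 and 1.
    return 1.0 / (1.0 + math.exp(-a))

def tanh(a):
    # Maps any real input to an output between -1 and 1.
    return math.tanh(a)

def relu(a):
    # Passes positive inputs through unchanged and clips negatives to 0.
    return max(0.0, a)

for a in (-2.0, 0.0, 2.0):
    print(a, sigmoid(a), tanh(a), relu(a))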

 Sigmoid (most generally used)

Input values are mapped to output values between 0 and 1.

To better understand how feedforward neural networks function, let's solve a simple problem: predicting whether or not it is
raining, given three inputs.
 x2 - temperature
 x3 - month
Let’s assume the threshold value to be 20, and if the output is higher than 20 then it will be raining, otherwise it’s a sunny
day. Given a data tuple with inputs (x1, x2, x3) as (0, 12, 11), initial weights of the feedforward network (w1, w2, w3) as (0.1,
1, 1) and biases as (1, 0, 0).

Here’s how the neural network computes the data in three simple steps:

1. Multiplication of weights and inputs: The input is multiplied by the assigned weight values, which this case would be
the following:

(x1* w1) = (0 * 0.1) = 0

(x2* w2) = (12 * 1) = 12

(x3* w3) = (11 * 1) = 11

2. Adding the biases: In the next step, the product found in the previous step is added to their respective biases. The
modified inputs are then summed up to a single value.

(x1* w1) + b1 = 0 + 1

(x2* w2) + b2 = 12 + 0

(x3* w3) + b3 = 11 + 0

weighted_sum = (x1* w1) + b1 + (x2* w2) + b2 + (x3* w3) + b3 = 24

3. Activation: An activation function is the mapping of summed weighted input to the output of the neuron. It is called an
activation/transfer function because it governs the inception at which the neuron is activated and the strength of the output
signal.

4. Output signal: Finally, the weighted sum obtained is turned into an output signal by feeding the weighted sum into an
activation function (also called transfer function). Since the weighted sum in our example is greater than 20, the perceptron
predicts it to be a rainy day.
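
The same computation can be written as a short Python sketch, using the inputs, weights, biases and threshold from the example above.

inputs = [0, 12, 11]        # x1, x2 (temperature), x3 (month)
weights = [0.1, 1, 1]       # w1, w2, w3
biases = [1, 0, 0]          # b1, b2, b3
threshold = 20

# Steps 1 and 2: multiply inputs by weights, add biases, and sum everything up.
weighted_sum = sum(x * w + b for x, w, b in zip(inputs, weights, biases))
print(weighted_sum)         # 24

# Steps 3 and 4: a simple threshold activation turns the sum into the output signal.
prediction = "rainy" if weighted_sum > threshold else "sunny"
print(prediction)           # "rainy", since 24 > 20
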
In simple terms, a loss function quantifies how “good” or “bad” a given model
is in classifying the input data. In most learning networks, the loss is
calculated as the difference between the actual output and the predicted
output.

Mathematically:

loss = y_{predicted} - y_{original}

The function that is used to compute this error is known as loss function J(.).
Different loss functions will return different errors for the same prediction,
having a considerable effect on the performance of the model.
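
As a small illustration, the Python sketch below defines the simple difference loss shown above, together with the squared error and mean squared error that are commonly used in practice because they are non-negative and penalize large errors more heavily.

def simple_difference(y_predicted, y_original):
    return y_predicted - y_original

def squared_error(y_predicted, y_original):
    return (y_predicted - y_original) ** 2

def mean_squared_error(predictions, targets):
    # Average squared error over a batch of predictions.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

print(simple_difference(0.8, 1.0))             # -0.2
print(squared_error(0.8, 1.0))                 # 0.04
print(mean_squared_error([0.8, 0.3], [1, 0]))  # 0.065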

Gradient Descent

Gradient descent is the most popular optimization technique for feedforward


neural networks. The term “gradient” refers to the quantity change of output
obtained from a neural network when the inputs change a little. Technically, it
measures the updated weights concerning the change in error. The gradient
can also be defined as the slope of a function. The higher the angle, the steeper
the slope and the faster a model can learn.

To understand the delta training rule, consider the task of training an unthresholded perceptron,
that is, a linear unit for which the output o is given by

o = w0 + w1x1 + w2x2 + ... + wnxn

The training error over the training set is defined as

E(w) = (1/2) Σd∈D (td − od)²

Where,
D is the set of training examples,
td is the target output for training example d,
od is the output of the linear unit for training example d.
E(w) is simply half the squared difference between the target output td and the
linear unit output od, summed over all training examples.
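
The following Python sketch evaluates this error E(w) for a linear unit on a small made-up training set D of (input, target) pairs.

def linear_output(w, x):
    # o = w0 + w1*x1 + ... + wn*xn
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def training_error(w, D):
    # E(w) = 1/2 * sum over (x, t) in D of (t - o)^2
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in D)

D = [([1.0, 2.0], 3.0), ([0.0, 1.0], 1.0)]   # made-up (input vector, target) pairs
w = [0.5, 1.0, 0.5]                          # w0, w1, w2
print(training_error(w, D))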

Gradient Descent Error Estimation in 3 dimensional plane


A gradient simply measures the change in all weights with regard to the change in error. You can
also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope
and the faster a model can learn. But if the slope is zero, the model stops learning.

Visualizing the Hypothesis Space

To understand the gradient descent algorithm, it is helpful to visualize the entire hypothesis space
of possible weight vectors and their associated E values, as shown in the figure below.
Here the axes w0 and w1 represent possible values for the two weights of a simple linear unit.
The w0, w1 plane therefore represents the entire hypothesis space.
The vertical axis indicates the error E relative to some fixed set of training examples.
The arrow shows the negated gradient at one particular point, indicating the direction in the
w0, w1 plane producing steepest descent along the error surface.
The error surface shown in the figure thus summarizes the desirability of every weight vector
in the hypothesis space.
Given the way in which we chose to define E, for linear units this error surface must always be
parabolic with a single global minimum.

Gradient descent search determines a weight vector that minimizes E by starting with an
arbitrary initial weight vector, then repeatedly modifying it in small steps.
At each step, the weight vector is altered in the direction that produces the steepest descent
along the error surface depicted in above figure. This process continues until the global
minimum error is reached.

Types of Gradient Descent


There are three popular types of gradient descent that mainly differ in the amount of data they
use:
BATCH GRADIENT DESCENT
Batch gradient descent, also called vanilla gradient descent, calculates the error for each example
within the training dataset, but only after all training examples have been evaluated does the
model get updated. This whole process is like a cycle and it's called a training epoch.
Some advantages of batch gradient descent are that it is computationally efficient and that it
produces a stable error gradient and stable convergence. One disadvantage is that the stable error
gradient can sometimes result in a state of convergence that isn’t the best the model can achieve.
It also requires the entire training dataset to be in memory and available to the algorithm.
STOCHASTIC GRADIENT DESCENT
By contrast, stochastic gradient descent (SGD) calculates the error and updates the parameters for
each training example within the dataset, one example at a time. These frequent updates can make
SGD faster than batch gradient descent on some problems, but they are also more computationally
expensive. Additionally, the frequency of those updates can result in noisy gradients, which may
cause the error rate to jump around instead of slowly decreasing.
MINI-BATCH GRADIENT DESCENT
Mini-batch gradient descent is the go-to method since it’s a combination of the concepts of SGD
and batch gradient descent. It simply splits the training dataset into small batches and performs
an update for each of those batches.
This creates a balance between the robustness of stochastic gradient descent and the efficiency of
batch gradient descent.
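
As a rough sketch of how these three variants differ, the Python code below trains a single linear unit with squared error on a made-up dataset; setting the batch size to the full dataset gives batch gradient descent, setting it to 1 gives stochastic gradient descent, and anything in between gives mini-batch gradient descent.

import random

data = [(x, 2 * x + 1) for x in range(10)]   # made-up targets following y = 2x + 1

def train(data, batch_size, lr=0.01, epochs=200):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Average gradient of the squared error over the current batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

print(train(list(data), batch_size=len(data)))  # batch gradient descent
print(train(list(data), batch_size=1))          # stochastic gradient descent
print(train(list(data), batch_size=4))          # mini-batch gradient descent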

Issues in Gradient Descent Algorithm

 Can veer off in the wrong direction due to frequent updates.


 Frequent updates are computationally expensive because resources are used to process one
training sample at a time.
Backpropagation

The predicted value of the network is compared to the expected output, and an
error is calculated using a function. This error is then propagated back within
the whole network, one layer at a time, and the weights are updated according
to the value that they contributed to the error. This clever bit of math is called
a backpropagation algorithm. The process is repeated for all of the examples
in the training data. One round of updating the network for the entire training
dataset is called an epoch. A network may be trained for tens, hundreds or
many thousands of epochs.

How does back propagation work?

Let us take a look at how back propagation works. It has four layers: input layer, hidden layer,
hidden layer II and final output layer.

So, the main three layers are:

1. Input layer
2. Hidden layer
3. Output layer

This image summarizes the functioning of the backpropagation approach.

1. Input layer receives x


2. Input is modeled using weights w
3. Each hidden layer calculates the output and data is ready at the output layer
4. Difference between actual output and desired output is known as the error
5. Go back to the hidden layers and adjust the weights so that this error is reduced in
future runs
This process is repeated till we get the desired output. The training phase is done with
supervision. Once the model is stable, it is used in production.
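
The following Python sketch (using NumPy) walks through these five steps for a tiny network with one hidden layer and sigmoid activations; the input, target and layer sizes are made up for illustration.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))   # illustrative weights
x, target = np.array([0.5, -1.0]), np.array([1.0])          # made-up example

for step in range(1000):
    # Steps 1-3: forward pass through the hidden layer to the output layer.
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)

    # Step 4: the difference between actual and desired output is the error.
    error = y - target

    # Step 5: propagate the error back and adjust the weights to reduce it.
    delta_out = error * y * (1 - y)                 # output-layer error signal
    delta_hidden = (W2.T @ delta_out) * h * (1 - h) # hidden-layer error signal
    W2 -= 0.5 * np.outer(delta_out, h)
    W1 -= 0.5 * np.outer(delta_hidden, x)

print(sigmoid(W2 @ sigmoid(W1 @ x)))   # close to the target after training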

Why do we need back propagation?


Back propagation has many advantages, some of the important ones are listed below-

 Back propagation is fast, simple and easy to implement


 There are no parameters to be tuned
 Prior knowledge about the network is not needed thus becoming a flexible method
 This approach works very well in most cases

Disadvantages of using Backpropagation

 The actual performance of backpropagation on a specific problem is dependent on the


input data.
Backpropagation is implemented in deep learning frameworks like
Tensorflow, Torch, Theano, etc., by using computational graphs. More
significantly, understanding back propagation on computational graphs
combines several different algorithms and their variations, such as backprop
through time and backprop with shared weights. Once everything is
converted into a computational graph, they are still the same algorithm −
just back propagation on computational graphs.

What is Computational Graph

A computational graph is defined as a directed graph where the nodes


correspond to mathematical operations. Computational graphs are a way of
expressing and evaluating a mathematical expression.

For example, here is a simple mathematical equation −

p=x+y

We can draw a computational graph of the above equation as follows.

The above computational graph has an addition node (node with "+" sign)
with two input variables x and y and one output p.

Let us take another example, slightly more complex. We have the following
equation.

g=(x+y)∗z

The above equation is represented by the following computational graph.


Computational Graphs and Backpropagation

Computational graphs and backpropagation, both are important core


concepts in deep learning for training neural networks.

1.Forward Pass

Forward pass is the procedure for evaluating the value of the mathematical
expression represented by computational graphs. Doing forward pass means
we are passing the value from variables in forward direction from the left
(input) to the right where the output is.

Let us consider an example by giving some value to all of the inputs.


Suppose, the following values are given to all of the inputs.

x=1,y=3,z=−3

By giving these values to the inputs, we can perform forward pass and get
the following values for the outputs on each node.

First, we use the value of x = 1 and y = 3, to get p = 4.

Then we use p = 4 and z = -3 to get g = -12. We go from left to right,


forwards.
To formalize our graphs, we also need to introduce the idea of an operation. An operation is a
simple function of one or more variables. Our graph language is accompanied by a set of
allowable operations. Functions more complicated than the operations in this set may be
described by composing many operations together. Without loss of generality, we define an
operation to return only a single output variable. This does not lose generality because the
output variable can have multiple entries, such as a vector. Software implementations of back-
propagation usually support operations with multiple outputs, but we avoid this case in our
description because it introduces many extra details that are not important to conceptual
understanding.

If a variable y is computed by applying an operation to a variable x, then we draw a directed


edge from x to y. We sometimes annotate the output node with the name of the operation
applied, and other times omit this label when the operation is clear from context.

Examples of computational graphs are shown in Fig. 6.8.


2. Backpropagation and the chain rule used in backward pass

The backpropagation algorithm is really just an example of the trusty chain
rule from calculus. It states how to find the influence of a certain input on
systems that are composed of multiple functions. So for example in the image
below, if you want to know the influence of x on the function g, we just
multiply the influence of f on g by the influence of x on f:
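
The following Python sketch carries out the forward pass and the chain-rule backward pass for the example g = (x + y) * z with the same input values as above.

x, y, z = 1.0, 3.0, -3.0

# Forward pass: evaluate each node from left to right.
p = x + y          # p = 4
g = p * z          # g = -12

# Backward pass: local derivatives combined with the chain rule.
dg_dp = z          # d(p*z)/dp = z = -3
dg_dz = p          # d(p*z)/dz = p = 4
dp_dx = 1.0        # d(x+y)/dx = 1
dp_dy = 1.0        # d(x+y)/dy = 1

dg_dx = dg_dp * dp_dx   # influence of x on g = -3
dg_dy = dg_dp * dp_dy   # influence of y on g = -3

print(p, g)                 # 4.0 -12.0
print(dg_dx, dg_dy, dg_dz)  # -3.0 -3.0 4.0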
