ML Unit 4
Learning Unit 4
Recurrent Neural Network
• Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is fed as input to the current step.
• In traditional neural networks, all the inputs and outputs are independent of each other. But in cases such as predicting the next word of a sentence, the previous words are required, and hence there is a need to remember them.
• Thus RNNs came into existence, solving this issue with the help of a hidden layer. The main and most important feature of an RNN is the hidden state, which remembers information about a sequence.
An RNN has a “memory” which remembers information about what has been calculated so far. It uses the same parameters for each input, as it performs the same task on all inputs and hidden states to produce the output. This reduces the number of parameters, unlike other neural networks.
How RNN works
The working of an RNN can be understood with the help of the example below:
Example:
Suppose there is a deeper network with one input layer, three hidden layers and one output layer. Like other neural networks, each hidden layer has its own set of weights and biases: say, (w1, b1) for hidden layer 1, (w2, b2) for the second hidden layer and (w3, b3) for the third hidden layer. This means that each of these layers is independent of the others, i.e. they do not memorize the previous outputs.
Now the RNN will do the following:
Here, h_t is the new state, h_{t-1} is the previous state and x_t is the current input. We now have a state of the previous input instead of the input itself, because the input neuron would have applied its transformation to the previous input. Each successive input is called a time step.

Formula for applying the activation function (tanh):

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)

Now, once the current state is calculated, we can calculate the output state as:

y_t = W_{hy} h_t + b_y
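As an illustration, here is a minimal NumPy sketch of these two equations. The weight names (W_xh, W_hh, W_hy) and the layer sizes are assumptions chosen for the example, not taken from the notes above.

import numpy as np

# Assumed sizes for illustration only.
input_size, hidden_size, output_size = 4, 8, 3

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input -> hidden
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden -> hidden (shared across time steps)
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1  # hidden -> output
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One time step: new hidden state and output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    y_t = W_hy @ h_t + b_y                            # y_t = W_hy h_t + b_y
    return h_t, y_t

# Run over a toy sequence of 5 time steps.
h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):
    h, y = rnn_step(x, h)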
Training through RNN
• A single time step of the input is supplied to the network, i.e. x_t is supplied to the network
• We then calculate its current state using a combination of the current input and the previous state, i.e. we calculate h_t
• The current h_t becomes h_{t-1} for the next time step
• We can go through as many time steps as the problem demands and combine the information from all the previous states
• Once all the time steps are completed, the final current state is used to calculate the output y_t
• The output is then compared to the actual output, and the error is generated
• The error is then backpropagated through the network to update the weights, and thus the network is trained (a training-loop sketch follows this list)
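A minimal sketch of this training loop, assuming PyTorch and a toy prediction task; the model, data and hyperparameters here are illustrative choices, not part of the notes.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: 32 sequences of 10 time steps, 4 features each; predict a 3-dim target per sequence.
x = torch.randn(32, 10, 4)
target = torch.randn(32, 3)

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)  # shares weights across time steps
head = nn.Linear(8, 3)                                       # maps final hidden state to output y_t
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    out, h_n = rnn(x)            # h_n: final hidden state after all time steps
    y = head(h_n.squeeze(0))     # output computed from the final state
    loss = loss_fn(y, target)    # compare to the actual output -> error
    optimizer.zero_grad()
    loss.backward()              # error is backpropagated through time
    optimizer.step()             # weights are updated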
Advantages of Recurrent Neural Network
• An RNN remembers information through time. It is useful in time-series prediction precisely because of this ability to remember previous inputs.
• Recurrent neural networks are even used together with convolutional layers to extend the effective pixel neighbourhood.
Applications of RNNs include:
2. Machine Translation
3. Image Captioning
4. Handwriting generation
Gated Recurrent Unit (GRU)
• Gated Recurrent Units (GRUs) are one of the popular variants of recurrent neural networks and have been widely used in the context of machine translation. GRUs can also be regarded as a simpler version of LSTMs (Long Short-Term Memory networks).
• The gated recurrent unit (GRU) was introduced to let each recurrent unit adaptively capture dependencies over different time scales.
Reset gate
• Essentially, this gate is used by the model to decide how much of the past information to forget.
• When the reset gate is close to zero, the hidden state is forced to ignore the previous hidden state and is reset with the current input.
• This allows the hidden state to discard any information that is found to be irrelevant in the future.
Update gate
• The update gate controls how much information from the previous hidden state is carried over to the current hidden state.
• This gate acts in a similar manner to the memory cell in the Long Short-Term Memory network and helps the RNN remember long-term information (a sketch of both gates follows below).
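A minimal NumPy sketch of a single GRU cell, showing how the reset gate r and update gate z described above combine the previous hidden state with the current input. The weight names and the standard GRU equations used here are assumptions based on the usual formulation, not taken verbatim from these notes.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step with reset gate r and update gate z."""
    W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h = params
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)               # update gate: how much of the state to update
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)               # reset gate: how much of the old state to forget
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev) + b_h)   # candidate state; r near 0 ignores h_prev
    return (1 - z) * h_prev + z * h_tilde                     # keep (1 - z) of the old state, take z of the candidate

# Toy parameters for illustration (input size 4, hidden size 8).
rng = np.random.default_rng(0)
n_in, n_h = 4, 8
params = tuple(rng.standard_normal(s) * 0.1 for s in
               [(n_h, n_in), (n_h, n_h), (n_h,)] * 3)
h = gru_cell(rng.standard_normal(n_in), np.zeros(n_h), params)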
Advantages of Gated Recurrent Unit
• The gated recurrent unit can be used to improve the memory capacity of a recurrent neural network while keeping the model easy to train.
• The hidden unit also helps address the vanishing gradient problem in recurrent neural networks.
• It can be used in various applications, including speech signal modeling, machine
translation, and handwriting recognition, among others.
Machine Translation
• Machine translation is the task of automatically converting source text in one
language to text in another language.
• In a machine translation task, the input already consists of a sequence of symbols
in some language, and the computer program must convert this into a sequence of
symbols in another language.
• Given a sequence of text in a source language, there is no one single best
translation of that text to another language. This is because of the natural
ambiguity and flexibility of human language.
• The fact is that accurate translation requires background knowledge in order to
resolve ambiguity and establish the content of the sentence.
• Examples: Google Translate, Google Assistant.
Here is a list of machine translation approaches:
• Statistical Machine Translation-
Statistical machine translation, or SMT for short, is the use of statistical models that learn to
translate text from a source language to a target language.
The approach is data-driven, requiring only a corpus of examples with both source and
target language text. This means linguists are no longer required to specify the rules of
translation.
• Neural Machine Translation-
Neural machine translation, or NMT for short, is the use of neural network models to learn a
statistical model for machine translation.
The key benefit of the approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation.
• Encoder-Decoder Model
Multilayer Perceptron neural network models can be used for machine translation, although they are limited by a fixed-length input sequence where the output must be the same length.
These early models have been greatly improved upon recently through the use of recurrent neural networks organized into an encoder-decoder architecture that allows for variable-length input and output sequences (a sketch follows below).
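A minimal PyTorch sketch of the encoder-decoder idea: one recurrent network compresses the variable-length source sequence into a context vector, and a second one generates the variable-length target sequence from it. All layer sizes, vocabulary sizes and tensor shapes here are illustrative assumptions.

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab=100, tgt_vocab=120, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encode the whole source sentence into a fixed-size context (the final hidden state).
        _, context = self.encoder(self.src_emb(src))
        # Decode the target sequence, starting from that context.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
        return self.out(dec_out)   # per-step scores over the target vocabulary

# Toy usage: batch of 2 source sentences (length 7) and target sentences (length 5).
model = EncoderDecoder()
src = torch.randint(0, 100, (2, 7))
tgt = torch.randint(0, 120, (2, 5))
logits = model(src, tgt)           # shape (2, 5, 120)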
Beam search and width
• Another popular heuristic is the beam search that expands upon the greedy search and
returns a list of most likely output sequences.
• Instead of greedily choosing the most likely next step as the sequence is constructed, the
beam search expands all possible next steps and keeps the k most likely, where k is a user-
specified parameter and controls the number of beams or parallel searches through the
sequence of probabilities.
• The local beam search algorithm keeps track of k states rather than just one. It begins with k
randomly generated states.
• At each step, all the successors of all k states are generated. If one of them is a goal, the
algorithm halts. Otherwise, it selects the k best successors from the complete list and
repeats. We do not need to start with random states; instead, we start with the k most likely
words as the first step in the sequence.
• Common beam width values are 1 for a greedy search and values of 5 or 10 for common
benchmark problems in machine translation.
• Larger beam widths result in better performance of a model, as the multiple candidate sequences increase the likelihood of better matching a target sequence. This increased performance comes at the cost of a decrease in decoding speed.
• The search process can halt for each candidate separately either by reaching a maximum
length, by reaching an end-of-sequence token, or by reaching a threshold likelihood.
Example:
• We can define a function to perform the beam search for a given sequence of probabilities and a beam width parameter k (the function is sketched below).
• At each step, each candidate sequence is expanded with all possible next steps. Each candidate sequence is scored by multiplying its step probabilities together.
• The k sequences with the highest scores are selected and all other candidates are pruned. The process then repeats until the end of the sequence.
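A sketch of such a function, following the scoring rule described above (multiply the probabilities of each step and keep the k best candidates); the toy probability matrix is an illustrative assumption.

def beam_search_decoder(probs, k):
    """probs: list of per-step probability distributions; k: beam width."""
    sequences = [([], 1.0)]                      # (token indices so far, product of probabilities)
    for step in probs:
        candidates = []
        for seq, score in sequences:
            for idx, p in enumerate(step):
                candidates.append((seq + [idx], score * p))   # expand every candidate with every next step
        # Keep only the k most likely sequences, prune the rest.
        sequences = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return sequences

# Toy example: 4 decoding steps over a vocabulary of 3 symbols.
probs = [[0.1, 0.5, 0.4],
         [0.3, 0.4, 0.3],
         [0.6, 0.2, 0.2],
         [0.1, 0.1, 0.8]]
for seq, score in beam_search_decoder(probs, k=3):
    print(seq, round(score, 4))

In practice the product of probabilities is usually replaced by a sum of log probabilities to avoid numerical underflow on long sequences; the structure of the search is the same.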
Reinforcement Learning
• Reinforcement Learning is a type of machine learning. It allows machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize their performance.
• Basically, an RL agent does not know anything about the environment; it learns what to do by exploring the environment.
• The environment, in return, provides rewards and a new state based on the actions of the agent. So, in reinforcement learning, we do not teach an agent how it should do something; we present it with rewards, positive or negative, based on its actions.
• Here we don't know which actions will produce rewards, we don't know when an action will produce a reward, and sometimes an action takes time to produce a reward.
• States: This is the position of the agent at a specific time step in the environment. Whenever the agent performs an action, the environment gives it a reward and a new state, reached by performing that action.
• Reward Function: Rewards are the numerical values that the agent receives for performing some action in some state(s) of the environment. The numerical value can be positive or negative based on the actions of the agent.
• Value Function: Gives the total amount of reward the agent can expect, starting from a particular state, over all states reachable from that state. With the value function you can find a policy.
• Model (Optional): Used for planning, instead of the simple trial-and-error approach common to reinforcement learning. The model predicts the possible next state after we take an action in a given state. (A small sketch of these pieces follows below.)
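To make these pieces concrete, here is a tiny, purely illustrative MDP written as plain Python data structures; the states, actions, rewards and transition probabilities are invented for the example.

# A toy 3-state MDP: states, actions, reward function and (optional) transition model.
states = ["s0", "s1", "s2"]          # s2 is terminal
actions = ["left", "right"]

# Reward function R(s, a): numerical feedback for taking action a in state s.
rewards = {("s0", "right"): 0.0, ("s1", "right"): 1.0,
           ("s0", "left"): -1.0,  ("s1", "left"): 0.0}

# Model P(s' | s, a): probability of landing in each next state (used for planning).
transitions = {("s0", "right"): {"s1": 1.0},
               ("s1", "right"): {"s2": 1.0},
               ("s0", "left"):  {"s0": 1.0},
               ("s1", "left"):  {"s0": 1.0}}

# A (deterministic) policy maps each non-terminal state to an action.
policy = {"s0": "right", "s1": "right"}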
RL-Framework
Following are the top five RL frameworks available:
1. Acme
2. DeeR
3. Dopamine
4. Frap
5. RLgraph
1. Acme
About: Acme is a framework for distributed reinforcement learning introduced by
DeepMind. The framework is used to build readable, efficient, research-oriented RL
algorithms. At its core, Acme is designed to enable simple descriptions of RL agents
that can be run at various scales of execution, including distributed agents. This
framework aims to make the results of various RL algorithms developed in academia
and industrial labs easier to reproduce and extend for the machine learning community
at large.
2. DeeR
About: DeeR is a Python library for deep reinforcement learning. The framework is built with modularity in mind, so that it can easily be adapted to any need, and provides many possibilities such as Double Q-learning, Prioritized Experience Replay, Deep Deterministic Policy Gradient (DDPG), and Combined Reinforcement via Abstract Representations (CRAR).
3. Dopamine
About: Dopamine is a popular research framework for fast prototyping of reinforcement learning algorithms. The framework aims to fill the need for a small, easily understandable codebase in which users can freely experiment with research ideas. The design principles of this framework include flexible development, reproducibility, easy experimentation and more.
4. Frap
• About: Frap, or Framework for Reinforcement learning And Planning, is a unifying framework that identifies the underlying dimensions on which any planning or learning algorithm has to decide. The framework provides deeper insight into the algorithmic space of planning and reinforcement learning and also suggests new approaches to integrate both fields. The aim of this framework is to provide a common language to categorize algorithms, as well as to identify new research directions.
5. RLgraph
About: RLgraph is a reinforcement learning framework for quickly prototyping, defining and executing reinforcement learning algorithms both in research and practice. The framework supports TensorFlow (or static graphs in general) as well as eager/define-by-run execution (PyTorch) through a single component interface. Using RLgraph, developers can combine high-level components in a space-independent manner and define input spaces.
Policy Function and Value Function
• A policy maps each state to the action the agent takes in that state, while a value function gives the expected return the agent can obtain from a state under that policy.
• MDPs (Markov Decision Processes) are useful for studying optimization problems solved via dynamic programming.
Bellman equations
Start from the definition of the state-value function,

v(s) = \mathbb{E}[G_t \mid S_t = s]

then substitute the return G_{t+1}, starting from time step t+1,

v(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s]

Finally, since expectation is a linear operator, i.e. \mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y], the expected value of the return G_{t+1} is the value of the state S_{t+1},

v(s) = \mathbb{E}[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s]

That gives us the Bellman equation for MRPs,

v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'}\, v(s')

The value of the state s is the reward we get upon leaving that state, plus a discounted average over the next possible successor states, where the value of each possible successor state is multiplied by the probability that we land in it.
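As a small illustration of this equation, the sketch below evaluates the state values of a toy Markov reward process by solving v = R + γPv directly; the three-state chain, rewards and transition matrix are invented for the example.

import numpy as np

# Toy MRP: 3 states, reward received on leaving each state, transition matrix P[s, s'].
R = np.array([1.0, 0.0, 5.0])
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)   # expected discounted return from each state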
Actor-critic model
1. The “Critic” estimates the value function. This could be the action-value (the Q value)
or state-value (the V value).
2. The “Actor” updates the policy distribution in the direction suggested by the Critic
(such as with policy gradients).
Actor-critic aims to take advantage of all the good stuff from both value-based and policy-based methods while eliminating all their drawbacks.
The principal idea is to split the model in two: one part computes an action based on a state, and the other produces the Q values of that action.
Value Iteration and Policy Iteration
• The value-iteration and policy-iteration algorithms are two fundamental methods for solving
MDPs. Both value- iteration and policy-iteration assume that the agent knows the MDP model of
the world (i.e. the agent knows the state-transition and reward probability functions). Therefore,
they can be used by the agent to (offline) plan its actions given knowledge about the
environment before interacting with it.
• Value iteration computes the optimal state-value function by iteratively improving the estimate of V(s). The algorithm initializes V(s) to arbitrary random values and repeatedly updates the Q(s, a) and V(s) values until they converge. Value iteration is guaranteed to converge to the optimal values (a sketch follows this list).
• The value-iteration algorithm keeps improving the value function at each iteration until the value function converges. Since the agent only cares about finding the optimal policy, the optimal policy can converge before the value function does. Therefore, another algorithm, called policy iteration, re-defines the policy at each step instead of repeatedly improving the value-function estimate, and computes the value function according to this new policy, repeating until the policy converges.
• Policy iteration is also guaranteed to converge to the optimal policy, and it often takes fewer iterations to converge than the value-iteration algorithm.
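A minimal sketch of tabular value iteration; the transition tensor, reward table and convergence threshold are illustrative assumptions.

import numpy as np

# Toy MDP: P[a][s, s'] transition probabilities, R[s, a] immediate rewards.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]],   # action 0
              [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]]])  # action 1
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

V = np.zeros(n_states)                       # arbitrary initial values
while True:
    # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V(s')
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)                    # V(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-6:     # stop when the values converge
        break
    V = V_new

policy = Q.argmax(axis=1)                    # greedy policy from the converged values
print(V, policy)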
Value-Iteration vs Policy-Iteration
• Both value-iteration and policy-iteration algorithms can be used for offline planning, where the agent is assumed to have prior knowledge about the effects of its actions on the environment (they assume the MDP model is known). Compared to each other, policy iteration is computationally more efficient, as it often takes considerably fewer iterations to converge, although each of its iterations is more computationally expensive.
How Actor-critic works
• Imagine you play a video game with a friend who provides you with feedback. You're the Actor and your friend is the Critic.
• At the beginning, you don't know how to play, so you try some actions randomly. The Critic observes your actions and provides feedback.
• Learning from this feedback, you'll update your policy and get better at playing the game.
• On the other hand, your friend (the Critic) will also update their own way of providing feedback so it can be better next time.
• The idea of Actor-Critic is to have two neural networks. We estimate both:
a) ACTOR: a policy function that controls how our agent acts.
b) CRITIC: a value function that measures how good these actions are.
• Both run in parallel. Because we have two models (Actor and Critic) that must be trained, we have two sets of weights that must be optimized separately (a sketch follows this list).
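A compact PyTorch sketch of the two networks and one update step, assuming a discrete-action environment; the network sizes, the advantage form of the update and all hyperparameters are illustrative assumptions rather than a prescribed method.

import torch
import torch.nn as nn

class Actor(nn.Module):                      # policy network: state -> action probabilities
    def __init__(self, n_obs=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class Critic(nn.Module):                     # value network: state -> V(s)
    def __init__(self, n_obs=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

actor, critic = Actor(), Critic()
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)    # two separate sets of weights
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(s, a, r, s_next, done, gamma=0.99):
    """One actor-critic update from a single transition."""
    v, v_next = critic(s), critic(s_next).detach()
    td_target = r + gamma * v_next * (1 - done)
    advantage = (td_target - v).detach()                      # the Critic's feedback to the Actor

    critic_loss = (td_target.detach() - v).pow(2).mean()      # Critic learns the value function
    actor_loss = -(actor(s).log_prob(a) * advantage).mean()   # Actor moves in the direction the Critic suggests

    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# Toy usage with made-up tensors (a single transition).
s = torch.randn(1, 4); s_next = torch.randn(1, 4)
a = torch.tensor([1]); r = torch.tensor([1.0]); done = torch.tensor([0.0])
update(s, a, r, s_next, done)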
Q-learning
• Consider the case where the agent does not know a priori what the effects of its actions on the environment are, i.e. the state-transition and reward models are not known.
• The agent only knows the set of possible states and actions, and can observe the current state of the environment. In this case, the agent has to learn actively, through the experience of interacting with the environment.
• There are two categories of learning algorithms:
Model-based learning:
• In model-based learning, the agent interacts with the environment and, from the history of its interactions, tries to approximate the environment's state-transition and reward models.
• Afterwards, given the models it has learnt, the agent can use value iteration or policy iteration to find an optimal policy.
Model-free learning:
• In model-free learning, the agent does not try to learn explicit models of the environment's state-transition and reward functions.
• Instead, it derives an optimal policy directly from its interactions with the environment.
• Q-Learning is an example of a model-free learning algorithm. It does not assume that the agent knows anything about the state-transition and reward models; instead, the agent discovers which actions are good and bad by trial and error.
• The basic idea of Q-Learning is to approximate the state-action value function Q(s, a) from the samples of Q(s, a) that we observe during interaction with the environment. This approach is known as Temporal-Difference (TD) Learning. (A tabular sketch follows below.)
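A minimal tabular Q-learning sketch on a toy chain environment; the environment, learning rate, discount factor and exploration rate are illustrative assumptions.

import random

# Toy chain environment: 5 states in a row, actions 0 = left, 1 = right.
# Reaching the rightmost state gives reward +1 and ends the episode.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.3

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

Q = [[0.0] * n_actions for _ in range(n_states)]   # Q(s, a) table, initialized to zero

def epsilon_greedy(s):
    if random.random() < epsilon:
        return random.randrange(n_actions)                    # explore
    return max(range(n_actions), key=lambda a: Q[s][a])       # exploit

for episode in range(500):
    s, done = 0, False
    while not done:
        a = epsilon_greedy(s)
        s_next, r, done = step(s, a)
        # Q-learning (off-policy TD) update: bootstrap from the best action in the next state.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next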
The Q-learning algorithm process
The Q-learning update bootstraps from the best action in the next state:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

SARSA
• SARSA stands for State-Action-Reward-State-Action, which symbolizes the tuple (s, a, r, s', a'); it is an on-policy algorithm for TD learning.
• The major difference between it and Q-learning is that the maximum reward for the next state is not necessarily used for updating the Q-values. Instead, a new action, and therefore a new reward, is selected using the same policy that determined the original action.
• The name SARSA comes from the fact that the updates are done using the quintuple Q(s, a, r, s', a'), where s, a are the original state and action, r is the reward observed in the following state, and s', a' are the new state-action pair.
The procedural form of the SARSA algorithm is as follows:
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Choose a from s using a policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of the episode):
        Take action a, observe r, s'
        Choose a' from s' using a policy derived from Q (e.g., ε-greedy)
        Q(s, a) <-- Q(s, a) + α [r + γ Q(s', a') – Q(s, a)]
        s <-- s'; a <-- a'
    Until s is terminal
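A runnable version of this pseudocode on the same toy chain environment used in the Q-learning sketch above; the environment and hyperparameters are again illustrative assumptions.

import random

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.3
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; rightmost state is terminal (+1)."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

def epsilon_greedy(s):
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

for episode in range(500):
    s, done = 0, False
    a = epsilon_greedy(s)                     # choose a from s using the epsilon-greedy policy
    while not done:
        s_next, r, done = step(s, a)          # take action a, observe r, s'
        a_next = epsilon_greedy(s_next)       # choose a' from s' using the same policy (on-policy)
        Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
        s, a = s_next, a_next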