ML Unit 4
Learning Unit 4
Recurrent Neural Network
• Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is fed as input to the current step.
• In traditional neural networks, all the inputs and outputs are independent of each other. But in cases such as predicting the next word of a sentence, the previous words are required, and hence there is a need to remember them.
• Thus RNNs came into existence, solving this issue with the help of a hidden layer. The main and most important feature of an RNN is the hidden state, which remembers information about a sequence.
An RNN has a “memory” which remembers information about what has been calculated so far. It uses the same parameters for each input, as it performs the same task on all inputs and hidden states to produce the output. This reduces the number of parameters, unlike other neural networks.
How RNN works
The working of an RNN can be understood with the help of the example below:
Example:
Suppose there is a deeper network with one input layer, three hidden layers and one output layer. Like other neural networks, each hidden layer has its own set of weights and biases: say, (w1, b1) for hidden layer 1, (w2, b2) for the second hidden layer and (w3, b3) for the third hidden layer. This means that each of these layers is independent of the others, i.e. they do not memorize the previous outputs.
Now the RNN will do the following:
Here, h_t is the new state, h_{t-1} is the previous state and x_t is the current input. We now have a state of the previous input instead of the input itself, because the input neuron would have applied its transformation to the previous input. Each successive input is called a time step.

Formula for applying the activation function (tanh):

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)

Now, once the current state is calculated, we can calculate the output state as:

y_t = W_{hy} h_t + b_y
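As an illustration, here is a minimal NumPy sketch of these two equations. The weight names (W_xh, W_hh, W_hy) and the layer sizes are assumptions chosen for the example, not taken from the notes above.

import numpy as np

# Assumed sizes for illustration only.
input_size, hidden_size, output_size = 4, 8, 3

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input -> hidden
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden -> hidden (shared across time steps)
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1  # hidden -> output
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One time step: new hidden state and output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    y_t = W_hy @ h_t + b_y                            # y_t = W_hy h_t + b_y
    return h_t, y_t

# Run over a toy sequence of 5 time steps.
h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):
    h, y = rnn_step(x, h)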
Training through RNN
• A single time step of the input is supplied to the network, i.e. x_t is supplied to the network
• We then calculate its current state using a combination of the current input and the previous state, i.e. we calculate h_t
• The current h_t becomes h_{t-1} for the next time step
• We can go through as many time steps as the problem demands and combine the information from all the previous states
• Once all the time steps are completed, the final current state is used to calculate the output y_t
• The output is then compared to the actual output, and the error is generated
• The error is then backpropagated through the network to update the weights, and thus the network is trained (a training-loop sketch follows this list)
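A minimal sketch of this training loop, assuming PyTorch and a toy prediction task; the model, data and hyperparameters here are illustrative choices, not part of the notes.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: 32 sequences of 10 time steps, 4 features each; predict a 3-dim target per sequence.
x = torch.randn(32, 10, 4)
target = torch.randn(32, 3)

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)  # shares weights across time steps
head = nn.Linear(8, 3)                                       # maps final hidden state to output y_t
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    out, h_n = rnn(x)            # h_n: final hidden state after all time steps
    y = head(h_n.squeeze(0))     # output computed from the final state
    loss = loss_fn(y, target)    # compare to the actual output -> error
    optimizer.zero_grad()
    loss.backward()              # error is backpropagated through time
    optimizer.step()             # weights are updated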
Advantages of Recurrent Neural Network
• An RNN remembers information through time. It is useful in time-series prediction precisely because of this ability to remember previous inputs.
• Recurrent neural networks are even used together with convolutional layers to extend the effective pixel neighbourhood.
Applications of RNNs include:
2. Machine Translation
3. Image Captioning
4. Handwriting generation
Gated Recurrent Unit (GRU)
• Gated Recurrent Units (GRUs) are one of the popular variants of recurrent neural networks and have been widely used in the context of machine translation. GRUs can also be regarded as a simpler version of LSTMs (Long Short-Term Memory networks).
• The gated recurrent unit (GRU) was introduced to let each recurrent unit adaptively capture dependencies over different time scales.
Reset gate
• Essentially, this gate is used by the model to decide how much of the past information to forget.
• When the reset gate is close to zero, the hidden state is forced to ignore the previous hidden state and is reset with the current input.
• This allows the hidden state to discard any information that is found to be irrelevant in the future.
Update gate
• The update gate controls how much information from the previous hidden state is carried over to the current hidden state.
• This gate acts in a similar manner to the memory cell in the Long Short-Term Memory network and helps the RNN remember long-term information (a sketch of both gates follows below).
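A minimal NumPy sketch of a single GRU cell, showing how the reset gate r and update gate z described above combine the previous hidden state with the current input. The weight names and the standard GRU equations used here are assumptions based on the usual formulation, not taken verbatim from these notes.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step with reset gate r and update gate z."""
    W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h = params
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)               # update gate: how much of the state to update
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)               # reset gate: how much of the old state to forget
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev) + b_h)   # candidate state; r near 0 ignores h_prev
    return (1 - z) * h_prev + z * h_tilde                     # keep (1 - z) of the old state, take z of the candidate

# Toy parameters for illustration (input size 4, hidden size 8).
rng = np.random.default_rng(0)
n_in, n_h = 4, 8
params = tuple(rng.standard_normal(s) * 0.1 for s in
               [(n_h, n_in), (n_h, n_h), (n_h,)] * 3)
h = gru_cell(rng.standard_normal(n_in), np.zeros(n_h), params)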
Advantages of Gated Recurrent Unit
• The gated recurrent unit can be used to improve the memory capacity of a recurrent neural network while keeping the model easy to train.
• The hidden unit also helps address the vanishing gradient problem in recurrent neural networks.
• It can be used in various applications, including speech signal modeling, machine
translation, and handwriting recognition, among others.
Machine Translation
• Machine translation is the task of automatically converting source text in one
language to text in another language.
• In a machine translation task, the input already consists of a sequence of symbols
in some language, and the computer program must convert this into a sequence of
symbols in another language.
• Given a sequence of text in a source language, there is no one single best
translation of that text to another language. This is because of the natural
ambiguity and flexibility of human language.
• The fact is that accurate translation requires background knowledge in order to
resolve ambiguity and establish the content of the sentence.
• Examples: Google Translate, Google Assistant.
Here is a list of machine translation approaches:
• Statistical Machine Translation-
Statistical machine translation, or SMT for short, is the use of statistical models that learn to
translate text from a source language to a target language.
The approach is data-driven, requiring only a corpus of examples with both source and
target language text. This means linguists are no longer required to specify the rules of
translation.
• Neural Machine Translation-
Neural machine translation, or NMT for short, is the use of neural network models to learn a
statistical model for machine translation.
The key benefit of the approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation.
• Encoder-Decoder Model
Multilayer Perceptron neural network models can be used for machine translation, although they are limited by a fixed-length input sequence where the output must be the same length.
These early models have been greatly improved upon recently through the use of recurrent neural networks organized into an encoder-decoder architecture that allows for variable-length input and output sequences (a sketch follows below).
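A minimal PyTorch sketch of the encoder-decoder idea: one recurrent network compresses the variable-length source sequence into a context vector, and a second one generates the variable-length target sequence from it. All layer sizes, vocabulary sizes and tensor shapes here are illustrative assumptions.

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab=100, tgt_vocab=120, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encode the whole source sentence into a fixed-size context (the final hidden state).
        _, context = self.encoder(self.src_emb(src))
        # Decode the target sequence, starting from that context.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
        return self.out(dec_out)   # per-step scores over the target vocabulary

# Toy usage: batch of 2 source sentences (length 7) and target sentences (length 5).
model = EncoderDecoder()
src = torch.randint(0, 100, (2, 7))
tgt = torch.randint(0, 120, (2, 5))
logits = model(src, tgt)           # shape (2, 5, 120)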
Beam search and width
• Another popular heuristic is the beam search that expands upon the greedy search and
returns a list of most likely output sequences.
• Instead of greedily choosing the most likely next step as the sequence is constructed, the
beam search expands all possible next steps and keeps the k most likely, where k is a user-
specified parameter and controls the number of beams or parallel searches through the
sequence of probabilities.
• The local beam search algorithm keeps track of k states rather than just one. It begins with k
randomly generated states.
• At each step, all the successors of all k states are generated. If one of them is a goal, the
algorithm halts. Otherwise, it selects the k best successors from the complete list and
repeats. We do not need to start with random states; instead, we start with the k most likely
words as the first step in the sequence.
• Common beam width values are 1 for a greedy search and values of 5 or 10 for common
benchmark problems in machine translation.
• Larger beam widths result in better performance of a model, as the multiple candidate sequences increase the likelihood of better matching a target sequence. This increased performance comes at the cost of a decrease in decoding speed.
• The search process can halt for each candidate separately either by reaching a maximum
length, by reaching an end-of-sequence token, or by reaching a threshold likelihood.
Example:
• We can define a function to perform the beam search for a given sequence of probabilities and a beam width parameter k (the function is sketched below).
• At each step, each candidate sequence is expanded with all possible next steps. Each candidate sequence is scored by multiplying its step probabilities together.
• The k sequences with the highest scores are selected and all other candidates are pruned. The process then repeats until the end of the sequence.
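A sketch of such a function, following the scoring rule described above (multiply the probabilities of each step and keep the k best candidates); the toy probability matrix is an illustrative assumption.

def beam_search_decoder(probs, k):
    """probs: list of per-step probability distributions; k: beam width."""
    sequences = [([], 1.0)]                      # (token indices so far, product of probabilities)
    for step in probs:
        candidates = []
        for seq, score in sequences:
            for idx, p in enumerate(step):
                candidates.append((seq + [idx], score * p))   # expand every candidate with every next step
        # Keep only the k most likely sequences, prune the rest.
        sequences = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return sequences

# Toy example: 4 decoding steps over a vocabulary of 3 symbols.
probs = [[0.1, 0.5, 0.4],
         [0.3, 0.4, 0.3],
         [0.6, 0.2, 0.2],
         [0.1, 0.1, 0.8]]
for seq, score in beam_search_decoder(probs, k=3):
    print(seq, round(score, 4))

In practice the product of probabilities is usually replaced by a sum of log probabilities to avoid numerical underflow on long sequences; the structure of the search is the same.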
Reinforcement Learning
• Reinforcement Learning is a type of machine learning. It allows machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize their performance.
• Basically, an RL agent does not know anything about the environment; it learns what to do by exploring the environment.
• The environment, in return, provides rewards and a new state based on the actions of the agent. So, in reinforcement learning, we do not teach an agent how it should do something; we present it with rewards, positive or negative, based on its actions.
• Here we don't know which actions will produce rewards, we don't know when an action will produce a reward, and sometimes an action takes time to produce a reward.
• States: This is the position of the agent at a specific time step in the environment. Whenever the agent performs an action, the environment gives it a reward and a new state, reached by performing that action.
• Reward Function: Rewards are the numerical values that the agent receives for performing some action in some state(s) of the environment. The numerical value can be positive or negative based on the actions of the agent.
• Value Function: Gives the total amount of reward the agent can expect, starting from a particular state, over all states reachable from that state. With the value function you can find a policy.
• Model (Optional): Used for planning, instead of the simple trial-and-error approach common to reinforcement learning. The model predicts the possible next state after we take an action in a given state. (A small sketch of these pieces follows below.)
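To make these pieces concrete, here is a tiny, purely illustrative MDP written as plain Python data structures; the states, actions, rewards and transition probabilities are invented for the example.

# A toy 3-state MDP: states, actions, reward function and (optional) transition model.
states = ["s0", "s1", "s2"]          # s2 is terminal
actions = ["left", "right"]

# Reward function R(s, a): numerical feedback for taking action a in state s.
rewards = {("s0", "right"): 0.0, ("s1", "right"): 1.0,
           ("s0", "left"): -1.0,  ("s1", "left"): 0.0}

# Model P(s' | s, a): probability of landing in each next state (used for planning).
transitions = {("s0", "right"): {"s1": 1.0},
               ("s1", "right"): {"s2": 1.0},
               ("s0", "left"):  {"s0": 1.0},
               ("s1", "left"):  {"s0": 1.0}}

# A (deterministic) policy maps each non-terminal state to an action.
policy = {"s0": "right", "s1": "right"}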
RL-Framework
Following are the top five RL frameworks available:
1. Acme
2. DeeR
3. Dopamine
4. Frap
5. RLgraph
1. Acme
About: Acme is a framework for distributed reinforcement learning introduced by
DeepMind. The framework is used to build readable, efficient, research-oriented RL
algorithms. At its core, Acme is designed to enable simple descriptions of RL agents
that can be run at various scales of execution, including distributed agents. This
framework aims to make the results of various RL algorithms developed in academia
and industrial labs easier to reproduce and extend for the machine learning community
at large.
2. DeeR
About: DeeR is a Python library for deep reinforcement learning. The framework is built with modularity in mind, so that it can easily be adapted to any need, and provides many possibilities such as Double Q-learning, Prioritized Experience Replay, Deep Deterministic Policy Gradient (DDPG), and Combined Reinforcement via Abstract Representations (CRAR).
3. Dopamine
About: Dopamine is a popular research framework for fast prototyping of reinforcement learning algorithms. The framework aims to fill the need for a small, easily understandable codebase in which users can freely experiment with research ideas. The design principles of this framework include flexible development, reproducibility, easy experimentation and more.
4. Frap
• About: Frap, or Framework for Reinforcement learning And Planning, is a unifying framework that identifies the underlying dimensions on which any planning or learning algorithm has to decide. The framework provides deeper insight into the algorithmic space of planning and reinforcement learning and also suggests new approaches to integrate both fields. The aim of this framework is to provide a common language to categorize algorithms, as well as to identify new research directions.
5. RLgraph
About: RLgraph is a reinforcement learning framework for quickly prototyping, defining and executing reinforcement learning algorithms both in research and practice. The framework supports TensorFlow (or static graphs in general) as well as eager/define-by-run execution (PyTorch) through a single component interface. Using RLgraph, developers can combine high-level components in a space-independent manner and define input spaces.
Policy Function and Value Function
• A policy maps each state to the action the agent takes in that state, while a value function gives the expected return the agent can obtain from a state under that policy.
• MDPs (Markov Decision Processes) are useful for studying optimization problems solved via dynamic programming.
Bellman equations
Start from the definition of the state-value function,

v(s) = \mathbb{E}[G_t \mid S_t = s]

then substitute the return G_{t+1}, starting from time step t+1,

v(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s]

Finally, since expectation is a linear operator, i.e. \mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y], the expected value of the return G_{t+1} is the value of the state S_{t+1},

v(s) = \mathbb{E}[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s]

That gives us the Bellman equation for MRPs,

v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'}\, v(s')

The value of the state s is the reward we get upon leaving that state, plus a discounted average over the next possible successor states, where the value of each possible successor state is multiplied by the probability that we land in it.
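As a small illustration of this equation, the sketch below evaluates the state values of a toy Markov reward process by solving v = R + γPv directly; the three-state chain, rewards and transition matrix are invented for the example.

import numpy as np

# Toy MRP: 3 states, reward received on leaving each state, transition matrix P[s, s'].
R = np.array([1.0, 0.0, 5.0])
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)   # expected discounted return from each state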
Actor-critic model
1. The “Critic” estimates the value function. This could be the action-value (the Q value)
or state-value (the V value).
2. The “Actor” updates the policy distribution in the direction suggested by the Critic
(such as with policy gradients).
Actor-critic aims to take advantage of all the good stuff from both value-based and policy-based methods while eliminating all their drawbacks.
The principal idea is to split the model in two: one part computes an action based on a state, and the other produces the Q values of that action.
Value Iteration and Policy Iteration
• The value-iteration and policy-iteration algorithms are two fundamental methods for solving
MDPs. Both value- iteration and policy-iteration assume that the agent knows the MDP model of
the world (i.e. the agent knows the state-transition and reward probability functions). Therefore,
they can be used by the agent to (offline) plan its actions given knowledge about the
environment before interacting with it.
• Value iteration computes the optimal state-value function by iteratively improving the estimate of V(s). The algorithm initializes V(s) to arbitrary random values and repeatedly updates the Q(s, a) and V(s) values until they converge. Value iteration is guaranteed to converge to the optimal values (a sketch follows this list).
• The value-iteration algorithm keeps improving the value function at each iteration until the value function converges. Since the agent only cares about finding the optimal policy, the optimal policy can converge before the value function does. Therefore, another algorithm, called policy iteration, re-defines the policy at each step instead of repeatedly improving the value-function estimate, and computes the value function according to this new policy, repeating until the policy converges.
• Policy iteration is also guaranteed to converge to the optimal policy, and it often takes fewer iterations to converge than the value-iteration algorithm.
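A minimal sketch of tabular value iteration; the transition tensor, reward table and convergence threshold are illustrative assumptions.

import numpy as np

# Toy MDP: P[a][s, s'] transition probabilities, R[s, a] immediate rewards.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]],   # action 0
              [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]]])  # action 1
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

V = np.zeros(n_states)                       # arbitrary initial values
while True:
    # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V(s')
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)                    # V(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-6:     # stop when the values converge
        break
    V = V_new

policy = Q.argmax(axis=1)                    # greedy policy from the converged values
print(V, policy)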
Value-Iteration vs Policy-Iteration
• Both value-iteration and policy-iteration algorithms can be used for offline planning, where the agent is assumed to have prior knowledge about the effects of its actions on the environment (they assume the MDP model is known). Compared to each other, policy iteration is computationally more efficient, as it often takes considerably fewer iterations to converge, although each of its iterations is more computationally expensive.
How Actor-critic works
• Imagine you play a video game with a friend who provides you with feedback. You're the Actor and your friend is the Critic.
• At the beginning, you don't know how to play, so you try some actions randomly. The Critic observes your actions and provides feedback.
• Learning from this feedback, you'll update your policy and get better at playing the game.
• On the other hand, your friend (the Critic) will also update their own way of providing feedback so it can be better next time.
• The idea of Actor-Critic is to have two neural networks. We estimate both:
a) ACTOR: a policy function that controls how our agent acts.
b) CRITIC: a value function that measures how good these actions are.
• Both run in parallel. Because we have two models (Actor and Critic) that must be trained, we have two sets of weights that must be optimized separately (a sketch follows this list).
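A compact PyTorch sketch of the two networks and one update step, assuming a discrete-action environment; the network sizes, the advantage form of the update and all hyperparameters are illustrative assumptions rather than a prescribed method.

import torch
import torch.nn as nn

class Actor(nn.Module):                      # policy network: state -> action probabilities
    def __init__(self, n_obs=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class Critic(nn.Module):                     # value network: state -> V(s)
    def __init__(self, n_obs=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

actor, critic = Actor(), Critic()
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)    # two separate sets of weights
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(s, a, r, s_next, done, gamma=0.99):
    """One actor-critic update from a single transition."""
    v, v_next = critic(s), critic(s_next).detach()
    td_target = r + gamma * v_next * (1 - done)
    advantage = (td_target - v).detach()                      # the Critic's feedback to the Actor

    critic_loss = (td_target.detach() - v).pow(2).mean()      # Critic learns the value function
    actor_loss = -(actor(s).log_prob(a) * advantage).mean()   # Actor moves in the direction the Critic suggests

    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# Toy usage with made-up tensors (a single transition).
s = torch.randn(1, 4); s_next = torch.randn(1, 4)
a = torch.tensor([1]); r = torch.tensor([1.0]); done = torch.tensor([0.0])
update(s, a, r, s_next, done)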
Q-learning
• Consider the case where the agent does not know a priori what the effects of its actions on the environment are, i.e. the state-transition and reward models are not known.
• The agent only knows the set of possible states and actions, and can observe the current state of the environment. In this case, the agent has to learn actively, through the experience of interacting with the environment.
• There are two categories of learning algorithms:
Model-based learning:
• In model-based learning, the agent interacts with the environment and, from the history of its interactions, tries to approximate the environment's state-transition and reward models.
• Afterwards, given the models it has learnt, the agent can use value iteration or policy iteration to find an optimal policy.
Model-free learning:
• In model-free learning, the agent does not try to learn explicit models of the environment's state-transition and reward functions.
• Instead, it derives an optimal policy directly from its interactions with the environment.
• Q-Learning is an example of a model-free learning algorithm. It does not assume that the agent knows anything about the state-transition and reward models; instead, the agent discovers which actions are good and bad by trial and error.
• The basic idea of Q-Learning is to approximate the state-action value function Q(s, a) from the samples of Q(s, a) that we observe during interaction with the environment. This approach is known as Temporal-Difference (TD) Learning. (A tabular sketch follows below.)
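A minimal tabular Q-learning sketch on a toy chain environment; the environment, learning rate, discount factor and exploration rate are illustrative assumptions.

import random

# Toy chain environment: 5 states in a row, actions 0 = left, 1 = right.
# Reaching the rightmost state gives reward +1 and ends the episode.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.3

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

Q = [[0.0] * n_actions for _ in range(n_states)]   # Q(s, a) table, initialized to zero

def epsilon_greedy(s):
    if random.random() < epsilon:
        return random.randrange(n_actions)                    # explore
    return max(range(n_actions), key=lambda a: Q[s][a])       # exploit

for episode in range(500):
    s, done = 0, False
    while not done:
        a = epsilon_greedy(s)
        s_next, r, done = step(s, a)
        # Q-learning (off-policy TD) update: bootstrap from the best action in the next state.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next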
The Q-learning algorithm process
The Q-learning update bootstraps from the best action in the next state:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

SARSA
• SARSA stands for State-Action-Reward-State-Action, which symbolizes the tuple (s, a, r, s', a'); it is an on-policy algorithm for TD learning.
• The major difference between it and Q-learning is that the maximum reward for the next state is not necessarily used for updating the Q-values. Instead, a new action, and therefore a new reward, is selected using the same policy that determined the original action.
• The name SARSA comes from the fact that the updates are done using the quintuple Q(s, a, r, s', a'), where s, a are the original state and action, r is the reward observed in the following state, and s', a' are the new state-action pair.
The procedural form of the SARSA algorithm is as follows:
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Choose a from s using a policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of the episode):
        Take action a, observe r, s'
        Choose a' from s' using a policy derived from Q (e.g., ε-greedy)
        Q(s, a) <-- Q(s, a) + α [r + γ Q(s', a') – Q(s, a)]
        s <-- s'; a <-- a'
    Until s is terminal
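A runnable version of this pseudocode on the same toy chain environment used in the Q-learning sketch above; the environment and hyperparameters are again illustrative assumptions.

import random

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.3
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; rightmost state is terminal (+1)."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

def epsilon_greedy(s):
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

for episode in range(500):
    s, done = 0, False
    a = epsilon_greedy(s)                     # choose a from s using the epsilon-greedy policy
    while not done:
        s_next, r, done = step(s, a)          # take action a, observe r, s'
        a_next = epsilon_greedy(s_next)       # choose a' from s' using the same policy (on-policy)
        Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
        s, a = s_next, a_next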