Revolutionising B.Tech
Module 1
Reinforcement Learning and Markov
Decision Process
Course Name: Reinforcement Learning [22CSE563]
Table of Contents
• Aim
• Objectives
• Reinforcement Learning
• Examples Of Reinforcement Learning
• Elements of Reinforcement Learning
• Example: Tic-Tac-Toe
• History of Reinforcement Learning
• Multi-arm Bandit Problem
Aim

To equip students with the knowledge and skills to design intelligent systems, the reinforcement learning subject introduces foundational concepts like agent-environment interactions and reward maximization. Students will explore key techniques such as dynamic programming and temporal difference learning. They will also understand the balance between exploration and exploitation in decision-making.
Objective
The objective of reinforcement learning (RL) is to enable agents to learn optimal decision-making strategies through interaction with their environment. By receiving rewards or penalties based on their actions, agents iteratively improve their behavior to maximize cumulative rewards over time. RL aims to solve complex problems where explicit programming is not feasible by using trial and error to discover effective policies. Ultimately, it seeks to develop systems capable of autonomous learning and adaptation in dynamic and uncertain environments.
Reinforcement Learning
Branches of Machine Learning
Supervised Vs. Unsupervised Vs. Reinforcement Learning

• Definition
  • Supervised Learning: Learning from labeled data to predict outcomes for new data.
  • Unsupervised Learning: Learning from unlabeled data to identify patterns and structures.
  • Reinforcement Learning: Learning to make decisions by performing actions in an environment and receiving rewards or penalties.
• Data Requirement
  • Supervised Learning: Requires a dataset with input-output pairs. Data must be labeled.
  • Unsupervised Learning: Works with unlabeled data. No need for input-output pairs.
  • Reinforcement Learning: No predefined dataset; learns from interactions with the environment through trial and error.
• Output
  • Supervised Learning: A predictive model that maps inputs to outputs.
  • Unsupervised Learning: A model that identifies the data's patterns, clusters, associations, or features.
  • Reinforcement Learning: A policy or strategy that specifies the action to take in each state of the environment.
• Feedback
  • Supervised Learning: Direct feedback (correct output is known).
  • Unsupervised Learning: No explicit feedback; the algorithm infers structures.
  • Reinforcement Learning: Indirect feedback (rewards or penalties after actions, not necessarily immediate).
• Goal
  • Supervised Learning: Minimize the error between predicted and actual outputs.
  • Unsupervised Learning: Discover the underlying structure of the data.
  • Reinforcement Learning: Maximize cumulative reward over time.
Supervised Vs. Unsupervised Vs. Reinforcement Learning

• Examples
  • Supervised Learning: Image classification, spam detection, regression tasks.
  • Unsupervised Learning: Clustering, dimensionality reduction, market basket analysis.
  • Reinforcement Learning: Video game AI, robotic control, dynamic pricing, personalized recommendations.
• Learning Approach
  • Supervised Learning: Learns from examples provided during training.
  • Unsupervised Learning: Learns patterns or features from data without specific guidance.
  • Reinforcement Learning: Learns from the consequences of its actions rather than from direct instruction.
• Evaluation
  • Supervised Learning: Typically evaluated on a separate test set using accuracy, precision, recall, etc.
  • Unsupervised Learning: Evaluated based on metrics like silhouette score, within-cluster sum of squares, etc.
  • Reinforcement Learning: Evaluated based on the amount of reward it can secure over time in the environment.
• Challenges
  • Supervised Learning: Requires a large amount of labeled data, which can be expensive or impractical.
  • Unsupervised Learning: Difficult to validate results as there is no true benchmark; interpretation is often subjective.
  • Reinforcement Learning: Requires a balance between exploration and exploitation and can be challenging in environments with sparse rewards.
Supervised Vs. Reinforcement Learning
Drawbacks of Supervised Learning
• Supervised learning requires the creation of a data set to train on, which is not always an
easy task.
• If you train your neural network model simply to imitate the actions of a human player, your agent can never become better at playing the game of Pong than that human gamer.
Reinforcement Learning
• Reinforcement Learning is a branch of Machine Learning, also called Online Learning.
• Reinforcement Learning (RL) is an interesting domain of artificial intelligence that simulates
the learning process by trial and error, mimicking how humans and animals learn from the
consequences of their actions.
• At its core, RL involves an agent that makes decisions in a dynamic environment to achieve
a set of objectives, aiming to maximize cumulative rewards.
• Unlike traditional machine learning paradigms, where models learn from a fixed data set, RL
agents learn from continuous feedback and are refined as they interact with their
environment.
• It is used to decide what action to take at t+1 based on data up to time t.
• Reinforcement Learning (RL) is a branch of machine learning that teaches agents how to
make decisions by interacting with an environment to achieve a goal.
• In RL, an agent learns to perform tasks by trying different strategies to maximize cumulative
rewards based on feedback received through its actions.
Reinforcement Learning
• RL solves a specific type of problem where decision making is sequential, and the goal is
long-term, such as game-playing, robotics, etc.
• The agent interacts with the environment and explores it by itself.
• The primary goal of an agent in reinforcement learning is to improve its performance by accumulating the maximum positive reward.
• A popular example of reinforcement learning:
• Chess engine.
• The agent decides upon a series of moves depending on the state of the board
(the environment), and the reward can be defined as a win or lose at the end of
the game.
• How a robotic dog learns the movement of its limbs is another example of reinforcement learning.
Reinforcement Learning
• It is a core part of artificial intelligence, and many AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.
• Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing some actions, and based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
• The agent continues doing these three things (take action, change state/remain in the same state, and get feedback), and by doing these actions, it learns and explores the environment.
• The agent learns which actions lead to positive feedback (rewards) and which actions lead to negative feedback (penalties).
• As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
Reinforcement Learning Terminologies
Reinforcement Learning (RL) involves a variety of terms and concepts that are fundamental to
understanding and implementing RL algorithms.
1. Agent
• The decision-maker in an RL setting interacts with the environment by performing
actions based on its policy to maximize cumulative rewards.
2. Environment
• The external system with which the agent interacts during the learning process.
• It responds to the agent's actions by presenting new states and rewards.
3. State
• A description of the current situation in the environment.
• States can vary in complexity from simple numerical values to complex sensory inputs
like images.
4. Action
• A specific step or decision taken by the agent to interact with the environment.
• The set of all possible actions available to the agent is known as the action space.
Reinforcement Learning Terminologies
5. Reward
• A scalar feedback signal received by the agent from the environment indicates an action's
effectiveness.
• The agent's goal is to maximize the sum of these rewards over time.
6. Policy (π)
• A strategy or rule that defines the agent’s way of behaving at a given time.
• A policy maps states to actions, determining what action to take in each state.
7. Value Function
• A function that estimates how good it is for the agent to be in a particular state (State-
Value Function) or how good it is to perform a particular action in a particular state (Action-
Value Function).
• The "goodness" is defined in terms of expected future rewards.
8. Q-function (Action-Value Function)
• A function that estimates the total amount of rewards an agent can expect to accumulate
over the future, starting from a given state and taking a particular action under a specific
policy.
Reinforcement Learning Terminologies
9. Model
• In model-based RL, the model predicts the next state and reward for each action
taken in each state.
• In model-free RL, the agent learns directly from the experience without this model.

10. Exploration
• The act of trying new actions to discover more about the environment.
• Exploration helps the agent to learn about rewards associated with lesser-known
actions.

11. Exploitation
• Using the known information to maximize the reward.
• Exploitation leverages the agent's current knowledge to perform the best-known
action to gain the highest reward.
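To tie these terms together, here is a minimal, self-contained sketch of the agent-environment loop in Python. The environment, its states, and the random policy are hypothetical toy stand-ins (not part of the slides); the point is only to show where state, action, reward, and policy appear in the loop.

```python
import random

# A hypothetical toy environment: states 0..4 on a line, goal at state 4.
class ToyEnvironment:
    def reset(self):
        self.state = 0              # initial state
        return self.state

    def step(self, action):
        # action: -1 = move left, +1 = move right (the action space)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0   # scalar reward signal
        done = self.state == 4                     # episode ends at the goal
        return self.state, reward, done

def random_policy(state):
    # A (stochastic) policy: maps the current state to an action.
    return random.choice([-1, +1])

env = ToyEnvironment()
state = env.reset()
total_reward = 0.0
for t in range(100):                        # agent-environment interaction loop
    action = random_policy(state)           # agent picks an action from its policy
    state, reward, done = env.step(action)  # environment returns next state and reward
    total_reward += reward                  # the agent's goal: maximize cumulative reward
    if done:
        break
print("cumulative reward:", total_reward)
```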
Rewards

• A reward Rt is a scalar feedback signal
• It indicates how well the agent is doing at step t
• The agent's job is to maximise cumulative reward
• Reinforcement learning is based on the reward hypothesis

Definition (Reward Hypothesis)
• All goals can be described by the maximisation of expected cumulative reward
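To make "expected cumulative reward" concrete, the standard formulation (e.g., Sutton & Barto) defines the return from step t as the sum of future rewards, optionally discounted by a factor γ; the formulas below are this standard definition rather than something written on the slide:

Gt = Rt+1 + Rt+2 + ... + RT                  (episodic tasks)
Gt = Rt+1 + γ Rt+2 + γ^2 Rt+3 + ...          (continuing tasks, 0 ≤ γ < 1)

The agent's objective is then to maximise the expected return E[Gt].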
Examples of Rewards
• Fly stunt manoeuvres in a helicopter
  +ve reward for following desired trajectory
  -ve reward for crashing
• Make a humanoid robot walk
  +ve reward for forward motion
  -ve reward for falling over
• Defeat the world champion at Backgammon
  +/-ve reward for winning/losing a game
• Play many different Atari games better than humans
  +/-ve reward for increasing/decreasing score
• Manage an investment portfolio
  +ve reward for each $ in bank
• Control a power station
  +ve reward for producing power
  -ve reward for exceeding safety thresholds
Policy

• A policy is the agent's behaviour
• It is a map from state to action, e.g.
  • Deterministic policy: a = π(s)
  • Stochastic policy: π(a|s) = P[At = a | St = s]
Value Function

• A value function is a prediction of future reward
• It is used to evaluate the goodness/badness of states
• And therefore to select between actions, e.g.
  vπ(s) = Eπ[ Rt+1 + γ Rt+2 + γ^2 Rt+3 + ... | St = s ]
Sequential Decision Making
• Goal: select actions to maximise total future reward
• Actions may have long-term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward

Examples:
• A financial investment (may take months to mature)
• Refuelling a helicopter (might prevent a crash in several hours)
• Blocking opponent moves (might help winning chances many moves from now)
Working on Reinforcement Learning

• Agent:
• A reinforcement learning agent is the entity
which we are training to make correct decisions.
• For example, a robot that is being trained to
move around a house without crashing.
• Environment:
• The environment is the surroundings with which
the agent interacts.
• For example, the house where the robot moves.
• The agent cannot manipulate the environment; it
can only control its own actions.
• In other words, the robot can’t control where a
table is in the house, but it can walk around it.
Working on Reinforcement Learning

• State:
• The state defines the current situation of the
agent.
• This can be the exact position of the robot in the
house, the alignment of its two legs or its current
posture.
• It all depends on how you address the problem.
• Action:
• The choice that the agent makes at the current
time step.
• For example, the robot can move its right or left
leg, raise its arm, lift an object or turn right/left,
etc.
• We know the set of actions (decisions) that the
agent can perform in advance.
Working on Reinforcement Learning

• Policy:
• A policy is the thought process behind picking
an action.
• In practice, it’s a probability distribution
assigned to the set of actions.
• Highly rewarding actions will have a high
probability and vice versa.
• If an action has a low probability, it doesn’t
mean it won’t be picked at all.
• It’s just less likely to be picked.

• Value:
  ― It is the expected long-term return with the discount factor, as opposed to the short-term reward.
• Q-value (Q(s, a)):
  ― It is similar to the value, but it takes one additional parameter: the current action.
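For reference, the Q-value mentioned above has a standard definition (not written out on the slide): the action-value function under a policy π with discount factor γ,

Qπ(s, a) = Eπ[ Rt+1 + γ Rt+2 + γ^2 Rt+3 + ... | St = s, At = a ]

which conditions on both the current state and the current action, i.e. the "one additional parameter" compared with the state value Vπ(s).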
Key Features of Reinforcement Learning
• In RL, the agent is not instructed about the environment and what actions need to be
taken.
• It is based on a trial-and-error (hit-and-trial) process.
• The agent takes the next action and changes states according to the feedback of the
previous action.
• The agent may get a delayed reward.
• The environment is stochastic, and the agent needs to explore it in order to obtain the maximum positive reward.
History of Reinforcement Learning
Approaches to implement Reinforcement Learning
• The world of reinforcement learning (RL) offers a diverse toolbox of algorithms.
• Some popular examples include Q-learning, policy gradient methods, and Monte Carlo
methods, along with temporal difference learning.
• Deep RL takes things a step further by incorporating powerful deep neural networks into
the RL framework.
• One such deep RL algorithm is Trust Region Policy Optimization (TRPO).
RL Agent Taxonomy
Approaches to implement Reinforcement Learning
Value-Based
• Focuses on learning a value function that estimates the expected future reward for an
agent in a given state under a specific policy.
• The agent aims to maximize this value function to achieve long-term reward.
• The value-based approach aims to find the optimal value function, which is the maximum value achievable at a state under any policy.
• The value function therefore gives the long-term return the agent can expect from any state s under policy π.
• Popular algorithms in this category include Q-Learning, SARSA, and Deep Q-Networks
(DQN).
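As an illustration of the value-based idea, here is a compact sketch of tabular Q-learning (one of the algorithms named above). The tiny corridor environment, its +1 goal reward, and the hyperparameter values are illustrative assumptions, not something specified in the slides.

```python
import random
from collections import defaultdict

# Q-learning on a hypothetical 1-D corridor: states 0..5, goal at 5.
ACTIONS = [-1, +1]                      # move left / move right
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # illustrative hyperparameters
Q = defaultdict(float)                  # Q[(state, action)] action-value table

def step(state, action):
    next_state = max(0, min(5, state + action))
    reward = 1.0 if next_state == 5 else 0.0
    return next_state, reward, next_state == 5

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# Learned greedy policy: should move right (+1) from every non-goal state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(5)})
```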
Approaches to implement Reinforcement Learning
Policy-Based
• Directly learns the policy function, which maps states to actions.
• The policy-based approach finds the optimal policy for maximum future reward without using a value function.
• In this approach, the agent tries to apply such a policy that the action performed in each
step helps to maximize the future reward.
• The policy-based approach has mainly two types of policy:
• Deterministic: The same action is produced by the policy (π) at any state.
• Stochastic: In this policy, probability determines the produced action.
• The goal is to find the optimal policy that leads to the highest expected future rewards.
• Examples of policy-based methods include REINFORCE, Proximal Policy Optimization
(PPO), and Actor-Critic methods.
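A minimal sketch of the two policy types described above, assuming a small discrete action space (the state encoding and preference numbers are made up for illustration):

```python
import math
import random

ACTIONS = ["left", "right", "stay"]

def deterministic_policy(state):
    # Deterministic: the same action is always produced for a given state.
    return "right" if state < 3 else "stay"

def stochastic_policy(state, preferences):
    # Stochastic: a probability distribution over actions (softmax of preferences).
    exps = [math.exp(p) for p in preferences]
    probs = [e / sum(exps) for e in exps]
    return random.choices(ACTIONS, weights=probs)[0]

print(deterministic_policy(1))                              # always "right"
print(stochastic_policy(1, preferences=[0.1, 2.0, 0.5]))    # usually "right", sometimes others
```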
Approaches to implement Reinforcement Learning
Model-Based
• In the model-based approach, a virtual model is created for the environment,
and the agent explores that environment to learn it.
• There is no particular solution or algorithm for this approach because the
model representation is different for each environment.
• Attempts to learn a model of the environment dynamics.
• This model predicts the next state and reward for a given state-action pair.
• The agent can then use this model to plan and simulate actions in a virtual environment
before taking them in the real world.
• While conceptually appealing, this approach can be computationally expensive for
complex environments and often requires additional assumptions about the
environment’s behavior.
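A rough sketch of the model-based idea, assuming a small deterministic environment: the agent records the observed (state, action) → (reward, next state) transitions as its model and then uses that model to simulate outcomes before acting. Everything here (the toy model, the one-step lookahead planner) is an illustrative assumption, not an algorithm given in the slides.

```python
# Learned model: maps (state, action) to the observed (reward, next_state).
model = {}

def record_experience(state, action, reward, next_state):
    model[(state, action)] = (reward, next_state)   # learn the environment dynamics

def plan_one_step(state, actions, value_estimate):
    # Use the learned model to simulate each action and pick the best predicted outcome.
    best_action, best_score = None, float("-inf")
    for a in actions:
        if (state, a) not in model:
            continue                                 # unknown transition: cannot simulate it
        reward, next_state = model[(state, a)]
        score = reward + value_estimate.get(next_state, 0.0)
        if score > best_score:
            best_action, best_score = a, score
    return best_action

# Usage: after observing a few real transitions, plan without touching the real environment.
record_experience(0, "right", 0.0, 1)
record_experience(1, "right", 1.0, 2)
print(plan_one_step(1, ["left", "right"], value_estimate={2: 0.0}))  # -> "right"
```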
Approaches to implement Reinforcement Learning
The choice of approach depends on several factors, including:
• The complexity of the environment:
• For simpler environments, value-based methods might be sufficient.
• Complex environments might benefit from policy-based or model-based approaches
(if feasible).
• Availability of computational resources: Model-based approaches can be
computationally expensive.
• The desired level of interpretability: Value-based methods often offer more
interpretability compared to policy-based methods.
Maze Example:
Types of Reinforcement learning
There are mainly two types of reinforcement learning:
• Positive Reinforcement
• Negative Reinforcement

Positive Reinforcement:
• Positive reinforcement means adding something to increase the likelihood that the expected behavior will occur again.
• It impacts the behavior of the agent positively and increases the strength of that behavior.
• This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.

Negative Reinforcement:
• Negative reinforcement is the opposite of positive reinforcement: it increases the likelihood that a specific behavior will occur again by avoiding or removing a negative condition.
• It can be more effective than positive reinforcement, depending on the situation and behavior, but it provides only enough reinforcement to meet the minimum required behavior.
Applications of Reinforcement Learning
• Gaming: Training AI to outperform humans in complex games like chess, Go, and multiplayer
online games.
• Autonomous Vehicles: Developing decision-making systems for self-driving cars, drones, and
other autonomous systems to navigate and operate safely.
• Robotics: Teaching robots to perform tasks such as assembly, walking, and complex
manipulation through adaptive learning.
• Finance: Enhancing strategies in trading, portfolio management, and risk assessment.
• Healthcare: Personalizing medical treatments, managing patient care, and assisting in surgeries
with robotic systems.
• Supply Chain Management: Optimizing logistics, inventory management, and distribution
networks.
• Energy Management: Managing and distributing renewable energy in smart grids to enhance
efficiency and sustainability.
• Advertising: Optimizing ad placements and bidding strategies in real-time to maximize
engagement and revenue.
• Manufacturing: Automating and optimizing production lines and processes.
Applications of Reinforcement Learning
• Education: Developing adaptive learning technologies that personalize content and pacing
according to the learner's needs.
• Natural Language Processing: Training dialogue agents and chatbots to improve
interaction capabilities.
• Entertainment: Creating more interactive and engaging AI characters and scenarios in
virtual reality (VR) and video games.
• E-commerce: Implementing dynamic pricing, personalized recommendations, and customer
experience enhancements.
• Environmental Protection: Managing and controlling systems for pollution control, wildlife
conservation, and sustainable exploitation of resources.
• Telecommunications: Network optimization, including traffic routing and resource allocation
• Netflix item-based recommender systems: Images related to movies/shows are shown to users in such a way that they are more likely to watch them.
• Bidding and Stock Exchange: Predicting stocks based on current stock-price data.
• Traffic Light Control: Predicting the delay in the signal.
Exploration and Exploitation

• Reinforcement learning is like trial-and-error learning
• The agent should discover a good policy
  • From its experiences of the environment
  • Without losing too much reward along the way
• Exploration adds more information about the environment
• Exploitation exploits known information to maximise reward
• It is usually important to explore as well as exploit
The Exploration/Exploitation Dilemma
Online decision-making involves a fundamental choice:
• Exploitation: Make the best decision given current information
• Exploration: Gather more information
The best long-term strategy may involve short-term sacrifices: gather enough information to make the best overall decisions.

Exploration and Exploitation
• Restaurant Selection
  • Exploitation: Go to your favorite restaurant
  • Exploration: Try a new restaurant
• Online Banner Advertisements
  • Exploitation: Show the most successful advert
  • Exploration: Show a different advert
• Oil Drilling
  • Exploitation: Drill at the best known location
  • Exploration: Drill at a new location
• Game Playing
  • Exploitation: Play the move you believe is best
  • Exploration: Play an experimental move
Multi-armed Bandit

• The term bandit comes from gambling, where slot machines can be thought of as one-armed bandits.
• Problem: which slot machine should we play at each turn when their payoffs are not necessarily the same and are initially unknown?
k-armed Bandit
• Choosing slot machine levers to pull.
• Doctor choosing an experimental prescription.

• Objective is to maximize reward over a given number of time steps


k-armed Bandit
• On each of a sequence of time steps, t = 1, 2, 3, …,
• you choose an action At from k possibilities and receive a real-valued reward Rt
• The value of an action is its expected reward; these true values are unknown, and the reward distribution is unknown
• Nevertheless, you must maximize your total reward
• You must both try actions to learn their values (explore), and prefer those that appear best (exploit)
k-armed Bandit
k-Armed Bandits: Reward Distributions
• Stationary and non-stationary rewards
• Do the rewards of pulling "Lever 2" change over time?
Maximizing Rewards: Action-value Methods

• Through repeated actions, you maximize your winnings by pulling the best levers.
• How do you find the best levers?
• By estimating the value of each action.
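The standard action-value estimate behind these methods (the formula itself is not shown on the slide) is the sample average of the rewards received for each action:

Qt(a) = (sum of rewards received when action a was taken prior to t) / (number of times action a was taken prior to t)

For a stationary problem, Qt(a) converges to the action's true value as the action is taken more and more often.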
Greedy Action Selection Rule

• Greedy action selection always exploits current knowledge to maximize immediate reward: At = argmax_a Qt(a)
• It spends no time at all sampling apparently inferior actions to see if they might really be better.
ε-greedy Methods
• In greedy action selection, you always exploit.
• In ε-greedy, you are usually greedy, but with probability ε you instead pick an action at random (possibly the greedy action again).
• This is perhaps the simplest way to balance exploration and exploitation.
  • With probability ε: exploration
  • With probability 1 − ε: exploitation
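A minimal sketch of ε-greedy action selection in Python (the Q-value estimates and the ε value are illustrative assumptions):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore: random arm
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit: greedy arm

# Usage with made-up estimates for a 4-armed bandit:
print(epsilon_greedy([0.2, 1.5, 0.7, 0.9], epsilon=0.1))   # usually arm 1, occasionally random
```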
Greedy vs. ε -greedy Action Selection
Regret
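The slide's definition is not reproduced here; the standard notion (e.g., from David Silver's RL course) is roughly that regret is the opportunity loss from not having always played the optimal action:

Lt = E[ sum over τ = 1..t of ( v* − q(Aτ) ) ]

where v* is the value of the best action and q(Aτ) the value of the action actually taken. Good bandit algorithms keep total regret growing slowly (sub-linearly) with t.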
Efficient Sample-Averaging
• Constant memory and constant computation per step
• If the average over 5 samples is 8, you only need to keep the count (5) and the sum (40, or equivalently the average) to update the average in the future.
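This works because of the standard incremental (running-average) update, where Rn is the n-th reward and Qn is the average of the first n − 1 rewards:

Qn+1 = Qn + (1/n) (Rn − Qn)

A short sketch, using the slide's numbers plus a made-up 6th reward:

```python
def update_average(q, n, reward):
    """Incrementally update a sample-average estimate q after the n-th reward."""
    return q + (reward - q) / n

# Average of 5 samples is 8 (sum 40); a 6th reward of 14 arrives.
q = update_average(8.0, 6, 14.0)
print(q)   # 9.0  == (40 + 14) / 6, without storing the individual samples
```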
Greedy and ε-greedy action-value methods: 10-Armed Testbed
Greedy vs. ε-Greedy Action Selection

• Greedy:
  • If the reward variance = 0, greedy selection knows the true value of each action after trying it once.
• ε-Greedy:
  • With noisier rewards, it takes more exploration (e.g., variance = 10 vs. 1).
  • Continued exploration is also needed when rewards are non-stationary.
Simple Bandit Algorithm
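The slide's pseudocode (the standard simple bandit algorithm with incremental sample averages, as in Sutton & Barto) is not reproduced here; the following Python sketch is one way to write it, with the bandit's true values drawn randomly for illustration:

```python
import random

def simple_bandit(k=10, steps=1000, epsilon=0.1):
    true_values = [random.gauss(0, 1) for _ in range(k)]  # hidden true value q*(a) of each arm
    Q = [0.0] * k                                         # value estimates
    N = [0] * k                                           # action counts
    for t in range(steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.randrange(k)
        else:
            a = max(range(k), key=lambda i: Q[i])
        reward = random.gauss(true_values[a], 1)          # noisy reward for the chosen arm
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]                    # incremental sample-average update
    return Q, true_values

Q, true_values = simple_bandit()
print("best arm (estimated):", max(range(10), key=lambda i: Q[i]))
print("best arm (true):     ", max(range(10), key=lambda i: true_values[i]))
```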
Adjusting step-size for Non-Stationary Rewards
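The usual adjustment (the slide's formula is not reproduced here) is to replace the 1/n step size with a constant α in (0, 1]:

Qn+1 = Qn + α (Rn − Qn)

This produces an exponential, recency-weighted average, so recent rewards count more than old ones, which is what you want when the reward distribution drifts over time.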
Optimistic Initial Values: Initialization of Action-Values
• The initial values are hyperparameters of the algorithm.
• They are a way to supply prior knowledge of reward expectations.
• Optimistic initialization encourages exploration.
• Example: initializing the estimates to +5 in the testbed case, where the true values have mean 0 and variance 1.
• Effective for stationary problems, but it does not help much with non-stationary rewards.
Upper-Confidence-Bound Action Selection
• A clever way of reducing exploration over time
• Focus on actions whose estimate has a large degree of uncertainty
• Estimate an upper bound on the true action values
• Select the action with the largest (estimated) upper bound
• How should we select amongst non-greedy actions?
• It would be better to weigh the actions based on how many times they have been tested and their previous returns.
Upper-Confidence-Bound Action Selection

• UCB selects the action At = argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]
• ln t : the natural logarithm of t
• Nt(a) : the number of times that action a has been selected prior to time t
• The number c > 0 controls the degree of exploration.
• If Nt(a) = 0, then a is considered to be a maximizing action.
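A minimal Python sketch of this selection rule (the estimates and counts below are illustrative assumptions):

```python
import math

def ucb_select(q_values, counts, t, c=2.0):
    """Upper-Confidence-Bound selection: argmax of Q_t(a) + c * sqrt(ln t / N_t(a))."""
    ucb_scores = []
    for a, (q, n) in enumerate(zip(q_values, counts)):
        if n == 0:
            return a                      # untried actions are treated as maximizing
        ucb_scores.append(q + c * math.sqrt(math.log(t) / n))
    return max(range(len(ucb_scores)), key=lambda a: ucb_scores[a])

# Usage with made-up estimates after t = 20 steps of a 4-armed bandit:
print(ucb_select(q_values=[0.4, 0.9, 0.6, 0.1], counts=[6, 10, 3, 1], t=20))
```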
10-Armed Testbed: UCB vs. ε-Greedy

• UCB is more difficult than ε-greedy to extend beyond bandits to more general RL problems
  (non-stationary problems, large state spaces).
Gradient Bandit Algorithm
• Learns a numerical preference Ht(a) for each action; only the relative preferences of one action over another matter.
• The probability of taking action a at time t is given by a softmax distribution over the preferences:
  πt(a) = exp(Ht(a)) / Σ_b exp(Ht(b))
• Initially H1(a) = 0 for all a, so that all actions have an equal probability of being selected.
• The preferences are updated with stochastic gradient ascent:
  Ht+1(At) = Ht(At) + α (Rt − R̄t) (1 − πt(At))     for the selected action At
  Ht+1(a) = Ht(a) − α (Rt − R̄t) πt(a)              for all a ≠ At
  where α is a step size and R̄t is the average of the rewards up to time t (the baseline).
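A compact Python sketch of these updates (the bandit itself and the step size are illustrative assumptions):

```python
import math
import random

def gradient_bandit(reward_fn, k=3, steps=1000, alpha=0.1):
    H = [0.0] * k            # action preferences, initially equal
    baseline, t = 0.0, 0
    for _ in range(steps):
        # softmax action probabilities from the preferences
        exps = [math.exp(h) for h in H]
        pi = [e / sum(exps) for e in exps]
        a = random.choices(range(k), weights=pi)[0]
        reward = reward_fn(a)
        t += 1
        baseline += (reward - baseline) / t            # running average reward (baseline)
        # stochastic gradient ascent on the preferences
        for b in range(k):
            if b == a:
                H[b] += alpha * (reward - baseline) * (1 - pi[b])
            else:
                H[b] -= alpha * (reward - baseline) * pi[b]
    return H

# Usage: a made-up 3-armed bandit whose arms pay 0.0, 0.5, and 1.0 on average.
H = gradient_bandit(lambda a: random.gauss([0.0, 0.5, 1.0][a], 1.0))
print(H)   # the preference for arm 2 should end up largest
```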
Contextual Bandits
Thank you
