RL Class Mtech
Q-learning is a machine learning approach that enables a model to iteratively learn and improve over
time by taking the correct action.
Q-learning is a type of reinforcement learning.
With reinforcement learning, a machine learning model is trained to mimic the way animals or
children learn.
Good actions are rewarded or reinforced, while bad actions are discouraged and penalized.
With the state-action-reward-state-action (SARSA) form of reinforcement learning, the training regimen
follows a policy, or set of rules, that guides the model to take the right actions.
Q-learning provides a model-free approach to reinforcement learning.
There is no model of the environment to guide the reinforcement learning process.
The agent -- which is the AI component that acts in the environment -- iteratively learns and makes
predictions about the environment on its own.
Q-learning also takes an off-policy approach to reinforcement learning.
A Q-learning approach aims to determine the optimal action based on its current state.
The Q-learning approach can accomplish this by either developing its own set of rules or deviating
from the prescribed policy.
Because Q-learning may deviate from the given policy, a defined policy is not needed.
The off-policy approach in Q-learning is achieved using Q-values -- also known as action values.
The Q-values are the expected future rewards for an action taken in a given state and are stored in the Q-table.
Q-Learning
Q-learning models operate in an iterative process that involves multiple components working together
to help train a model.
The iterative process involves the agent learning by exploring the environment and updating the
model as the exploration continues.
The multiple components of Q-learning include the following:
• Agents. The agent is the entity that acts and operates within an environment.
• States. The state is a variable that identifies the agent's current position in the environment.
• Actions. The action is the agent's operation when it is in a specific state.
• Rewards. A foundational concept within reinforcement learning is the concept of providing either
a positive or a negative response for the agent's actions.
• Episodes. An episode ends when the agent reaches a terminal state and can no longer take a new
action.
• Q-values. The Q-value is the metric used to measure an action at a particular state.
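To make these components concrete, here is a minimal sketch, assuming a hypothetical one-dimensional corridor environment (the class name, corridor length and reward values are illustrative, not taken from the text). It maps agent, state, action, reward and episode termination onto code:

# Hypothetical 1-D corridor environment used only to map the Q-learning
# vocabulary (state, action, reward, episode) onto code.
class CorridorEnv:
    """The agent starts at cell 0 and tries to reach the rightmost goal cell."""

    def __init__(self, n_states=5):
        self.n_states = n_states          # states: positions 0 .. n_states-1
        self.actions = [0, 1]             # actions: 0 = move left, 1 = move right
        self.state = 0

    def reset(self):
        # Start a new episode at the leftmost cell.
        self.state = 0
        return self.state

    def step(self, action):
        # Move left or right, clipped to the corridor bounds.
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.n_states - 1, self.state + move))
        done = self.state == self.n_states - 1   # episode terminates at the goal state
        reward = 1.0 if done else 0.0            # positive reward only for reaching the goal
        return self.state, reward, done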
Q-Learning
Here are the two methods to determine the Q-value:
• Temporal difference. The temporal difference formula calculates the Q-value by comparing the
value of the current state and action with the value of the previous state and action.
• Bellman's equation. Mathematician Richard Bellman invented this equation in 1957 as a
recursive formula for optimal decision-making. In the Q-learning context, Bellman's equation is
used to help calculate the value of a given state and assess its relative position. The state with
the highest value is considered the optimal state.
Q-learning models work through trial-and-error experiences to learn the optimal behavior for a task.
The Q-learning process involves modeling optimal behavior by learning an optimal action value
function, or Q-function.
This function represents the optimal long-term value of taking action a in state s, assuming the agent
then follows optimal behavior in every subsequent state.
Q-Learning
Bellman's equation, written as the Q-learning update rule
Q(s,a) = Q(s,a) + α * (r + γ * max(Q(s',a')) - Q(s,a))
The equation breaks down as follows:
•Q(s, a) represents the expected reward for taking action a in state s.
•The actual reward received for that action is referenced by r while s' refers to the next state.
•The learning rate is α and γ is the discount factor.
•The highest expected reward for all possible actions a' in state s' is represented by max(Q(s', a')).
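As a minimal sketch, the update rule above can be written directly in Python with NumPy; the learning rate and discount factor values below are illustrative assumptions, not values from the text:

import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])   # best expected value from the next state
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s,a) toward the target
    return Q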
Q-Table
The Q-table consists of rows and columns that list the expected rewards for the best actions at each
state in a specific environment.
A Q-table helps an agent understand what actions are likely to lead to positive outcomes in different
situations.
The table rows represent different situations the agent might encounter, and the columns represent
the actions it can take. As the agent interacts with the environment and receives feedback in the form
of rewards or penalties, the values in the Q-table are updated to reflect what the model has learned.
The purpose of reinforcement learning is to gradually improve performance through the Q-table to
help choose actions. With more feedback, the Q-table becomes more accurate so the agent can
make better decisions and achieve optimal results.
The Q-table is directly related to the concept of the Q-function. The Q-function is a mathematical
function that takes the current state of the environment and the action under consideration as
inputs. The Q-function then outputs the expected future reward for that action in the specific state.
The Q-table allows the agent to look up the expected future reward for any given state-action pair
and move toward an optimized state.
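A minimal sketch of a Q-table as a NumPy array follows, assuming a hypothetical environment with 5 states and 2 actions; the lookup shows how the agent reads the expected future reward for a state-action pair:

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))       # rows = states, columns = actions; no rewards known yet

state = 3
best_action = int(np.argmax(Q[state]))    # action with the highest expected future reward in this state
expected_return = Q[state, best_action]   # the Q-value the agent expects for that choice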
Q-Learning Process
The Q-learning algorithm is an iterative process in which the agent learns by exploring the
environment and updating the Q-table based on the rewards received.
The steps involved in the Q-learning algorithm process include the following:
•Q-table initialization. The first step is to create the Q-table as a place to track each action in each
state and the associated progress.
•Observation. The agent needs to observe the current state of the environment.
•Action. The agent chooses to act in the environment. Upon completion of the action, the model
observes if the action is beneficial in the environment.
•Update. After the action has been taken, it's time to update the Q-table with the results.
•Repeat. Repeat steps 2-4 until the model reaches a termination state for a desired objective.
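The following sketch ties steps 1-5 together in a simple training loop. It reuses the hypothetical CorridorEnv and q_update functions sketched earlier; the epsilon-greedy exploration rate and episode count are illustrative choices, not values from the text:

import numpy as np

env = CorridorEnv()
Q = np.zeros((env.n_states, len(env.actions)))   # step 1: initialize the Q-table

epsilon = 0.1                                    # exploration rate (illustrative value)
for episode in range(500):                       # step 5: repeat over many episodes
    state = env.reset()                          # step 2: observe the current state
    done = False
    while not done:
        if np.random.rand() < epsilon:           # step 3: act (explore occasionally ...)
            action = int(np.random.choice(env.actions))
        else:                                    # ... otherwise exploit the current Q-table)
            action = int(np.argmax(Q[state]))
        next_state, reward, done = env.step(action)
        Q = q_update(Q, state, action, reward, next_state)   # step 4: update the Q-table
        state = next_state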
Q-Learning Advantages
The Q-learning approach to reinforcement learning can potentially be advantageous for several
reasons, including the following:
•Model-free. The model-free approach is the foundation of Q-learning and one of the biggest
potential advantages for some uses. Rather than requiring prior knowledge about an environment,
the Q-learning agent can learn about the environment as it trains. The model-free approach is
particularly beneficial for scenarios where the underlying dynamics of an environment are difficult to
model or completely unknown.
•Off-policy optimization. The model can optimize to get the best possible result without being
strictly tethered to a policy that might not enable the same degree of optimization.
•Flexibility. The model-free, off-policy approach enables Q-learning flexibility to work across a variety
of problems and environments.
•Offline training. A Q-learning model can be trained on pre-collected, offline data sets before being deployed.
Q-Learning Disadvantages
The Q-learning approach to reinforcement learning also has some disadvantages,
such as the following:
•Exploration vs. exploitation tradeoff. It can be hard for a Q-learning model to find the right
balance between trying new actions and sticking with what's already known. It's a dilemma that is
commonly referred to as the exploration vs. exploitation tradeoff for reinforcement learning.
•Curse of dimensionality. Q-learning can potentially face a machine learning risk known as the
curse of dimensionality. The curse of dimensionality is a problem with high-dimensional data where
the amount of data required to represent the distribution increases exponentially. This can lead to
computational challenges and decreased accuracy.
•Overestimation. A Q-learning model can sometimes be too optimistic and overestimate how good a
particular action or strategy is.
•Performance. A Q-learning model can take a long time to figure out the best method if there are
several ways to approach a problem.
Q-Learning
Q-learning models can improve processes in various scenarios. Here are a few examples of Q-
learning uses:
•Energy management. Q-learning models help manage energy for different resources such as
electricity, gas and water utilities. A 2022 report from IEEE provides a precise approach for
integrating a Q-learning model for energy management.
•Finance. A Q-learning-based training model can build models for decision-making assistance, such
as determining optimal moments to buy or sell assets.
•Gaming. Q-learning models can train gaming systems to achieve an expert level of proficiency in
playing a wide range of games as the model learns the optimal strategy to advance.
•Recommendation systems. Q-learning models can help optimize recommendation systems, such
as advertising platforms. For example, an ad system that recommends products commonly bought
together can be optimized based on what users select.
•Robotics. Q-learning models can help train robots to execute various tasks, such as object
manipulation, obstacle avoidance and transportation.
•Self-driving cars. Autonomous vehicles use many different models, and Q-learning models help
train models to make driving decisions, such as when to switch lanes or stop.
•Supply chain management. The flow of goods and services as part of supply chain management
can be improved with Q-learning models to help find the optimized path for products to market.
Deep Q-Networks
In deep Q-learning, we use a neural network to approximate the Q-value function.
The state is given as the input, and the Q-values of all possible actions are generated as
the output.
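A minimal PyTorch sketch of this idea follows; the layer sizes, state dimension and choice of PyTorch are assumptions made for illustration only:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """State vector in, one Q-value per action out."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output per possible action
        )

    def forward(self, state):
        return self.net(state)              # Q-values for all actions in this state

q_net = QNetwork(state_dim=4, n_actions=2)   # illustrative dimensions
q_values = q_net(torch.zeros(1, 4))          # shape (1, 2): a Q-value for each action
action = int(q_values.argmax(dim=1))         # pick the action with the highest Q-value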
(Figure: comparison between Q-learning and deep Q-learning.)
Deep Q-learning networks
The steps involved in reinforcement learning using deep Q-learning networks (DQNs) are as follows:
1. All past experience is stored in a replay memory.
2. The next action is determined by the maximum output of the Q-network.
3. The loss function here is the mean squared error between the predicted Q-value and the target Q-value, Q*.
This is basically a regression problem. However, we do not know the target or actual value here,
as we are dealing with a reinforcement learning problem. Going back to the Q-value update
equation derived from the Bellman equation, the target is:

Target = r + γ * max(Q(s',a'))

This target is produced by the network itself, so we can argue that the network is predicting its own
value; but since r is the unbiased true reward, the network will update its gradients using
backpropagation and eventually converge.
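Below is a hypothetical sketch of one DQN update step combining the three points above: sampling from replay memory, building the target r + γ * max(Q(s',a')), and regressing the predicted Q-value onto it with a mean squared error loss. It reuses the QNetwork sketched earlier; the buffer format, batch size and optimizer settings are illustrative assumptions:

import random
import torch
import torch.nn.functional as F

memory = []   # replay memory of (state, action, reward, next_state, done) tuples, each stored as a tensor
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(batch_size=32):
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)                   # sample past experience
    states, actions, rewards, next_states, dones = map(torch.stack, zip(*batch))

    q_pred = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)  # predicted Q(s,a)
    with torch.no_grad():                                       # the target is not backpropagated through
        q_next = q_net(next_states).max(dim=1).values           # max_a' Q(s',a')
        q_target = rewards + gamma * q_next * (1 - dones)       # zero future value at episode end

    loss = F.mse_loss(q_pred, q_target)   # mean squared error between prediction and target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()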
Real World Applications of RL
• Automated Robots
• Natural Language Processing
• Marketing and Advertising
• Image Processing
• Recommendation Systems
• Gaming
• Energy Conservation
• Traffic Control
• Healthcare
Multi-Agent Reinforcement Learning