Reinforcement Learning Toolbox Reference
R2022b
Contents

1 Apps
2 Functions
3 Objects
4 Blocks
1 Apps

Reinforcement Learning Designer
Design, train, and simulate reinforcement learning agents
Description
The Reinforcement Learning Designer app lets you design, train, and simulate agents for existing
environments.
Limitations
The following features are not supported in the Reinforcement Learning Designer app.
• Multi-agent systems
• Q, SARSA, PG, AC, and SAC agents
• Custom agents
• Agents relying on table or custom basis function representations
If your application requires any of these features, then design, train, and simulate your agent at the
command line.
Examples
• “Create MATLAB Environments for Reinforcement Learning Designer”
• “Create Simulink Environments for Reinforcement Learning Designer”
• “Create Agents Using Reinforcement Learning Designer”
• “Design and Train Agent Using Reinforcement Learning Designer”
Programmatic Use
reinforcementLearningDesigner opens the Reinforcement Learning Designer app. You can
then import an environment and start the design process, or open a saved design session.
Version History
Introduced in R2021a
See Also
Apps
Deep Network Designer | Simulation Data Inspector
Functions
rlDQNAgent | rlDDPGAgent | rlTD3Agent | rlPPOAgent | analyzeNetwork
Topics
“Create MATLAB Environments for Reinforcement Learning Designer”
“Create Simulink Environments for Reinforcement Learning Designer”
“Create Agents Using Reinforcement Learning Designer”
“Design and Train Agent Using Reinforcement Learning Designer”
“What Is Reinforcement Learning?”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
2 Functions
accelerate
Package: rl.function
Option to accelerate computation of gradient for approximator object based on neural network
Syntax
newAppx = accelerate(oldAppx,useAcceleration)
Description
newAppx = accelerate(oldAppx,useAcceleration) returns the new neural-network-based
function approximator object newAppx, which has the same configuration as the original object,
oldAppx, and the option to accelerate the gradient computation set to the logical value
useAcceleration.
Examples
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space with two channels. The first channel carries an observation from a continuous
four-dimensional space. The second carries a discrete scalar observation that can be either zero or
one. Finally, the action space is a three-dimensional vector in a continuous action space.
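A sketch of specification objects consistent with this description (variable names are assumptions):
obsInfo = [rlNumericSpec([4 1]) rlFiniteSetSpec([0 1])];
actInfo = rlNumericSpec([3 1]);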
To approximate the Q-value function within the critic, create a recurrent deep neural network. The
output layer must be a scalar expressing the value of executing the action given the observation.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel. Since the network
is recurrent, use sequenceInputLayer as the input layer and include an lstmLayer as one of the
other network layers.
% Define paths
inPath1 = [ sequenceInputLayer( ...
prod(obsInfo(1).Dimension), ...
Name="netObsIn1")
fullyConnectedLayer(5,Name="infc1") ];
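A sketch of the remaining input paths and the common output path, with layer names matching the connectLayers calls below (layer sizes and the LSTM width are assumptions):
inPath2 = [ sequenceInputLayer( ...
                prod(obsInfo(2).Dimension), ...
                Name="netObsIn2")
            fullyConnectedLayer(5,Name="infc2") ];
inPath3 = [ sequenceInputLayer( ...
                prod(actInfo(1).Dimension), ...
                Name="netActIn")
            fullyConnectedLayer(5,Name="infc3") ];
% Common path: concatenate the three inputs, add recurrence,
% and output a single Q-value
outPath = [ concatenationLayer(1,3,Name="cct")
            lstmLayer(8)
            fullyConnectedLayer(1) ];
% Assemble the layer graph
net = layerGraph(inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,inPath3);
net = addLayers(net,outPath);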
% Connect layers
net = connectLayers(net,"infc1","cct/in1");
net = connectLayers(net,"infc2","cct/in2");
net = connectLayers(net,"infc3","cct/in3");
% Plot network
plot(net)
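Presumably, the network is then converted to a dlnetwork object and summarized, which produces the display below:
net = dlnetwork(net);
summary(net)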
Initialized: true
Inputs:
1 'netObsIn1' Sequence input with 4 dimensions
2 'netObsIn2' Sequence input with 1 dimensions
3 'netActIn' Sequence input with 3 dimensions
Create the critic with rlQValueFunction, using the network, and the observation and action
specification objects.
critic = rlQValueFunction(net, ...
obsInfo, ...
actInfo, ...
ObservationInputNames=["netObsIn1","netObsIn2"], ...
ActionInputNames="netActIn");
To return the value of the actions as a function of the current observation, use getValue or
evaluate.
val = evaluate(critic, ...
{ rand(obsInfo(1).Dimension), ...
rand(obsInfo(2).Dimension), ...
rand(actInfo(1).Dimension) })
When you use evaluate, the result is a single-element cell array containing the value of the action in
the input, given the observation.
val{1}
ans = single
0.1360
Calculate the gradients of the sum of the three outputs with respect to the inputs, given a random
observation.
gro = gradient(critic,"output-input", ...
{ rand(obsInfo(1).Dimension) , ...
rand(obsInfo(2).Dimension) , ...
rand(actInfo(1).Dimension) } )
The result is a cell array with as many elements as the number of input channels. Each element
contains the derivatives of the sum of the outputs with respect to each component of the input
channel. Display the gradient with respect to the element of the second channel.
gro{2}
ans = single
0.0243
Obtain the gradient with respect to the inputs for a batch of five independent sequences, each one
made of nine sequential observations.
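A sketch of the corresponding call, batching the random inputs along the batch (third) and sequence (fourth) dimensions:
gro_batch = gradient(critic,"output-input", ...
    { rand([obsInfo(1).Dimension 5 9]) , ...
      rand([obsInfo(2).Dimension 5 9]) , ...
      rand([actInfo(1).Dimension 5 9]) } )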
Display the derivative of the sum of the outputs with respect to the third observation element of the
first input channel, after the seventh sequential observation in the fourth independent batch.
gro_batch{1}(3,4,7)
ans = single
0.0108
Use accelerate to enable acceleration of the gradient computation for the critic.

critic = accelerate(critic,true);
Calculate the gradients of the sum of the outputs with respect to the parameters, given a random
observation.
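A sketch of the corresponding call:
grp = gradient(critic,"output-parameters", ...
    { rand(obsInfo(1).Dimension) , ...
      rand(obsInfo(2).Dimension) , ...
      rand(actInfo(1).Dimension) } )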
Each array within a cell contains the gradient of the sum of the outputs with respect to a group of
parameters.
{ 5x1 single }
{32x15 single }
{32x8 single }
{32x1 single }
{[2.6325 10.1821 -14.0886 0.4162 2.0677 5.3991 0.3904 -8.9048]}
{[ 45]}
If you use a batch of inputs, gradient uses the whole input sequence (in this case nine steps), and
all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space with two channels. The first channel carries an observation from a continuous
four-dimensional space. The second carries a discrete scalar observation that can be either zero or
one. Finally, the action space consists of a scalar that can be -1, 0, or 1.
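A sketch of specification objects consistent with this description (variable names are assumptions):
obsInfo = [rlNumericSpec([4 1]) rlFiniteSetSpec([0 1])];
actInfo = rlFiniteSetSpec([-1 0 1]);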
Create a deep neural network to be used as approximation model within the actor. The output layer
must have three elements, each one expressing the value of executing the corresponding action,
given the observation. To create a recurrent neural network, use sequenceInputLayer as the input
layer and include an lstmLayer as one of the other network layers.
% Define paths
inPath1 = [ sequenceInputLayer(prod(obsInfo(1).Dimension))
fullyConnectedLayer(prod(actInfo.Dimension),Name="fc1") ];
inPath2 = [ sequenceInputLayer(prod(obsInfo(2).Dimension))
fullyConnectedLayer(prod(actInfo.Dimension),Name="fc2") ];
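A sketch of the common output path and layer graph assembly, with names matching the connectLayers calls below (the LSTM width is an assumption):
% Common path: concatenate the two inputs, add recurrence, and output
% one value per possible action
outPath = [ concatenationLayer(1,2,Name="cct")
            lstmLayer(8)
            fullyConnectedLayer(numel(actInfo.Elements)) ];
% Assemble the layer graph
net = layerGraph(inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,outPath);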
% Connect layers
net = connectLayers(net,"fc1","cct/in1");
net = connectLayers(net,"fc2","cct/in2");
% Plot network
plot(net)
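Presumably, the network is then converted to a dlnetwork object and summarized, which produces the display below:
net = dlnetwork(net);
summary(net)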
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
2 'sequenceinput_1' Sequence input with 1 dimensions
Since each element of the output layer must represent the probability of executing one of the possible
actions the software automatically adds a softmaxLayer as a final output layer if you do not specify
it explicitly.
Create the actor with rlDiscreteCategoricalActor, using the network and the observations and
action specification objects. When the network has multiple input layers, they are automatically
associated with the environment observation channels according to the dimension specifications in
obsInfo.
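A sketch of the actor construction and of an evaluation call returning the action probabilities (random observation inputs assumed):
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

% Return the probability of each of the three possible actions
prob = evaluate(actor, ...
    { rand(obsInfo(1).Dimension), ...
      rand(obsInfo(2).Dimension) });
prob{1}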
0.3403
0.3114
0.3483
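A sampled action can then be obtained with getAction (sketch, random inputs assumed):
act = getAction(actor, ...
    { rand(obsInfo(1).Dimension), ...
      rand(obsInfo(2).Dimension) });
act{1}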
ans = 1
Calculate the gradients of the sum of the outputs with respect to the parameters, using a batch of five
independent sequences, each made of nine sequential observations. As before, each array within a cell
contains the gradient of the sum of the outputs with respect to a group of parameters.
grp_batch = gradient(actor,"output-parameters", ...
{ rand([obsInfo(1).Dimension 5 9]) , ...
rand([obsInfo(2).Dimension 5 9])} )
If you use a batch of inputs, the gradient uses the whole input sequence (in this case nine steps),
and all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.
Input Arguments
oldAppx — Function approximator object
function approximator object

Function approximator object, specified as one of the following:
• rlValueFunction,
• rlQValueFunction,
• rlVectorQValueFunction,
• rlDiscreteCategoricalActor,
• rlContinuousDeterministicActor,
• rlContinuousGaussianActor,
• rlContinuousDeterministicTransitionFunction,
• rlContinuousGaussianTransitionFunction,
• rlContinuousDeterministicRewardFunction,
• rlContinuousGaussianRewardFunction,
• rlIsDoneFunction.
useAcceleration — Option to use acceleration
logical

Option to use acceleration for gradient computations, specified as a logical value. When
useAcceleration is true, the gradient computations are accelerated by optimizing and caching
some inputs needed by the automatic-differentiation computation graph. For more information, see
“Deep Learning Function Acceleration for Custom Training Loops”.
Output Arguments
newAppx — Actor or critic
approximator object
New actor or critic, returned as an approximator object with the same type as oldAppx but with the
gradient acceleration option set to useAcceleration.
Version History
Introduced in R2022a
See Also
evaluate | gradient | getLearnableParameters | rlValueFunction | rlQValueFunction |
rlVectorQValueFunction | rlContinuousDeterministicActor |
rlDiscreteCategoricalActor | rlContinuousGaussianActor |
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction
Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
allExperiences
Package: rl.replay
Syntax
experiences = allExperiences(buffer)
experience = allExperiences(buffer,ConcatenateMode=mode)
Description
experiences = allExperiences(buffer) returns all experiences stored in experience buffer
buffer as individual experiences, each with a batch size of 1 and a sequence length of 1.
Examples
Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.
obsContinuous = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[1;5]);
obsDiscrete = rlFiniteSetSpec([1 2 3]);
obsInfo = [obsContinuous obsDiscrete];
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.
actInfo = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[5;10]);
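A sketch of the buffer creation (the capacity value is an assumption):
buffer = rlReplayMemory(obsInfo,actInfo,10000);

Then append 50 random experiences to the buffer using a structure array.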
for i = 1:50
    experience(i).Observation = ...
        {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
    experience(i).Action = {actInfo.UpperLimit.*rand(2,1)};
    experience(i).NextObservation = ...
        {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
    experience(i).Reward = 10*rand(1);
    experience(i).IsDone = 0;
end
append(buffer,experience);
After appending experiences to the buffer, you extract all of the experiences from the buffer. Extract
all of the experiences as individual experiences, each with a batch size of 1 and sequence size of 1.
experience = allExperiences(buffer)
Alternatively, you can extract all of the experiences as a single experience batch.
expBatch = allExperiences(buffer,ConcatenateMode="batch")
Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object
mode — Experience concatenation option
"none" (default) | "batch" | "sequence"

Option to concatenate the returned experiences, specified as one of the following values.
• "none" — Return experience as N individual experiences, each with a batch size of 1 and a
sequence length of 1.
• "batch" — Return experience as a single batch with a sequence length of 1.
• "sequence" — Return experience as a single sequence with a batch size of 1.
Output Arguments
experience — All buffered experiences
structure array | structure
All N buffered experiences, returned as a structure array or structure. When mode is:
• "none", experience is returned as a structure array of length N, where each element contains
one buffered experience (batchSize = 1 and SequenceLength = 1).
• "batch", experience is returned as a structure. Each field of experience contains all buffered
experiences concatenated along the batch dimension (batchSize = N and SequenceLength =
1).
• "sequence", experience is returned as a structure. Each field of experience contains all
buffered experiences concatenated along the batch dimension (batchSize = 1 and
SequenceLength = N).
Observation — Observation
cell array
Observation, returned as a cell array with length equal to the number of observation specifications
specified when creating the buffer. Each element of Observation contains a DO-by-batchSize-by-
SequenceLength array, where DO is the dimension of the corresponding observation specification.
Agent action, returned as a cell array with length equal to the number of action specifications
specified when creating the buffer. Each element of Action contains a DA-by-batchSize-by-
SequenceLength array, where DA is the dimension of the corresponding action specification.
Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-
SequenceLength array.
Next observation reached by taking the specified action from the observation, returned as a cell array
with the same format as Observation.
Version History
Introduced in R2022b
See Also
rlReplayMemory | rlPrioritizedReplayMemory
append
Package: rl.replay
Syntax
append(buffer,experience)
append(buffer,experience,dataSourceID)
Description
append(buffer,experience) appends the experiences in experience to the replay memory
buffer.
Examples
Define observation specifications for the environment. For this example, assume that the environment
has a single observation channel with three continuous signals in specified ranges.
obsInfo = rlNumericSpec([3 1],...
LowerLimit=0,...
UpperLimit=[1;5;10]);
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with two continuous signals in specified ranges.
actInfo = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[5;10]);
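A sketch of the buffer creation, which the append calls below assume (the capacity value is an assumption):
buffer = rlReplayMemory(obsInfo,actInfo,20000);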
Append a single experience to the buffer using a structure. Each experience contains the following
elements: current observation, action, next observation, reward, and is-done.
For this example, create an experience with random observation, action, and reward values. Indicate
that this experience is not a terminal condition by setting the IsDone value to 0.
exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;
append(buffer,exp);
You can also append a batch of experiences to the experience buffer using a structure array. For this
example, append a sequence of 100 random experiences, with the final experience representing a
terminal condition.
for i = 1:100
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;
append(buffer,expBatch);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.
miniBatch = sample(buffer,50);
You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive
experiences with a discount factor of 0.95.
horizonSample = sample(buffer,1,...
NStepHorizon=10,...
DiscountFactor=0.95);
• Observation and Action are the observation and action from the first experience in the
horizon.
• NextObservation and IsDone are the next observation and termination signal from the final
experience in the horizon.
• Reward is the cumulative reward across the horizon using the specified discount factor.
You can also sample a sequence of consecutive experiences. In this case, the structure fields contain
arrays with values for all sampled experiences.
sequenceSample = sample(buffer,1,...
SequenceLength=20);
Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.
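A sketch of the corresponding specification objects, mirroring the earlier example:
obsInfo = [rlNumericSpec([2 1],LowerLimit=0,UpperLimit=[1;5]) ...
           rlFiniteSetSpec([1 2 3])];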
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.
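A sketch of the corresponding action specification:
actInfo = rlNumericSpec([2 1],LowerLimit=0,UpperLimit=[5;10]);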
buffer = rlReplayMemory(obsInfo,actInfo,5000);
for i = 1:50
exp(i).Observation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
exp(i).NextObservation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Reward = 10*rand(1);
exp(i).IsDone = 0;
end
append(buffer,exp);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.
miniBatch = sample(buffer,10);
Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object

experience — Experience to append to buffer
structure | structure array

Experience to append to the buffer, specified as a structure or structure array with the following
fields.
Observation — Observation
cell array
Observation, specified as a cell array with length equal to the number of observation specifications
specified when creating the buffer. The dimensions of each element in Observation must match the
dimensions in the corresponding observation specification.
Action taken by the agent, specified as a cell array with length equal to the number of action
specifications specified when creating the buffer. The dimensions of each element in Action must
match the dimensions in the corresponding action specification.
Reward value obtained by taking the specified action from the starting observation, specified as a
scalar.
Next observation reached by taking the specified action from the starting observation, specified as a
cell array with the same format as Observation.
If experience is a structure array, specify dataSourceID as an array with length equal to the
length of experience. You can specify different data source indices for each element of
experience. If all elements in experience come from the same data source, you can specify
dataSourceID as a scalar integer.
Version History
Introduced in R2022a
See Also
rlReplayMemory | sample
barrierPenalty
Logarithmic barrier penalty value for a point with respect to a bounded region
Syntax
p = barrierPenalty(x,xmin,xmax)
p = barrierPenalty( ___ ,maxValue,curvature)
Description
p = barrierPenalty(x,xmin,xmax) calculates the nonnegative (logarithmic barrier) penalty
vector p for the point x with respect to the region bounded by xmin and xmax. p has the same
dimension as x. This syntax uses the default values of 1 and 0.1 for the maxValue and curvature
parameters of the barrier function, respectively.
Examples
This example shows how to use the logarithmic barrierPenalty function to calculate the barrier
penalty for a given point, with respect to a bounded region.
Calculate the penalty value for the point 0.1 within the interval [-2,2] using default values for the
maximum value and curvature parameters.
barrierPenalty(0.1,-2,2)
ans = 2.5031e-04
Calculate the penalty value for the point 4 outside the interval [-2,2].
barrierPenalty(4,-2,2)
ans = 1
Calculate the penalty value for the point 4 outside the interval [-2,2], using a maximum value
parameter of 5.
barrierPenalty(4,-2,2,5)
ans = 5
Calculate the penalty value for the point 0.1 inside the interval [-2,2], using a curvature parameter of
0.5.
barrierPenalty(0.1,-2,2,5,0.5)
ans = 0.0013
Calculate the penalty value for the point [-2,0,4] with respect to the box defined by [0,1], [-1,1], and
[-2,2] along the x, y, and z dimensions, respectively, using the default value for maximum value and a
curvature parameter of 0.
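A sketch of the corresponding call (column vectors assumed):
barrierPenalty([-2;0;4],[0;-1;-2],[1;1;2],1,0)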
ans = 3×1
1
0
1
x = -5:0.01:5;
Calculate penalties for all the points in the vector, using the default value for the maximum value
parameter and a value of 0.01 for the curvature parameter.
p = barrierPenalty(x,-2,2,1,0.01);
plot(x,p)
grid
xlabel("point position");
ylabel("penalty value");
title("Penalty values over an interval");
Input Arguments
x — Point for which the penalty is calculated
scalar | vector | matrix
Point for which the penalty is calculated, specified as a numeric scalar, vector or matrix.
Example: [0.5; 1.6]
Lower bounds for x, specified as a numeric scalar, vector or matrix. To use the same minimum value
for all elements in x, specify xmin as a scalar.
Example: -1
Upper bounds for x, specified as a numeric scalar, vector or matrix. To use the same maximum value
for all elements in x, specify xmax as a scalar.
Example: 2
Output Arguments
p — Penalty value
nonnegative vector
Penalty value, returned as a vector of nonnegative elements. Each element p(i) depends on the position
of x(i) with respect to the interval specified by xmin(i) and xmax(i). The barrier penalty function returns
the value
p(x) = min( pmax, C*( log( 0.25*(xmax − xmin)^2 ) − log( (x − xmin)*(xmax − x) ) ) )
when xmin < x < xmax, and maxValue otherwise. Here, C is the argument curvature, and pmax is the
argument maxValue. Note that for positive values of C the returned penalty value is always positive.
If C is zero, then the returned penalty is zero inside the interval defined by the bounds, and pmax
outside this interval. If x is multidimensional, then the calculation is applied independently on each
dimension. Penalty functions are typically used to generate negative rewards when constraints are
violated, such as in generateRewardFunction.
Version History
Introduced in R2021b
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
See Also
Functions
generateRewardFunction | exteriorPenalty | hyperbolicPenalty
Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Generate Reward Function from a Model Verification Block for a Water Tank System”
“Define Reward Signals”
bus2RLSpec
Create reinforcement learning data specifications for elements of a Simulink bus
Syntax
specs = bus2RLSpec(busName)
specs = bus2RLSpec(busName,Name,Value)
Description
specs = bus2RLSpec(busName) creates a set of reinforcement learning data specifications from
the Simulink® bus object specified by busName. One specification element is created for each leaf
element in the corresponding Simulink bus. Use these specifications to define actions and
observations for a Simulink reinforcement learning environment.
Examples
This example shows how to use the function bus2RLSpec to create an observation specification
object from a Simulink® bus object.
obsBus = Simulink.Bus();
obsBus.Elements(1) = Simulink.BusElement;
obsBus.Elements(1).Name = 'sin_theta';
obsBus.Elements(2) = Simulink.BusElement;
obsBus.Elements(2).Name = 'cos_theta';
obsBus.Elements(3) = Simulink.BusElement;
obsBus.Elements(3).Name = 'dtheta';
Create the observation specification objects using the Simulink bus object.
obsInfo = bus2RLSpec('obsBus');
You can then use obsInfo, together with the corresponding Simulink model, to create a
reinforcement learning environment. For an example, see “Train DDPG Agent to Swing Up and
Balance Pendulum with Bus Signal”.
This example shows how to call the function bus2RLSpec using name and value pairs to create an
action specification object from a Simulink® bus object.
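A minimal sketch of the action bus, assuming a single leaf element named 'actuator' (mirroring the observation bus above):
actBus = Simulink.Bus();
actBus.Elements(1) = Simulink.BusElement;
actBus.Elements(1).Name = 'actuator';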
Create the action specification objects using the Simulink bus object.
actInfo = bus2RLSpec('actBus','DiscreteElements',{'actuator',[-1 1]});
This specifies that the 'actuator' bus element can carry two possible values, -1 and 1.
You can then use actInfo, together with the corresponding Simulink model, to create a
reinforcement learning environment. Specifically the function that creates the environment uses
actInfo to determine the right bus output of the agent block.
For an example, see “Train DDPG Agent to Swing Up and Balance Pendulum with Bus Signal”.
Input Arguments
busName — Name of Simulink bus object
string | character vector
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'DiscreteElements',{'force',[-5 0 5]} sets the 'force' bus element to be a
discrete data specification with three possible values, -5, 0, and 5
Name of the Simulink model, specified as the comma-separated pair consisting of 'Model' and a
string or character vector. Specify the model name when the bus object is defined in the model global
workspace (for example, in a data dictionary) instead of the MATLAB workspace.
Names of bus leaf elements for which to create specifications, specified as the comma-separated pair
consisting of 'BusElementNames' and a string array. To create observation specifications for a subset
of the elements in a Simulink bus object, specify BusElementNames. If you do not specify
BusElementNames, a data specification is created for each leaf element in the bus.
Note Do not specify BusElementNames when creating specifications for action signals. The RL
Agent block must output the full bus signal.
Finite values for discrete bus elements, specified as the comma-separated pair consisting of
'DiscreteElements' and a cell array of name-value pairs. Each name-value pair consists of a bus
leaf element name and an array of discrete values. The specified discrete values must be castable to
the data type of the specified action signal.
If you do not specify discrete values for an element specification, the element is continuous.
Example: 'DiscreteElements',{'force',[-10 0 10],'torque',[-5 0 5]} specifies
discrete values for the 'force' and 'torque' leaf elements of a bus action signal.
Output Arguments
specs — Data specifications
rlNumericSpec object | rlFiniteSetSpec object | array of data specification objects
Data specifications for reinforcement learning actions or observations, returned as one of the
following:
By default, all data specifications for bus elements are rlNumericSpec objects. To create a discrete
specification for one or more bus elements, specify the element names using the DiscreteElements
name-value pair.
Version History
Introduced in R2019a
See Also
Blocks
RL Agent
Functions
rlSimulinkEnv | createIntegratedEnv | rlNumericSpec | rlFiniteSetSpec
Topics
“Create Simulink Reinforcement Learning Environments”
cleanup
Package: rl.env
Syntax
cleanup(env)
cleanup(lgr)
Description
When you define a custom training loop for reinforcement learning, you can simulate an agent or
policy against an environment using the runEpisode function. Use the cleanup function to clean up
the environment after running simulations using multiple calls to runEpisode. To clean up the
environment after each simulation, you can configure runEpisode to automatically call the cleanup
function at the end of each episode.
Also use cleanup to perform clean up tasks for a FileLogger or MonitorLogger object after
logging data within a custom training loop.
Environment Objects
cleanup(env) cleans up the specified reinforcement learning environment after running multiple
simulations using runEpisode.
Data Logger Objects
cleanup(lgr) cleans up the specified data logger object after logging data within a custom training
loop. This task might involve, for example, transferring any remaining data from the lgr internal
memory to a logging target (either a MAT file or a trainingProgressMonitor object). For an
example, see “Log Data to Disk in a Custom Training Loop”.
Examples
Create a reinforcement learning environment and extract its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the Q-value function within the critic, use a neural network. Create a network as an
array of layer objects.
net = [...
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(2)
softmaxLayer];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'input' 4 features
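A minimal sketch of the policy and experience buffer used below, assuming a vector Q-value critic with an epsilon-greedy policy and an rlReplayMemory buffer (the policy type and buffer capacity are assumptions):
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
policy = rlEpsilonGreedyPolicy(critic);

% Experience buffer used to store episode data
buffer = rlReplayMemory(obsInfo,actInfo,10000);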
Set up the environment for running multiple simulations. For this example, configure the training to
log any errors rather than send them to the command window.
setup(env,StopOnError="off")
Simulate multiple episodes using the environment and policy. After each episode, append the
experiences to the buffer. For this example, run 100 episodes.
for i = 1:100
output = runEpisode(env,policy,MaxSteps=300);
append(buffer,output.AgentData.Experiences)
end
Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.
batch = sample(buffer,10);
You can then learn from the sampled experiences and update the policy and actor.
This example shows how to log data to disk when training an agent using a custom training loop.
Create a FileLogger object by using the rlDataLogger function.

flgr = rlDataLogger();
Set up the logger object. This operation initializes the object performing setup tasks such as, for
example, creating the directory to save the data files.
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10
    % Store random data in the logger memory, grouped in the variables
    % Context1 and Context2 (the store and write call signatures used
    % here are assumptions)
    store(flgr,"Context1",rand(3,1),iter);
    store(flgr,"Context2",rand(2,1),iter);

    % Transfer the data stored for this iteration to a MAT file
    write(flgr);
end
Clean up the logger object. This operation performs cleanup tasks such as, for example, writing to file
any data still in memory.
cleanup(flgr);
Input Arguments
env — Reinforcement learning environment
environment object | ...
If env is a SimulinkEnvWithAgent object and the associated Simulink model is configured to use
fast restart, then cleanup terminates the model compilation.
Version History
Introduced in R2022a
See Also
Functions
runEpisode | setup | reset | store | write
Objects
rlFunctionEnv | rlMDPEnv | SimulinkEnvWithAgent | rlNeuralNetworkEnvironment |
FileLogger | MonitorLogger
Topics
“Custom Training Loop with Simulink Action Noise”
createGridWorld
Create a two-dimensional grid world for reinforcement learning
Syntax
GW = createGridWorld(m,n)
GW = createGridWorld(m,n,moves)
Description
GW = createGridWorld(m,n) creates a grid world GW of size m-by-n with default actions of
['N';'S';'E';'W'].
Examples
For this example, consider a 5-by-5 grid world with the following rules:
1 A 5-by-5 grid world bounded by borders, with 4 possible actions (North = 1, South = 2, East = 3,
West = 4).
2 The agent begins from cell [2,1] (second row, first column).
3 The agent receives reward +10 if it reaches the terminal state at cell [5,5] (blue).
4 The environment contains a special jump from cell [2,4] to cell [4,4] with +5 reward.
5 The agent is blocked by obstacles in cells [3,3], [3,4], [3,5] and [4,3] (black cells).
6 All other actions result in -1 reward.
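A sketch of the call that creates the grid world and produces the display below:
GW = createGridWorld(5,5)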
GW =
GridWorld with properties:
GridSize: [5 5]
CurrentState: "[1,1]"
States: [25x1 string]
Actions: [4x1 string]
T: [25x25x4 double]
R: [25x25x4 double]
ObstacleStates: [0x1 string]
TerminalStates: [0x1 string]
ProbabilityTolerance: 8.8818e-16
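A sketch of the property assignments implied by rules 2, 3, and 5 above:
GW.CurrentState = "[2,1]";
GW.TerminalStates = "[5,5]";
GW.ObstacleStates = ["[3,3]";"[3,4]";"[3,5]";"[4,3]"];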
Update the state transition matrix for the obstacle states and set the jump rule over the obstacle
states.
updateStateTranstionForObstacles(GW)
GW.T(state2idx(GW,"[2,4]"),:,:) = 0;
GW.T(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 1;
nS = numel(GW.States);
nA = numel(GW.Actions);
GW.R = -1*ones(nS,nS,nA);
GW.R(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 5;
GW.R(:,state2idx(GW,GW.TerminalStates),:) = 10;
Now, use rlMDPEnv to create a grid world environment using the GridWorld object GW.
env = rlMDPEnv(GW)
env =
rlMDPEnv with properties:
You can visualize the grid world environment using the plot function.
plot(env)
Input Arguments
m — Number of rows of the grid world
scalar
Output Arguments
GW — Two-dimensional grid world
GridWorld object
Two-dimensional grid world, returned as a GridWorld object with properties listed below. For more
information, see “Create Custom Grid World Environments”.
Action names, specified as a string vector. The length of the Actions vector is determined by the
moves argument.
State transition matrix, specified as a 3-D array, which determines the possible movements of the
agent in an environment. State transition matrix T is a probability matrix that indicates how likely the
agent will move from the current state s to any possible next state s' by performing action a. T is
given by,
T(s,s′,a) = probability(s′ | s,a).
T is a K-by-K-by-A array, where K = m*n is the number of states and A is the number of actions.
Reward transition matrix, specified as a 3-D array, determines how much reward the agent receives
after performing an action in the environment. R has the same shape and size as state transition
matrix T. Reward transition matrix R is given by,
r = R(s,s′,a).
R is a K-by-K-by-A array with the same size as T.
State names that cannot be reached in the grid world, specified as a string vector.
Version History
Introduced in R2019a
See Also
rlMDPEnv | rlPredefinedEnv
Topics
“Create Custom Grid World Environments”
“Train Reinforcement Learning Agent in Basic Grid World”
createIntegratedEnv
Create Simulink model for reinforcement learning, using reference model as environment
Syntax
env = createIntegratedEnv(refModel,newModel)
[env,agentBlock,obsInfo,actInfo] = createIntegratedEnv( ___ )
Description
env = createIntegratedEnv(refModel,newModel) creates a Simulink model with the name
specified by newModel and returns a reinforcement learning environment object, env, for this model.
The new model contains an RL Agent block and uses the reference model refModel as a
reinforcement learning environment for training the agent specified by this block.
Examples
This example shows how to use createIntegratedEnv to create an environment object starting
from a Simulink model that implements the system with which the agent must interact. Such a system is
often referred to as plant, open-loop system, or reference system, while the whole (integrated) system
including the agent is often referred to as the closed-loop system.
For this example, use the flying robot model described in “Train DDPG Agent to Control Flying
Robot” as the reference (open-loop) system.
% sample time
Ts = 0.4;
Create the Simulink model myIntegratedEnv containing the flying robot model connected in a
closed loop to the agent block. The function also returns the reinforcement learning environment
object env to be used for training.
env = createIntegratedEnv('rlFlyingRobotEnv','myIntegratedEnv')
env =
SimulinkEnvWithAgent with properties:
Model : myIntegratedEnv
AgentBlock : myIntegratedEnv/RL Agent
ResetFcn : []
UseFastRestart : on
The function can also return the block path to the RL Agent block in the new integrated model, as
well as the observation and action specifications for the reference model.
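A sketch of the call form that returns these additional outputs:
[env,agentBlk,observationInfo,actionInfo] = ...
    createIntegratedEnv('rlFlyingRobotEnv','myIntegratedEnv')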
agentBlk =
'myIntegratedEnv/RL Agent'
observationInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "observation"
Description: [0x0 string]
Dimension: [7 1]
DataType: "double"
actionInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "action"
Description: [0x0 string]
Dimension: [2 1]
DataType: "double"
Returning the block path and specifications is useful in cases in which you need to modify
descriptions, limits, or names in observationInfo and actionInfo. After modifying the
specifications, you can then create an environment from the integrated model myIntegratedEnv using
the rlSimulinkEnv function.
This example shows how to call createIntegratedEnv using name-value pairs to specify port
names.
The first argument of createIntegratedEnv is the name of the reference Simulink model that
contains the system with which the agent must interact. Such a system is often referred to as plant,
or open-loop system. For this example, the reference system is the model of a water tank.
open_system('rlWatertankOpenloop')
Set the sample time of the discrete integrator block used to generate the observation, so the
simulation can run.
Ts = 1;
The input port is called u (instead of action), and the first and third output ports are called y and
stop (instead of observation and isdone). Specify the port names using name-value pairs.
env = createIntegratedEnv('rlWatertankOpenloop','IntegratedWatertank',...
'ActionPortName','u','ObservationPortName','y','IsDonePortName','stop')
env =
SimulinkEnvWithAgent with properties:
Model : IntegratedWatertank
AgentBlock : IntegratedWatertank/RL Agent
ResetFcn : []
UseFastRestart : on
The new model IntegratedWatertank contains the reference model connected in a closed-loop
with the agent block. The function also returns the reinforcement learning environment object to be
used for training.
Input Arguments
refModel — Reference model name
string | character vector
Reference model name, specified as a string or character vector. This is the Simulink model
implementing the system that the agent needs to interact with. Such a system is often referred to as
plant, open loop system or reference system, while the whole (integrated) system including the agent
is often referred to as the closed loop system. The new Simulink model uses this reference model as
the dynamic model of the environment for reinforcement learning.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'IsDonePortName',"stopSim" sets the stopSim port of the reference model as the
source of the isdone signal.
Reference model observation output port name, specified as the comma-separated pair consisting of
'ObservationPortName' and a string or character vector. Specify ObservationPortName when
the name of the observation output port of the reference model is not "observation".
Reference model action input port name, specified as the comma-separated pair consisting of
'ActionPortName' and a string or character vector. Specify ActionPortName when the name of
the action input port of the reference model is not "action".
Reference model reward output port name, specified as the comma-separated pair consisting of
'RewardPortName' and a string or character vector. Specify RewardPortName when the name of
the reward output port of the reference model is not "reward".
Reference model done flag output port name, specified as the comma-separated pair consisting of
'IsDonePortName' and a string or character vector. Specify IsDonePortName when the name of
the done flag output port of the reference model is not "isdone".
Names of observation bus leaf elements for which to create specifications, specified as a string array.
To create observation specifications for a subset of the elements in a Simulink bus object, specify
BusElementNames. If you do not specify BusElementNames, a data specification is created for each
leaf element in the bus.
ObservationBusElementNames is applicable only when the observation output port is a bus signal.
Example: 'ObservationBusElementNames',["sin" "cos"] creates specifications for the
observation bus elements with the names "sin" and "cos".
Finite values for discrete observation specification elements, specified as the comma-separated pair
consisting of 'ObservationDiscreteElements' and a cell array of name-value pairs. Each name-
value pair consists of an element name and an array of discrete values.
For each name-value pair, if the observation output port is:
• A bus signal, specify the name of one of the leaf elements of the bus specified by
ObservationBusElementNames.
• A nonbus signal, specify the name of the observation port, as specified by ObservationPortName.
The specified discrete values must be castable to the data type of the specified observation signal.
If you do not specify discrete values for an observation specification element, the element is
continuous.
Example: 'ObservationDiscreteElements',{'observation',[-1 0 1]} specifies discrete
values for a nonbus observation signal with default port name observation.
Example: 'ObservationDiscreteElements',{'gear',[-1 0 1 2],'direction',[1 2 3 4]}
specifies discrete values for the 'gear' and 'direction' leaf elements of a bus observation signal.
Finite values for discrete action specification elements, specified as the comma-separated pair
consisting of 'ActionDiscreteElements' and a cell array of name-value pairs. Each name-value
pair consists of an element name and an array of discrete values.
The specified discrete values must be castable to the data type of the specified action signal.
If you do not specify discrete values for an action specification element, the element is continuous.
Example: 'ActionDiscreteElements',{'action',[-1 0 1]} specifies discrete values for a
nonbus action signal with default port name 'action'.
Example: 'ActionDiscreteElements',{'force',[-10 0 10],'torque',[-5 0 5]} specifies
discrete values for the 'force' and 'torque' leaf elements of a bus action signal.
Output Arguments
env — Reinforcement learning environment
SimulinkEnvWithAgent object
Reinforcement learning environment interface, returned as a SimulinkEnvWithAgent object.

agentBlock — Block path to the agent block
character vector

Block path to the agent block in the new model, returned as a character vector. To train an agent in
the new Simulink model, you must create an agent and specify the agent name in the RL Agent block
indicated by agentBlock.
Version History
Introduced in R2019a
See Also
Blocks
RL Agent
Functions
rlSimulinkEnv | bus2RLSpec | rlNumericSpec | rlFiniteSetSpec
Topics
“Create Simulink Reinforcement Learning Environments”
createMDP
Create Markov decision process model
Syntax
MDP = createMDP(states,actions)
Description
MDP = createMDP(states,actions) creates a Markov decision process model with the specified
states and actions.
Examples
Create an MDP model with eight states and two possible actions.
MDP = createMDP(8,["up";"down"]);
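To make the model useful, you then populate the transition and reward arrays, for example (illustrative values; the specific indices are assumptions):
% From state 1, action 1 ("up") moves to state 2 with reward 3
MDP.T(1,2,1) = 1;
MDP.R(1,2,1) = 3;

% From state 1, action 2 ("down") moves to state 3 with reward 1
MDP.T(1,3,2) = 1;
MDP.R(1,3,2) = 1;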
Input Arguments
states — Model states
positive integer | string vector
• Positive integer — Specify the number of model states. In this case, each state has a default name,
such as "s1" for the first state.
• String vector — Specify the state names. In this case, the total number of states is equal to the
length of the vector.
actions — Model actions
positive integer | string vector

• Positive integer — Specify the number of model actions. In this case, each action has a default
name, such as "a1" for the first action.
• String vector — Specify the action names. In this case, the total number of actions is equal to the
length of the vector.
Output Arguments
MDP — MDP model
GenericMDP object
State names, specified as a string vector with length equal to the number of states.
Action names, specified as a string vector with length equal to the number of actions.
State transition matrix, specified as a 3-D array, which determines the possible movements of the
agent in an environment. State transition matrix T is a probability matrix that indicates how likely the
agent will move from the current state s to any possible next state s' by performing action a. T is an
S-by-S-by-A array, where S is the number of states and A is the number of actions. It is given by:
T(s,s′,a) = probability(s′ | s,a).
The transition probabilities out of a nonterminal state s for a given action must sum to one.
Therefore, all stochastic transitions out of a given state must be specified at the same time.
For example, to indicate that in state 1 following action 4 there is an equal probability of moving to
states 2 or 3, use the following:
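A sketch of the corresponding assignment (assuming the model variable is named MDP):
MDP.T(1,[2 3],4) = [0.5 0.5];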
You can also specify that, following an action, there is some probability of remaining in the same
state. For example:
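For instance, to remain in state 1 with probability 0.25 and otherwise move to state 2 or 3 (illustrative values):
MDP.T(1,[1 2 3],4) = [0.25 0.25 0.5];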
Reward transition matrix, specified as a 3-D array, which determines how much reward the agent
receives after performing an action in the environment. R has the same shape and size as state
transition matrix T. The reward for moving from state s to state s' by performing action a is given
by:
r = R(s,s′,a).
Terminal state names in the grid world, specified as a string vector of state names.
Version History
Introduced in R2019a
See Also
rlMDPEnv | createGridWorld
Topics
“Train Reinforcement Learning Agent in MDP Environment”
evaluate
Package: rl.function
Evaluate function approximator object given observation (or observation-action) input data
Syntax
outData = evaluate(fcnAppx,inData)
[outData,state] = evaluate(fcnAppx,inData)
Description
outData = evaluate(fcnAppx,inData) evaluates the function approximator object (that is, the
actor or critic) fcnAppx given the input value inData. It returns the output value outData.
Examples
This example shows you how to evaluate a function approximator object (that is, an actor or a critic).
For this example, the function approximator object is a discrete categorical actor and you evaluate it
given some observation data, obtaining in return the action probability distribution and the updated
network state.
Load the same environment used in “Train PG Agent to Balance Cart-Pole System”, and obtain the
observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "CartPole States"
Description: "x, dx, theta, dtheta"
Dimension: [4 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
To approximate the policy within the actor, use a recurrent deep neural network. Define the network
as an array of layer objects. Get the dimensions of the observation space and the number of possible
actions directly from the environment specification objects.
net = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(8)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(numel(actInfo.Elements)) ];
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
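A sketch of the actor construction assumed by the calls that follow:
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);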
Use evaluate to return the probability of each of the two possible actions. Note that the type of the
returned numbers is single, not double.
[prob,state] = evaluate(actor,{rand(obsInfo.Dimension)});
prob{1}
0.4847
0.5153
Since a recurrent neural network is used for the actor, the second output argument, representing the
updated state of the neural network, is not empty. In this case, it contains the updated (cell and
hidden) states for the eight units of the lstm layer used in the network.
state{:}
-0.0833
0.0619
-0.0066
-0.0651
0.0714
-0.0957
0.0614
-0.0326
-0.1367
0.1142
-0.0158
-0.1820
0.1305
-0.1779
0.0947
-0.0833
You can use getState and setState to extract and set the current state of the actor.
getState(actor)
You can obtain action probabilities and updated states for a batch of observations. For example, use a
batch of five independent observations.
obsBatch = reshape(1:20,4,1,5,1);
[prob,state] = evaluate(actor,{obsBatch})
The output arguments contain action probabilities and updated states for each observation in the
batch.
Note that the actor treats observation data along the batch length dimension independently, not
sequentially.
prob{1}
To evaluate the actor using sequential observations, use the sequence length (time) dimension. For
example, obtain action probabilities for five independent sequences, each one made of nine
sequential observations.
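A sketch of the corresponding call, batching random observations along the batch (third) and sequence (fourth) dimensions:
[prob,state] = evaluate(actor, ...
    {rand([obsInfo.Dimension 5 9])});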
The first output argument contains a vector of two probabilities (first dimension) for each element of
the observation batch (second dimension) and for each time element of the sequence length (third
dimension).
The second output argument contains two vectors of final states for each observation batch (that is,
the network maintains a separate state history for each observation batch).
Display the probability of the second action, after the seventh sequential observation in the fourth
independent batch.
prob{1}(2,4,7)
ans = single
0.5675
For more information on input and output format for recurrent neural networks, see the Algorithms
section of lstmLayer.
Input Arguments
fcnAppx — Function approximator object
function approximator object

Function approximator object, specified as one of the following:
• rlValueFunction,
• rlQValueFunction,
• rlVectorQValueFunction,
• rlDiscreteCategoricalActor,
• rlContinuousDeterministicActor,
• rlContinuousGaussianActor,
• rlContinuousDeterministicTransitionFunction,
• rlContinuousGaussianTransitionFunction,
• rlContinuousDeterministicRewardFunction,
• rlContinuousGaussianRewardFunction,
• rlIsDoneFunction object.
inData — Input data for function approximator
cell array

Input data for the function approximator, specified as a cell array with as many elements as the
number of input channels of fcnAppx. In the following section, the number of observation channels is
indicated by NO.
• If fcnAppx is an rlQValueFunction, an
rlContinuousDeterministicTransitionFunction or an
rlContinuousGaussianTransitionFunction object, then each of the first NO elements of
inData must be a matrix representing the current observation from the corresponding
observation channel. They must be followed by a final matrix representing the action.
• If fcnAppx is a function approximator object representing an actor or critic (but not an
rlQValueFunction object), inData must contain NO elements, each one a matrix representing
the current observation from the corresponding observation channel.
• If fcnAppx is an rlContinuousDeterministicRewardFunction, an
rlContinuousGaussianRewardFunction, or an rlIsDoneFunction object, then each of the
first NO elements of inData must be a matrix representing the current observation from the
corresponding observation channel. They must be followed by a matrix representing the action,
and finally by NO elements, each one being a matrix representing the next observation from the
corresponding observation channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Example: {rand(8,3,64,1),rand(4,1,64,1),rand(2,1,64,1)}
Output Arguments
outData — Output data from evaluation of function approximator object
cell array
Output data from the evaluation of the function approximator object, returned as a cell array. The size
and contents of outData depend on the type of object you use for fcnAppx. Here, NO is the number of
observation channels, and each element of outData is an array of size D-by-LB-by-LS, where:
• D is the vector of dimensions of the corresponding output channel of fcnAppx. Depending on the
type of approximator function, this channel can carry a predicted observation (or its mean value or
standard deviation), an action (or its mean value or standard deviation), the value (or values) of an
observation (or observation-action couple), a predicted reward, or a predicted termination status.
• LB is the batch size (length of a batch of independent inputs).
• LS is the sequence length (length of the sequence of inputs along the time dimension) for a
recurrent neural network. If fcnAppx does not use a recurrent neural network (which is the case
for environment function approximators, as they do not support recurrent neural networks), then
LS = 1.
Note If fcnAppx is a critic, then evaluate behaves identically to getValue except that it returns
results inside a single-cell array. If fcnAppx is an rlContinuousDeterministicActor actor, then
evaluate behaves identically to getAction. If fcnAppx is a stochastic actor such as an
rlDiscreteCategoricalActor or rlContinuousGaussianActor, then evaluate returns the
action probability distribution, while getAction returns a sample action. Specifically, for an
rlDiscreteCategoricalActor actor object, evaluate returns the probability of each possible
action. For an rlContinuousGaussianActor actor object, evaluate returns the mean and
standard deviation of the Gaussian distribution. For these kinds of actors, see also the note in
getAction regarding the enforcement of constraints set by the action specification.
state — Updated state of function approximator object
cell array

Next state of the function approximator object, returned as a cell array. If fcnAppx does not use a
recurrent neural network (which is the case for environment function approximators), then state is
an empty cell array.
You can set the state of the representation to state using the setState function. For example:
critic = setState(critic,state);
Version History
Introduced in R2022a
See Also
getValue | getAction | getMaxQValue | rlValueFunction | rlQValueFunction |
rlVectorQValueFunction | rlContinuousDeterministicActor |
rlDiscreteCategoricalActor | rlContinuousGaussianActor |
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | accelerate | gradient | predict
Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
exteriorPenalty
Exterior penalty value for a point with respect to a bounded region
Syntax
p = exteriorPenalty(x,xmin,xmax,method)
Description
p = exteriorPenalty(x,xmin,xmax,method) uses the specified method to calculate the
nonnegative (exterior) penalty vector p for the point x with respect to the region bounded by xmin
and xmax. p has the same dimension as x.
Examples
This example shows how to use the exteriorPenalty function to calculate the exterior penalty for
a given point, with respect to a bounded region.
Calculate the penalty value for the point 0.1 within the interval [-2,2], using the step method.
exteriorPenalty(0.1,-2,2,'step')
ans = 0
Calculate the penalty value for the point 4 outside the interval [-2,2], using the step method.
exteriorPenalty(4,-2,2,'step')
ans = 1
Calculate the penalty value for the point 4 outside the interval [-2,2], using the quadratic method.
exteriorPenalty(4,-2,2,'quadratic')
ans = 4
Calculate the penalty value for the point [-2,0,4] with respect to the box defined by the intervals
[0,1], [-1,1], and [-2,2] along the x, y, and z dimensions, respectively, using the quadratic method.
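A sketch of the corresponding call (column vectors assumed):
exteriorPenalty([-2;0;4],[0;-1;-2],[1;1;2],'quadratic')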
ans = 3×1
4
0
4
x = -5:0.01:5;
Calculate penalties for all the points in the vector, using the quadratic method.
p = exteriorPenalty(x,-2,2,'quadratic');
plot(x,p)
grid
xlabel("point position");
ylabel("penalty value");
title("Penalty values over an interval");
Input Arguments
x — Point for which penalty is calculated
scalar | vector | matrix
Point for which the exterior penalty is calculated, specified as a numeric scalar, vector, or matrix.
Example: [-0.1, 1.3]
Lower bounds for x, specified as a numeric scalar, vector, or matrix. To use the same minimum value
for all elements in x, specify xmin as a scalar.
Example: -2
Upper bounds for x, specified as a numeric scalar, vector, or matrix. To use the same maximum value
for all elements in x, specify xmax as a scalar.
Example: [5 10]
Function used to calculate the penalty, specified either as 'step' or 'quadratic'. You can also use
strings instead of character vectors.
Example: "quadratic"
Output Arguments
p — Penalty value
nonnegative vector
Penalty value, returned as a vector of nonnegative elements. With either of the two methods, each
element p(i) is zero if the corresponding x(i) is within the region specified by xmin(i) and xmax(i), and it is
positive otherwise. Penalty functions are typically used to generate negative rewards when
constraints are violated, such as in generateRewardFunction.
Version History
Introduced in R2021b
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
See Also
Functions
generateRewardFunction | hyperbolicPenalty | barrierPenalty
Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Define Reward Signals”
generatePolicyBlock
Generate Simulink block that evaluates policy of an agent or policy object
Syntax
generatePolicyBlock(agent)
generatePolicyBlock(policy)
Description
This function generates a Simulink Policy evaluation block from an agent or policy object. It also
creates a data file which stores policy information. The generated policy block loads this data file to
properly initialize itself prior to simulation. You can use the block to simulate the policy and generate
code for deployment purposes.
For more information on policies and value functions, see “Create Policies and Value Functions”.
generatePolicyBlock(agent) creates a block that evaluates the policy of the specified agent
using the default block name, policy name, and data file name.
generatePolicyBlock(policy) creates a block that evaluates the learned policy of the specified
policy object using the default block name, policy name, and data file name.
Examples
First, create and train a reinforcement learning agent. For this example, load the PG agent trained in
“Train PG Agent to Balance Cart-Pole System”.
load("MATLABCartpolePG.mat","agent")
Then, create a policy evaluation block from this agent using default names.
generatePolicyBlock(agent);
This command creates an untitled Simulink® model containing the policy block, and the
blockAgentData.mat file, which contains the information needed to create and initialize the policy
block (such as the trained deep neural network used by the actor within the agent). The block loads
this data file to properly initialize itself prior to simulation.
You can now drag and drop the block in a Simulink® model and connect it so that it takes the
observation from the environment as input and so that the calculated action is returned to the
environment. This allows you to simulate the policy in a closed loop. You can then generate code for
deployment purposes. For more information, see “Deploy Trained Reinforcement Learning Policies”.
bdclose("untitled")
Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.
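The code that creates these specification objects is not included in this excerpt; a minimal sketch consistent with the dimensions stated above is:
% Continuous four-dimensional observation space and two-dimensional action space.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);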
Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.
To approximate the policy function within the actor, use a recurrent deep neural network model.
Define the network as an array of layer objects, and get the dimension of the observation and action
spaces from the environment specification objects. To create a recurrent network, use a
sequenceInputLayer as the input layer (with size equal to the number of dimensions of the
observation channel) and include at least one lstmLayer.
layers = [
sequenceInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(10)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(20)
fullyConnectedLayer(actInfo.Dimension(1))
tanhLayer
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions (CTB)
Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
Check the actor with a random observation input.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
   -0.0742
    0.0158
Create a deterministic actor policy object from the actor.
policy = rlDeterministicActorPolicy(actor)
policy =
rlDeterministicActorPolicy with properties:
You can access the policy options using dot notation. Check the policy with a random observation
input.
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
-0.0060
-0.0161
Then, create a policy evaluation block from this policy object using the default name for the
generated MAT-file.
generatePolicyBlock(policy);
This command creates an untitled Simulink® model containing the policy block, and the
blockAgentData.mat file, which contains the information needed to create and initialize the policy
block (such as the trained deep neural network used by the actor within the agent). The block loads
this data file to properly initialize itself prior to simulation.
You can now drag and drop the block in a Simulink® model and connect it so that it takes the
observation from the environment as input and so that the calculated action is returned to the
environment. This allows you to simulate the policy in a closed loop. You can then generate code for
deployment purposes. For more information, see “Deploy Trained Reinforcement Learning Policies”.
bdclose("untitled")
Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object
Trained reinforcement learning agent, specified as one of the following agent objects. To train your
agent, use the train function.
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlPGAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent
For agents with a stochastic actor (PG, PPO, SAC, TRPO, AC), the action returned by the generated
policy function depends on the value of the UseExplorationPolicy property of the agent. By
default, UseExplorationPolicy is false and the generated action is deterministic. If
UseExplorationPolicy is true, the generated action is stochastic.
policy — Policy object
rlMaxQPolicy | rlDeterministicActorPolicy | rlStochasticActorPolicy
Policy object, specified as one of the following:
• rlMaxQPolicy
• rlDeterministicActorPolicy
• rlStochasticActorPolicy
MATFileName — Name of generated data file
string | character vector
Name of generated data file, specified as a string or character vector. If a file with the specified name
already exists in the current MATLAB folder, then an appropriate digit is added to the name so that
no existing file is overwritten.
The generated data file contains four structures that store data needed to fully characterize the
policy. Prior to simulation, the block (which is generated with the data file name as mask parameter)
loads this data file to properly initialize itself.
Version History
Introduced in R2019a
See Also
Policy | generatePolicyFunction | train | dlnetwork
Topics
“Generate Policy Block for Deployment”
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
“Deploy Trained Reinforcement Learning Policies”
generatePolicyFunction
Package: rl.policy
Generate function that evaluates policy of an agent or policy object
Syntax
generatePolicyFunction(agent)
generatePolicyFunction(policy)
Description
This function generates a policy evaluation function which you can use to:
• Generate code for deployment purposes using MATLAB Coder™ or GPU Coder™. For more
information, see “Deploy Trained Reinforcement Learning Policies”.
• Simulate the trained agent in Simulink using a MATLAB Function block.
This function also creates a data file which stores policy information. The evaluation function loads
this data file to properly initialize itself the first time it is called.
For more information on policies and value functions, see “Create Policies and Value Functions”.
generatePolicyFunction(agent) creates a function that evaluates the learned policy of the
specified agent using the default function name, policy name, and data file name.
generatePolicyFunction(policy) creates a function that evaluates the learned policy of the
specified policy object using the default function name, policy name, and data file name.
generatePolicyFunction( ___ ,Name=Value) specifies the function name, policy name, and data
file name using one or more name-value pair arguments.
Examples
This example shows how to create a policy evaluation function for a PG Agent.
First, create and train a reinforcement learning agent. For this example, load the PG agent trained in
“Train PG Agent to Balance Cart-Pole System”.
load("MATLABCartpolePG.mat","agent")
Then, create a policy evaluation function for this agent using default names.
generatePolicyFunction(agent);
This command creates the evaluatePolicy.m file, which contains the policy function, and the
agentData.mat file, which contains the trained deep neural network actor.
type evaluatePolicy.m
function action1 = evaluatePolicy(observation1)
persistent policy;
if isempty(policy)
policy = coder.loadRLPolicy("agentData.mat");
end
% evaluate the policy
action1 = getAction(policy,observation1);
end
evaluatePolicy(rand(agent.ObservationInfo.Dimension))
ans = 10
You can now generate code for this policy function using MATLAB® Coder™. For more information,
see “Deploy Trained Reinforcement Learning Policies”.
You can create and train a policy object in a custom training loop or extract a trained object from a
trained agent. For this example, load the PG agent trained in “Train PG Agent to Balance Cart-Pole
System”, and extract its greedy policy using getGreedyPolicy. Alternatively, you can extract an
explorative policy using getExplorationPolicy.
load("MATLABCartpolePG.mat","agent")
policy = getGreedyPolicy(agent)
policy =
rlStochasticActorPolicy with properties:
Then, create a policy evaluation function for this policy using default names.
generatePolicyFunction(policy);
This command creates the evaluatePolicy.m file, which contains the policy function, and the
agentData.mat file, which contains the trained deep neural network actor.
type evaluatePolicy.m
function action1 = evaluatePolicy(observation1)
persistent policy;
if isempty(policy)
policy = coder.loadRLPolicy("agentData.mat");
end
% evaluate the policy
action1 = getAction(policy,observation1);
end
evaluatePolicy(rand(policy.ObservationInfo.Dimension))
ans = 10
You can now generate code for this policy function using MATLAB® Coder™. For more information,
see “Deploy Trained Reinforcement Learning Policies”.
This example shows how to create a policy evaluation function for a Q-Learning Agent.
For this example, load the Q-learning agent trained in “Train Reinforcement Learning Agent in Basic
Grid World”
load("basicGWQAgent.mat","qAgent")
Create a policy evaluation function for this agent and specify the name of the agent data file.
generatePolicyFunction(qAgent,"MATFileName","policyFile.mat")
This command creates the evaluatePolicy.m file, which contains the policy function, and the
policyFile.mat file, which contains the trained Q table value function.
type evaluatePolicy.m
function action1 = evaluatePolicy(observation1)
persistent policy;
if isempty(policy)
policy = coder.loadRLPolicy("policyFile.mat");
end
% evaluate the policy
action1 = getAction(policy,observation1);
end
evaluatePolicy(randi(25))
ans = 3
You can now generate code for this policy function using MATLAB® Coder™. For more information,
see “Deploy Trained Reinforcement Learning Policies”.
Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object
Trained reinforcement learning agent, specified as one of the following agent objects. To train your
agent, use the train function.
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlPGAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent
For agents with a stochastic actor (PG, PPO, SAC, TRPO, AC), the action returned by the generated
policy function depends on the value of the UseExplorationPolicy property of the agent. By
default, UseExplorationPolicy is false and the generated action is deterministic. If
UseExplorationPolicy is true, the generated action is stochastic.
policy — Policy object
rlMaxQPolicy | rlDeterministicActorPolicy | rlStochasticActorPolicy
Policy object, specified as one of the following:
• rlMaxQPolicy
• rlDeterministicActorPolicy
• rlStochasticActorPolicy
Name-Value Arguments
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: FunctionName="computeAction"
PolicyName — Name of the policy object within the generated function
string | character vector
Name of the policy object within the generated function, specified as a string or character vector.
MATFileName — Name of generated data file
string | character vector
Name of generated data file, specified as a string or character vector. If a file with the specified name
already exists in the current MATLAB folder, then an appropriate digit is added to the name so that
no existing file is overwritten.
The generated data file contains four structures that store data needed to fully characterize the
policy. The evaluation function loads this data file to properly initialize itself the first time it is called.
Version History
Introduced in R2019a
The code generated by generatePolicyFunction now loads a deployable policy object from a
reinforcement learning agent. The results from running the generated policy function remain the
same.
See Also
generatePolicyBlock | Policy | train | dlnetwork
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
“Deploy Trained Reinforcement Learning Policies”
generateRewardFunction
Generate a reward function from control specifications to train a reinforcement learning agent
Syntax
generateRewardFunction(mpcobj)
generateRewardFunction(blks)
generateRewardFunction( ___ ,'FunctionName',myFcnName)
Description
generateRewardFunction(mpcobj) generates a MATLAB reward function based on the cost and
constraints defined in the linear or nonlinear MPC object mpcobj. The generated reward function is
displayed in a new editor window and you can use it as a starting point for reward design. You can
tune the weights, use a different penalty function, and then use the resulting reward function within
an environment to train an agent.
Examples
This example shows how to generate a reinforcement learning reward function from an MPC object.
Create a random plant using the rss function and set the feedthrough matrix to zero.
plant = rss(4,3,2);
plant.d = 0;
Specify which of the plant signals are manipulated variables, measured disturbances, measured
outputs and unmeasured outputs.
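The signal-assignment code is not shown in this excerpt. One possible assignment, using setmpcsignals (the channel indices here are an assumption for illustration only), is:
% Assume input 1 is a manipulated variable, input 2 a measured disturbance,
% outputs 1 and 2 are measured, and output 3 is unmeasured.
plant = setmpcsignals(plant,'MV',1,'MD',2,'MO',[1 2],'UO',3);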
Create an MPC controller with a sample time of 0.1 and prediction and control horizons of 10 and 3
steps, respectively.
mpcobj = mpc(plant,0.1,10,3);
Generate the reward function code from specifications in the mpc object using
generateRewardFunction. The code is displayed in the MATLAB Editor.
generateRewardFunction(mpcobj)
For this example, the generated code is saved in the MATLAB function file myMpcRewardFcn.m. Display
the generated reward function.
type myMpcRewardFcn.m
%#codegen
mvmin = -2;
mvmax = 2;
mvratemin = -Inf;
mvratemax = Inf;
%% Compute cost
dy = (refy(:)-y(:)) ./ Sy';
dmv = (refmv(:)-mv(:)) ./ Smv';
dmvrate = (mv(:)-lastmv(:)) ./ Smv';
Jy = dy' * diag(Qy.^2) * dy;
Jmv = dmv' * diag(Qmv.^2) * dmv;
Jmvrate = dmvrate' * diag(Qmvrate.^2) * dmvrate;
Cost = Jy + Jmv + Jmvrate;
%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
%
% Set Pmv value to 0 if the RL agent action specification has
% appropriate 'LowerLimit' and 'UpperLimit' values.
Py = Wy * exteriorPenalty(y,ymin,ymax,'step');
Pmv = Wmv * exteriorPenalty(mv,mvmin,mvmax,'step');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'step');
Penalty = Py + Pmv + Pmvrate;
%% Compute reward
reward = -(Cost + Penalty);
end
The calculated reward depends only on the current values of the plant input and output signals and
their reference values, and it is composed of two parts.
The first is a negative cost that depends on the squared difference between desired and current plant
inputs and outputs. This part uses the cost function weights specified in the MPC object. The second
part is a penalty that acts as a negative reward whenever the current plant signals violate the
constraints.
The generated reward function is a starting point for reward design. You can tune the weights or use
a different penalty function to define a more appropriate reward for your reinforcement learning
agent.
This example shows how to generate a reinforcement learning reward function from a Simulink
Design Optimization model verification block.
For this example, open the Simulink model LevelCheckBlock.slx, which contains a Check Step
Response Characteristics block named Level Check.
open_system('LevelCheckBlock')
Generate the reward function code from specifications in the Level Check block, using
generateRewardFunction. The code is displayed in the MATLAB Editor.
generateRewardFunction('LevelCheckBlock/Level Check')
For this example, the code is saved in the MATLAB function file myBlockRewardFcn.m.
type myBlockRewardFcn.m
%#codegen
Block1_InitialValue = 1;
Block1_FinalValue = 2;
Block1_StepTime = 0;
Block1_StepRange = Block1_FinalValue - Block1_InitialValue;
Block1_MinRise = Block1_InitialValue + Block1_StepRange * 80/100;
Block1_MaxSettling = Block1_InitialValue + Block1_StepRange * (1+2/100);
Block1_MinSettling = Block1_InitialValue + Block1_StepRange * (1-2/100);
Block1_MaxOvershoot = Block1_InitialValue + Block1_StepRange * (1+10/100);
Block1_MinUndershoot = Block1_InitialValue - Block1_StepRange * 5/100;
if t >= Block1_StepTime
if Block1_InitialValue <= Block1_FinalValue
Block1_UpperBoundTimes = [0,5; 5,max(5+1,t+1)];
Block1_UpperBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettlin
Block1_LowerBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
Block1_LowerBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,
else
Block1_UpperBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
Block1_UpperBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,
Block1_LowerBoundTimes = [0,5; 5,max(5+1,t+1)];
Block1_LowerBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettlin
end
Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1));
for idx = 1:numel(Block1_xmax)
tseg = Block1_UpperBoundTimes(idx,:);
xseg = Block1_UpperBoundAmplitudes(idx,:);
Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN);
end
if all(isnan(Block1_xmax))
Block1_xmax = Inf;
else
Block1_xmax = max(Block1_xmax,[],'omitnan');
end
Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1));
for idx = 1:numel(Block1_xmin)
tseg = Block1_LowerBoundTimes(idx,:);
xseg = Block1_LowerBoundAmplitudes(idx,:);
Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN);
end
if all(isnan(Block1_xmin))
Block1_xmin = -Inf;
else
Block1_xmin = max(Block1_xmin,[],'omitnan');
end
else
Block1_xmin = -Inf;
Block1_xmax = Inf;
end
%% Compute penalty
%% Compute reward
reward = -Weight * Penalty;
end
The generated reward function takes as input arguments the current value of the verification block
input signals and the simulation time. A negative reward is calculated using a weighted penalty that
acts whenever the current block input signals violate the linear bound constraints defined in the
verification block.
The generated reward function is a starting point for reward design. You can tune the weights or use
a different penalty function to define a more appropriate reward for your reinforcement learning
agent.
close_system('LevelCheckBlock')
Input Arguments
mpcobj — Linear or nonlinear MPC object
mpc object | nlmpc object
Linear or nonlinear MPC object, specified as an mpc object or an nlmpc object, respectively.
Note that:
• The generated function calculates rewards using signal values at the current time only. Predicted
future values, signal previewing, and control horizon settings are not used in the reward
calculation.
• Using time-varying cost weights and constraints, or updating them online, is not supported.
• Only the standard quadratic cost function, as described in “Optimization Problem” (Model
Predictive Control Toolbox), is supported. Therefore, for mpc objects, using mixed constraint
specifications is not supported. Similarly, for nlmpc objects, custom cost and constraint
specifications are not supported.
blks — Path to model verification blocks
character array | cell array | string array
Path to one or more Simulink Design Optimization model verification blocks, such as the Check Step
Response Characteristics block used in the previous example, specified as a character array, cell
array, or string array.
Example: "mySimulinkModel02/Check Against Reference"
Tips
By default, the exterior bound penalty function exteriorPenalty is used to calculate the penalty.
Alternatively, to calculate hyperbolic and barrier penalties, you can use the hyperbolicPenalty or
barrierPenalty functions.
Version History
Introduced in R2021b
See Also
Functions
exteriorPenalty | hyperbolicPenalty | barrierPenalty
Objects
mpc | nlmpc
Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Generate Reward Function from a Model Verification Block for a Water Tank System”
“Define Reward Signals”
“Create MATLAB Reinforcement Learning Environments”
“Create Simulink Reinforcement Learning Environments”
getActionInfo
Obtain action data specifications from reinforcement learning environment, agent, or experience
buffer
Syntax
actInfo = getActionInfo(env)
actInfo = getActionInfo(agent)
actInfo = getActionInfo(buffer)
Description
actInfo = getActionInfo(env) extracts action information from reinforcement learning
environment env.
actInfo = getActionInfo(agent) extracts action information from reinforcement learning
agent agent.
actInfo = getActionInfo(buffer) extracts action information from experience buffer buffer.
Examples
Extract action and observation information that you can use to create other environments or agents.
The reinforcement learning environment for this example is the simple longitudinal dynamics for ego
car and lead car. The training goal is to make the ego car travel at a set velocity while maintaining a
safe distance from lead car by controlling longitudinal acceleration (and braking). This example uses
the same vehicle model as the “Adaptive Cruise Control System Using Model Predictive Control”
(Model Predictive Control Toolbox) example.
mdl = 'rlACCMdl';
open_system(mdl);
agentblk = [mdl '/RL Agent'];
% create the observation info
obsInfo = rlNumericSpec([3 1],'LowerLimit',-inf*ones(3,1),'UpperLimit',inf*ones(3,1));
obsInfo.Name = 'observations';
obsInfo.Description = 'information on velocity error and ego velocity';
% action Info
actInfo = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
actInfo.Name = 'acceleration';
% define environment
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo)
env =
SimulinkEnvWithAgent with properties:
Model : rlACCMdl
AgentBlock : rlACCMdl/RL Agent
ResetFcn : []
UseFastRestart : on
The reinforcement learning environment env is a SimulinkEnvWithAgent object with the above
properties.
Extract the action and observation information from the reinforcement learning environment env.
actInfoExt = getActionInfo(env)
actInfoExt =
rlNumericSpec with properties:
LowerLimit: -3
UpperLimit: 2
Name: "acceleration"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
obsInfoExt = getObservationInfo(env)
obsInfoExt =
rlNumericSpec with properties:
The action information contains acceleration values while the observation information contains the
velocity and velocity error values of the ego vehicle.
Input Arguments
env — Reinforcement learning environment
rlFunctionEnv object | SimulinkEnvWithAgent object | rlNeuralNetworkEnvironment object
| predefined MATLAB environment object
Reinforcement learning environment from which to extract the action information, specified as one of
the following:
• rlFunctionEnv
• SimulinkEnvWithAgent object created using rlSimulinkEnv or createIntegratedEnv
• rlNeuralNetworkEnvironment
• Predefined MATLAB environment created using rlPredefinedEnv
For more information on reinforcement learning environments, see “Create MATLAB Reinforcement
Learning Environments” and “Create Simulink Reinforcement Learning Environments”.
Reinforcement learning agent from which to extract the action information, specified as one of the
following objects.
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlPGAgent
• rlACAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent
• rlMBPOAgent
For more information on reinforcement learning agents, see “Reinforcement Learning Agents”.
Experience buffer from which to extract the action information, specified as an rlReplayMemory or
rlPrioritizedReplayMemory object.
Output Arguments
actInfo — Action data specifications
array of rlNumericSpec objects | array of rlFiniteSetSpec objects
Action data specifications extracted from the reinforcement learning environment, returned as an
array of one of the following:
• rlNumericSpec objects
• rlFiniteSetSpec objects
• A mix of rlNumericSpec and rlFiniteSetSpec objects
Version History
Introduced in R2019a
See Also
rlNumericSpec | rlFiniteSetSpec | getObservationInfo | rlQAgent | rlSARSAAgent |
rlDQNAgent | rlPGAgent | rlACAgent | rlDDPGAgent
Topics
“Create Simulink Reinforcement Learning Environments”
“Reinforcement Learning Agents”
getAction
Package: rl.policy
Obtain action from agent, actor, or policy object given environment observations
Syntax
action = getAction(agent,obs)
[action,agent] = getAction(agent,obs)
action = getAction(actor,obs)
[action,nextState] = getAction(actor,obs)
action = getAction(policy,obs)
[action,updatedPolicy] = getAction(policy,obs)
Description
Agent
action = getAction(agent,obs) returns the action generated from the policy of the
reinforcement learning agent agent, given environment observations obs.
[action,agent] = getAction(agent,obs) also returns agent itself as an output argument,
updated when, for example, the agent uses a recurrent neural network whose state changes after
computing the action.
Actor
action = getAction(actor,obs) returns the action generated from the policy represented by
the actor actor, given environment observations obs.
[action,nextState] = getAction(actor,obs) also returns the updated state of the actor when
it uses a recurrent neural network.
Policy
action = getAction(policy,obs) returns the action generated from the policy object policy,
given environment observations obs.
[action,updatedPolicy] = getAction(policy,obs) also returns the updated policy object,
with its internal states (if any) updated.
Examples
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a TRPO agent from the environment observation and action specifications.
agent = rlTRPOAgent(obsInfo,actInfo);
getAction(agent, ...
{rand(obsInfo(1).Dimension), ...
rand(obsInfo(2).Dimension)})
You can also obtain actions for a batch of observations. For example, obtain actions for a batch of 10
observations.
ans = 1×3
1 1 10
actBatch{1}(1,1,7)
ans = -2
Create observation and action information. You can also obtain these specifications from an
environment.
fullyConnectedLayer(20,'Name','CriticStateFC2')
fullyConnectedLayer(actinfo.Dimension(1),'Name','fc2')
tanhLayer('Name','tanh1')];
net = dlnetwork(net);
act = getAction(actor,{rand(4,1,10)})
act is a single cell array that contains the two computed actions for all 10 observations in the batch.
act{1}(:,1,7)
0.2643
-0.2934
Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.
Alternatively, you can use getObservationInfo and getActionInfo to extract the specification
objects from an environment.
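The specification objects themselves are not shown in this excerpt; a minimal sketch matching these dimensions is:
% Continuous four-dimensional observation space and two-dimensional action space.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);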
Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.
To approximate the policy function within the actor, use a recurrent deep neural network model.
Define the network as an array of layer objects, and get the dimension of the observation and action
spaces from the environment specification objects. To create a recurrent network, use a
sequenceInputLayer as the input layer (with size equal to the number of dimensions of the
observation channel) and include at least one lstmLayer.
layers = [
sequenceInputLayer(obsInfo.Dimension(1))
lstmLayer(2)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Number of learnables: 62
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
Check the actor with a random observation input.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
    0.0568
    0.0691
Create an additive noise policy object from the actor.
policy = rlAdditiveNoisePolicy(actor)
policy =
rlAdditiveNoisePolicy with properties:
Use getAction to generate an action from the policy, given a random observation input.
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
0.5922
-0.3745
Display the state of the recurrent neural network in the policy object.
xNN = getRNNState(policy);
xNN{1}
0
0
Use getAction to return the action and the updated policy, given a random observation input.
[act,updatedPolicy] = getAction(policy,{rand(obsInfo.Dimension)});
Display the state of the recurrent neural network in the updated policy object.
xpNN = getRNNState(updatedPolicy);
xpNN{1}
0.3327
-0.2479
Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.
Note agent is a handle object, so it is updated whether it is returned as an output argument or not.
For more information about handle objects, see “Handle Object Behavior”.
actor — Actor
rlContinuousDeterministicActor object | rlContinuousGaussianActor object |
rlDiscreteCategoricalActor object
policy — Reinforcement learning policy
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy | rlAdditiveNoisePolicy | rlStochasticActorPolicy
Reinforcement learning policy, specified as one of the following objects:
• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy
obs — Environment observations
cell array
Environment observations, specified as a cell array with as many elements as there are observation
input channels. Each element of obs contains an array of observations for a single observation input
channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Output Arguments
action — Action
single-element cell array
Action, returned as a single-element cell array containing an array with dimensions MA-by-LB-by-LS,
where:
• MA corresponds to the dimensions of the associated action specification.
• LB is the batch size.
• LS is the sequence length (used with recurrent neural networks).
Note The following continuous action-space actor, policy and agent objects do not enforce the
constraints set by the action specification:
• rlContinuousDeterministicActor
• rlStochasticActorPolicy
• rlACAgent
• rlPGAgent
• rlPPOAgent
In these cases, you must enforce action space constraints within the environment.
nextState — Next state of the actor
cell array
Next state of the actor, returned as a cell array. If actor does not use a recurrent neural network,
then nextState is an empty cell array.
You can set the state of the actor to nextState using the setState function. For example:
actor = setState(actor,nextState);
agent — Updated agent
reinforcement learning agent object
Updated agent, returned as the same agent object as the agent in the input argument. Note that
agent is a handle object. Therefore, its internal states (if any) are updated whether agent is
returned as an output argument or not. For more information about handle objects, see “Handle
Object Behavior”.
updatedPolicy — Updated policy
policy object
Updated policy object. It is identical to the policy object supplied as the first input argument, except
that its internal states (if any) are updated.
Tips
The function evaluate behaves, for actor objects, similarly to getAction, except for the following
differences:
• For an rlDiscreteCategoricalActor object, evaluate returns the probability of each possible
action, while getAction returns a sampled action.
• For an rlContinuousGaussianActor object, evaluate returns the mean and standard deviation
of the output distribution, while getAction returns a sampled action.
Version History
Introduced in R2020a
See Also
evaluate | getValue | getMaxQValue
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
getActor
Package: rl.agent
Obtain actor from reinforcement learning agent
Syntax
actor = getActor(agent)
Description
actor = getActor(agent) returns the actor object from the specified reinforcement learning
agent.
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
actor = getActor(agent);
params = getLearnableParameters(actor)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
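The modification code is not included in this excerpt. Because params is a cell array of parameter values, one possible sketch of this step is:
% Multiply every learnable parameter array by 2.
modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);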
Set the parameter values of the actor to the new modified values.
actor = setLearnableParameters(actor,modifiedParams);
setActor(agent,actor);
getLearnableParameters(getActor(agent))
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
plot(layerGraph(modifiedActorNet))
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the agent.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
agent — Reinforcement learning agent
rlDDPGAgent object | rlTD3Agent object | rlPGAgent object | rlACAgent object | rlPPOAgent
object | rlSACAgent object
Reinforcement learning agent that contains an actor, specified as one of the following:
• rlPGAgent object
• rlDDPGAgent object
• rlTD3Agent object
• rlACAgent object
• rlSACAgent object
• rlPPOAgent object
• rlTRPOAgent object
Output Arguments
actor — Actor
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
Version History
Introduced in R2019a
See Also
getCritic | setActor | setCritic | getModel | setModel | getLearnableParameters |
setLearnableParameters
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
getCritic
Package: rl.agent
Obtain critic from reinforcement learning agent
Syntax
critic = getCritic(agent)
Description
critic = getCritic(agent) returns the critic object from the specified reinforcement learning
agent.
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
critic = getCritic(agent);
params = getLearnableParameters(critic)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
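The modification code is not included in this excerpt; since params is a cell array of parameter values, a minimal sketch of this step is:
% Multiply every learnable parameter array by 2.
modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);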
Set the parameter values of the critic to the new modified values.
critic = setLearnableParameters(critic,modifiedParams);
setCritic(agent,critic);
getLearnableParameters(getCritic(agent))
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
plot(layerGraph(modifiedActorNet))
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the agent.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
agent — Reinforcement learning agent
rlQAgent | rlSARSAAgent | rlDQNAgent | rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent
| rlSACAgent | rlPPOAgent | rlTRPOAgent
Reinforcement learning agent that contains a critic, specified as one of the following objects:
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent (when using a critic to estimate a baseline value function)
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
Output Arguments
critic — Critic
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object | two-
element row vector of rlQValueFunction objects
Version History
Introduced in R2019a
See Also
getActor | setActor | setCritic | getModel | setModel | getLearnableParameters |
setLearnableParameters
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
getLearnableParameters
Package: rl.policy
Obtain learnable parameter values from agent, function approximator, or policy object
Syntax
pars = getLearnableParameters(agent)
pars = getLearnableParameters(fcnAppx)
pars = getLearnableParameters(policy)
Description
Agent
pars = getLearnableParameters(agent) returns the learnable parameter values from the
reinforcement learning agent agent.
Actor or Critic
pars = getLearnableParameters(fcnAppx) returns the learnable parameter values from the
actor or critic function approximator object fcnAppx.
Policy
pars = getLearnableParameters(policy) returns the learnable parameter values from the
policy object policy.
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
critic = getCritic(agent);
params = getLearnableParameters(critic)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
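A minimal sketch of this step (params is a cell array of parameter values):
% Multiply every learnable parameter array by 2.
modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);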
Set the parameter values of the critic to the new modified values.
critic = setLearnableParameters(critic,modifiedParams);
setCritic(agent,critic);
getLearnableParameters(getCritic(agent))
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
actor = getActor(agent);
params = getLearnableParameters(actor)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
Set the parameter values of the actor to the new modified values.
actor = setLearnableParameters(actor,modifiedParams);
setActor(agent,actor);
getLearnableParameters(getActor(agent))
Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.
fcnAppx — Actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
To create an actor or critic function object, use one of the following methods.
policy — Reinforcement learning policy
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy | rlAdditiveNoisePolicy | rlStochasticActorPolicy
Reinforcement learning policy, specified as one of the following objects:
• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy
Output Arguments
pars — Learnable parameters
cell array
Learnable parameter values for the function object, returned as a cell array. You can modify these
parameter values and set them in the original agent or a different agent using the
setLearnableParameters function.
Version History
Introduced in R2019a
Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, getLearnableParameters now uses function approximator
objects instead.
See Also
setLearnableParameters | getActor | getCritic | setActor | setCritic
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
getMaxQValue
Obtain maximum estimated value over all possible actions from a Q-value function critic with discrete
action space, given environment observations
Syntax
[maxQ,maxActionIndex] = getMaxQValue(qValueFcnObj,obs)
[maxQ,maxActionIndex,state] = getMaxQValue( ___ )
Description
[maxQ,maxActionIndex] = getMaxQValue(qValueFcnObj,obs) evaluates the discrete-action-
space Q-value function critic qValueFcnObj and returns the maximum estimated value over all
possible actions maxQ, with the corresponding action index maxActionIndex, given environment
observations obs.
Examples
Create observation and action specification objects (or alternatively use getObservationInfo
and getActionInfo to extract the specification objects from an environment). For this example,
define the observation space as a continuous three-dimensional space, and the action space as a finite
set consisting of three possible values (named -1, 0, and 1).
obsInfo = rlNumericSpec([3 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
Create a custom basis function to approximate the Q-value function within the critic, and define an
initial parameter vector.
myBasisFcn = @(myobs,myact) [ ...
ones(4,1);
myobs(:); myact;
myobs(:).^2; myact.^2;
sin(myobs(:)); sin(myact);
cos(myobs(:)); cos(myact) ];
W0 = rand(20,1);
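The critic creation is not shown in this excerpt. One way to create it from the basis function and the initial parameter vector, assuming the custom basis function syntax of rlQValueFunction, is:
% Q-value critic based on the custom basis function and initial weights W0.
critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo);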
Use getMaxQValue to return the maximum value, among the possible actions, given a random
observation. Also return the index corresponding to the action that maximizes the value.
[v,i] = getMaxQValue(critic,{rand(3,1)})
v = 9.0719
i = 3
Create a batch set of 64 random independent observations. The third dimension is the batch size,
while the fourth is the sequence length for any recurrent neural network used by the critic (in this
case not used).
batchobs = rand(3,1,64,1);
bv = getMaxQValue(critic,{batchobs});
size(bv)
ans = 1×2
1 64
bv(44)
ans = 10.4138
Input Arguments
qValueFcnObj — Q-value function critic
rlQValueFunction object | rlVectorQValueFunction object
obs — Environment observations
cell array
Environment observations, specified as a cell array with as many elements as there are observation
input channels. Each element of obs contains an array of observations for a single observation input
channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Output Arguments
maxQ — Maximum Q-value estimate
array
Maximum Q-value estimate across all possible discrete actions, returned as a 1-by-LB-by-LS array,
where:
• LB is the batch size.
• LS is the sequence length (used with recurrent neural networks).
maxActionIndex — Action index
array
Action index corresponding to the maximum Q-value, returned as a 1-by-LB-by-LS array, where LB
and LS are defined as above.
state — Updated state of critic
cell array
Updated state of qValueFcnObj, returned as a cell array. If qValueFcnObj does not use a recurrent
neural network, then state is an empty cell array.
You can set the state of the critic to state using the setState function. For example:
qValueFcnObj = setState(qValueFcnObj,state);
Version History
Introduced in R2020a
See Also
getValue | evaluate | getAction
Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
getModel
Package: rl.function
Obtain function approximation model from actor or critic function approximator object
Syntax
model = getModel(fcnAppx)
Description
model = getModel(fcnAppx) returns the function approximation model used by the actor or critic
function object fcnAppx.
Examples
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
plot(layerGraph(modifiedActorNet))
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the agent.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
fcnAppx — Actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
To create an actor or critic function object, use one of the following methods.
Note For agents with more than one critic, such as TD3 and SAC agents, you must call getModel for
each critic representation individually, rather than calling getModel for the array returned by
getCritic.
critics = getCritic(myTD3Agent);
criticNet1 = getModel(critics(1));
criticNet2 = getModel(critics(2));
Output Arguments
model — Function approximation model
dlnetwork object | rlTable object | 1-by-2 cell array
Version History
Introduced in R2020b
Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, getModel now uses function approximator objects instead.
Starting from R2021b, built-in agents use dlnetwork objects as actor and critic representations, so
getModel returns a dlnetwork object.
• Due to numerical differences in the network calculations, previously trained agents might behave
differently. If this happens, you can retrain your agents.
• To use Deep Learning Toolbox™ functions that do not support dlnetwork, you must convert the
network to layerGraph. For example, to use deepNetworkDesigner, replace
deepNetworkDesigner(network) with deepNetworkDesigner(layerGraph(network)).
See Also
getActor | setActor | getCritic | setCritic | setModel | dlnetwork
Topics
“Create Policies and Value Functions”
getObservationInfo
Obtain observation data specifications from reinforcement learning environment, agent, or
experience buffer
Syntax
obsInfo = getObservationInfo(env)
obsInfo = getObservationInfo(agent)
obsInfo = getObservationInfo(buffer)
Description
obsInfo = getObservationInfo(env) extracts observation information from reinforcement
learning environment env.
obsInfo = getObservationInfo(agent) extracts observation information from reinforcement
learning agent agent.
obsInfo = getObservationInfo(buffer) extracts observation information from experience
buffer buffer.
Examples
Extract action and observation information that you can use to create other environments or agents.
The reinforcement learning environment for this example is the simple longitudinal dynamics for ego
car and lead car. The training goal is to make the ego car travel at a set velocity while maintaining a
safe distance from lead car by controlling longitudinal acceleration (and braking). This example uses
the same vehicle model as the “Adaptive Cruise Control System Using Model Predictive Control”
(Model Predictive Control Toolbox) example.
mdl = 'rlACCMdl';
open_system(mdl);
agentblk = [mdl '/RL Agent'];
% create the observation info
obsInfo = rlNumericSpec([3 1],'LowerLimit',-inf*ones(3,1),'UpperLimit',inf*ones(3,1));
obsInfo.Name = 'observations';
obsInfo.Description = 'information on velocity error and ego velocity';
% action Info
actInfo = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
actInfo.Name = 'acceleration';
% define environment
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo)
env =
SimulinkEnvWithAgent with properties:
Model : rlACCMdl
AgentBlock : rlACCMdl/RL Agent
ResetFcn : []
UseFastRestart : on
The reinforcement learning environment env is a SimulinkEnvWithAgent object with the above
properties.
Extract the action and observation information from the reinforcement learning environment env.
actInfoExt = getActionInfo(env)
actInfoExt =
rlNumericSpec with properties:
LowerLimit: -3
UpperLimit: 2
Name: "acceleration"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
obsInfoExt = getObservationInfo(env)
obsInfoExt =
rlNumericSpec with properties:
The action information contains acceleration values while the observation information contains the
velocity and velocity error values of the ego vehicle.
Input Arguments
env — Reinforcement learning environment
rlFunctionEnv object | SimulinkEnvWithAgent object | rlNeuralNetworkEnvironment object
| predefined MATLAB environment object
Reinforcement learning environment from which to extract the observation information, specified as
one of the following objects.
• rlFunctionEnv
• SimulinkEnvWithAgent object created using rlSimulinkEnv or createIntegratedEnv
• rlNeuralNetworkEnvironment
• Predefined MATLAB environment created using rlPredefinedEnv
For more information on reinforcement learning environments, see “Create MATLAB Reinforcement
Learning Environments” and “Create Simulink Reinforcement Learning Environments”.
Reinforcement learning agent from which to extract the observation information, specified as one of
the following objects.
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlPGAgent
• rlACAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent
• rlMBPOAgent
For more information on reinforcement learning agents, see “Reinforcement Learning Agents”.
Experience buffer from which to extract the observation information, specified as an rlReplayMemory
or rlPrioritizedReplayMemory object.
Output Arguments
obsInfo — Observation data specifications
array of rlNumericSpec objects | array of rlFiniteSetSpec objects
Observation data specifications extracted from the reinforcement learning environment, returned as
an array of one of the following:
• rlNumericSpec objects
• rlFiniteSetSpec objects
• A mix of rlNumericSpec and rlFiniteSetSpec objects
Version History
Introduced in R2019a
See Also
rlNumericSpec | rlFiniteSetSpec | getActionInfo | rlQAgent | rlSARSAAgent |
rlDQNAgent | rlPGAgent | rlACAgent | rlDDPGAgent
Topics
“Create Simulink Reinforcement Learning Environments”
“Reinforcement Learning Agents”
getValue
Obtain estimated value from a critic given environment observations and actions
Syntax
value = getValue(valueFcnAppx,obs)
value = getValue(vqValueFcnAppx,obs)
value = getValue(qValueFcnAppx,obs,act)
Description
Value Function Critic
value = getValue(valueFcnAppx,obs) evaluates the value function critic valueFcnAppx and
returns the value corresponding to the observation obs.
Q-Value Function Critics
value = getValue(vqValueFcnAppx,obs) evaluates the discrete-action-space Q-value function
critic vqValueFcnAppx and returns the vector value, in which each element represents the
estimated value given the observation obs and the action corresponding to that element's position in
the discrete action space.
value = getValue(qValueFcnAppx,obs,act) evaluates the Q-value function critic
qValueFcnAppx and returns the value corresponding to the observation obs and the action act.
[value,state] = getValue( ___ ) also returns the updated state of the critic object when it
contains a recurrent neural network.
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
To approximate the value function within the critic, create a neural network. Define a single path
from the network input (the observation) to its output (the value), as an array of layer objects.
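Neither the specification object nor the layer array appears in this excerpt. A minimal sketch consistent with the network summary shown below (a 4-feature input and 5 learnable parameters) is:
% Continuous four-dimensional observation space.
obsInfo = rlNumericSpec([4 1]);
% Single path from the observation input to a scalar value output.
net = [
    featureInputLayer(4)
    fullyConnectedLayer(1)
    ];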
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net);
Initialized: true
Number of learnables: 5
Inputs:
1 'input' 4 features
Create a critic using the network and the observation specification object. When you use this syntax
the network input layer is automatically associated with the environment observation according to
the dimension specifications in obsInfo.
critic = rlValueFunction(net,obsInfo);
Obtain a value function estimate for a random single observation. Use an observation array with the
same dimensions as the observation specification.
val = getValue(critic,{rand(4,1)})
val = single
0.7904
You can also obtain value function estimates for a batch of observations. For example obtain value
functions for a batch of 20 observations.
batchVal = getValue(critic,{rand(4,1,20)});
size(batchVal)
ans = 1×2
1 20
batchVal contains one value function estimate for each observation in the batch.
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
the observation space as a continuous four-dimensional space, so that a single observation is a
column vector containing four doubles, and the action space as a finite set consisting of three
possible values (named 7, 5, and 3 in this case).
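The specification objects are not shown in this excerpt; a minimal sketch consistent with the description above is:
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);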
To approximate the Q-value function within the critic, create a neural network. Define a single path
from the network input to its output as array of layer objects. The input of the network must accept a
four-element vector, as defined by obsInfo. The output must be a single output layer having as many
elements as the number of possible discrete actions (three in this case, as defined by actInfo).
net = [featureInputLayer(4)
fullyConnectedLayer(3)];
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 15
Inputs:
1 'input' 4 features
Create the critic using the network, as well as the names of the observation and action specification
objects. The network input layers are automatically associated with the components of the
observation signals according to the dimension specifications in obsInfo.
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
Use getValue to return the values of a random observation, using the current network weights.
v = getValue(critic,{rand(obsInfo.Dimension)})
0.7232
0.8177
-0.2212
v contains three value function estimates, one for each possible discrete action.
You can also obtain value function estimates for a batch of observations. For example, obtain value
function estimates for a batch of 10 observations.
batchV = getValue(critic,{rand([obsInfo.Dimension 10])});
size(batchV)
ans = 1×2
     3    10
batchV contains three value function estimates for each observation in the batch.
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
the observation space as having two continuous channels, the first one carrying an 8-by-3
matrix, and the second one a continuous four-dimensional vector. The action specification is a
continuous column vector containing 2 doubles.
The output of the critic is the scalar W'*myBasisFcn(obs,act), representing the Q-value function
to be approximated.
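The specification objects, the basis function, and the critic construction are not shown here. A minimal sketch, in which myBasisFcn and the weight vector W are illustrative placeholders (the displayed results below were produced with the original, unshown definitions), might be:
% Observation and action specifications matching the description above
obsInfo = [rlNumericSpec([8 3]) rlNumericSpec([4 1])];
actInfo = rlNumericSpec([2 1]);

% Hypothetical basis function: stack both observation channels, the action, and a bias term
myBasisFcn = @(obsA,obsB,act) [obsA(:); obsB(:); act(:); 1];

% One weight per basis element
W = rand(8*3 + 4 + 2 + 1, 1);

% Q-value critic based on the custom basis function
critic = rlQValueFunction({myBasisFcn,W},obsInfo,actInfo);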
Use getValue to return the value of a random observation-action pair, using the current parameter
matrix.
v = getValue(critic,{rand(8,3),(1:4)'},{rand(2,1)})
v = 68.8628
Create a random observation set of batch size 64 for each channel. The third dimension is the batch
size, while the fourth is the sequence length for any recurrent neural network used by the critic (in
this case not used).
batchobs_ch1 = rand(8,3,64,1);
batchobs_ch2 = rand(4,1,64,1);
batchact = rand(2,1,64,1);
Obtain the state-action value function estimate for the batch of observations and actions.
bv = getValue(critic,{batchobs_ch1,batchobs_ch2},{batchact});
size(bv)
ans = 1×2
1 64
bv(23)
ans = 46.6310
Input Arguments
valueFcnAppx — Value function critic
rlValueFunction object
Value function critic, specified as an rlValueFunction object.
obs — Observations
cell array
Observations, specified as a cell array with as many elements as there are observation input
channels. Each element of obs contains an array of observations for a single observation input
channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
act — Action
single-element cell array
Action, specified as a single-element cell array that contains an array of action values.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Output Arguments
value — Estimated value function
array
Estimated value function, returned as an array.
state — Updated state of critic
cell array
Updated state of the critic, returned as a cell array. If the critic does not use a recurrent neural
network, then state is an empty cell array.
You can set the state of the critic to state using the setState function. For example:
valueFcnAppx = setState(valueFcnAppx,state);
Tips
The more general function evaluate behaves, for critic objects, similarly to getValue, except that
evaluate returns its results inside a single-element cell array.
Version History
Introduced in R2020a
See Also
evaluate | getAction | getMaxQValue
Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
gradient
Package: rl.function
Evaluate gradient of function approximator object given observation and action input data
Syntax
grad = gradient(fcnAppx,'output-input',inData)
grad = gradient(fcnAppx,'output-parameters',inData)
grad = gradient(fcnAppx,lossFcn,inData,fcnData)
Description
grad = gradient(fcnAppx,'output-input',inData) evaluates the gradient of the sum of the
outputs of the function approximator object fcnAppx with respect to its inputs. It returns the value of
the gradient grad when the input of fcnAppx is inData.
grad = gradient(fcnAppx,'output-parameters',inData) evaluates the gradient of the sum of
the outputs of fcnAppx with respect to its learnable parameters.
grad = gradient(fcnAppx,lossFcn,inData,fcnData) evaluates the gradient of the loss function
associated with the function handle lossFcn, with respect to the learnable parameters of fcnAppx. The
last optional argument fcnData can contain additional inputs for the loss function.
Examples
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space with three channels. The first channel carries an observation from a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles. The second channel carries a discrete observation made of a two-dimensional row vector
that can take one of five different values. The third channel carries a discrete scalar observation that
can be either zero or one. Finally, the action space is a continuous four-dimensional space, so that a
single action is a column vector containing four doubles, each between -10 and 10.
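The specification objects are not shown; a sketch consistent with the description above (the five allowed values of the second channel are illustrative) is:
obsInfo = [rlNumericSpec([3 1]) ...                               % continuous 3-D channel
           rlFiniteSetSpec({[1 2],[3 4],[5 6],[7 8],[9 10]}) ...  % discrete 1-by-2 channel (assumed values)
           rlFiniteSetSpec([0 1])];                               % discrete scalar channel
actInfo = rlNumericSpec([4 1],LowerLimit=-10,UpperLimit=10);      % continuous 4-D action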
To approximate the policy within the actor, use a recurrent deep neural network. For a continuous
Gaussian actor, the network must have two output layers (one for the mean values, the other for the
standard deviation values), each having as many elements as the dimension of the action space.
Create the network, defining each path as an array of layer objects. Use sequenceInputLayer as
the input layer and include an lstmLayer as one of the other network layers. Also use a softplus
layer to enforce nonnegativity of the standard deviations and a tanh layer to scale the mean values
to the desired output range. Get the dimensions of the observation and action spaces from the
environment specification objects, and specify a name for the input layers, so you can later explicitly
associate them with the appropriate environment channel.
% Connect layers
net = connectLayers(net,"infc1","cat/in1");
net = connectLayers(net,"infc2","cat/in2");
net = connectLayers(net,"infc3","cat/in3");
net = connectLayers(net,"jntfc","tanhMean/in");
net = connectLayers(net,"jntfc","tanhStdv/in");
% Plot network
plot(net)
% Convert to dlnetwork
net = dlnetwork(net);
Initialized: true
Inputs:
1 'netObsIn1' Sequence input with 3 dimensions
2 'netObsIn2' Sequence input with 2 dimensions
3 'netObsIn3' Sequence input with 1 dimensions
Create the actor with rlContinuousGaussianActor, using the network, the observation and
action specification objects, and the names of the network input and output layers.
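The actor construction is not shown. A sketch, assuming the mean and standard-deviation output layers are named "tanhMean" and "tanhStdv" (the layer names referenced by the connectLayers calls above; if the original network appends further scaling or softplus layers, use those final layer names instead), might be:
actor = rlContinuousGaussianActor(net,obsInfo,actInfo, ...
    ObservationInputNames=["netObsIn1","netObsIn2","netObsIn3"], ...
    ActionMeanOutputNames="tanhMean", ...
    ActionStandardDeviationOutputNames="tanhStdv");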
To return mean value and standard deviations of the Gaussian distribution as a function of the
current observation, use evaluate.
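The evaluate call is not shown; a sketch with a random observation for each channel is:
prob = evaluate(actor, ...
    {rand(obsInfo(1).Dimension), ...
     rand(obsInfo(2).Dimension), ...
     rand(obsInfo(3).Dimension)});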
The result is a cell array with two elements, the first one containing a vector of mean values, and the
second containing a vector of standard deviations.
prob{1}
-1.5454
0.4908
-0.1697
0.8081
prob{2}
0.6913
0.6141
0.7291
0.6475
-3.2003
-0.0534
-1.0700
-0.4032
Calculate the gradients of the sum of the outputs (all the mean values plus all the standard
deviations) with respect to the inputs, given a random observation.
gro = gradient(actor,"output-input", ...
    {rand(obsInfo(1).Dimension) , ...
     rand(obsInfo(2).Dimension) , ...
     rand(obsInfo(3).Dimension)} );
The result is a cell array with as many elements as the number of input channels. Each element
contains the derivatives of the sum of the outputs with respect to each component of the input
channel. Display the gradient with respect to the element of the second channel.
gro{2}
-1.3404
0.6642
Obtain the gradients for a batch of five independent sequences, each one made of nine sequential
observations.
gro_batch = gradient(actor,"output-input", ...
    {rand([obsInfo(1).Dimension 5 9]) , ...
     rand([obsInfo(2).Dimension 5 9]) , ...
     rand([obsInfo(3).Dimension 5 9])} );
Display the derivative of the sum of the outputs with respect to the third observation element of the
first input channel, after the seventh sequential observation in the fourth independent batch.
gro_batch{1}(3,4,7)
ans = single
0.2020
actor = accelerate(actor,true);
Calculate the gradients of the sum of the outputs with respect to the parameters, given a random
observation.
grp = gradient(actor,"output-parameters", ...
    {rand(obsInfo(1).Dimension) , ...
     rand(obsInfo(2).Dimension) , ...
     rand(obsInfo(3).Dimension)} )
{ 4x1 single}
{ 4x1 single}
{ 4x1 single}
{32x12 single}
{32x8 single}
{32x1 single}
{ 4x8 single}
{ 4x1 single}
{ 4x4 single}
{ 4x1 single}
{ 4x4 single}
{ 4x1 single}
Each array within a cell contains the gradient of the sum of the outputs with respect to a group of
parameters.
grp_batch = gradient(actor,"output-parameters", ...
{rand([obsInfo(1).Dimension 5 9]) , ...
rand([obsInfo(2).Dimension 5 9]) , ...
rand([obsInfo(3).Dimension 5 9])} )
If you use a batch of inputs, gradient uses the whole input sequence (in this case nine steps), and
all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space made of two channels. The first channel carries an observation from a
continuous four-dimensional space. The second carries a discrete scalar observation that can be
either zero or one. Finally, the action space consist of a scalar that can be -1, 0, or 1.
obsInfo = [rlNumericSpec([4 1])
rlFiniteSetSpec([0 1])];
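The corresponding action specification is not shown; a one-line sketch consistent with the description above is:
actInfo = rlFiniteSetSpec([-1 0 1]);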
To approximate the vector Q-value function within the critic, use a recurrent deep neural network.
The output layer must have three elements, each one expressing the value of executing the
corresponding action, given the observation.
Create the neural network, defining each network path as an array of layer objects. Get the
dimensions of the observation and action spaces from the environment specification objects, use
sequenceInputLayer as the input layer, and include an lstmLayer as one of the other network
layers.
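The layer arrays themselves are not shown. A sketch consistent with the layer names referenced by the connectLayers calls below (the layer sizes are illustrative) might be:
inPath1 = [sequenceInputLayer(prod(obsInfo(1).Dimension),Name="netObsIn1")
           fullyConnectedLayer(8,Name="infc1")];
inPath2 = [sequenceInputLayer(prod(obsInfo(2).Dimension),Name="netObsIn2")
           fullyConnectedLayer(8,Name="infc2")];
jointPath = [concatenationLayer(1,2,Name="cct")
             lstmLayer(8)
             fullyConnectedLayer(numel(actInfo.Elements))];   % one output per possible action
net = layerGraph;
net = addLayers(net,inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,jointPath);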
% Connect layers
net = connectLayers(net,"infc1","cct/in1");
net = connectLayers(net,"infc2","cct/in2");
% Plot network
plot(net)
% Convert to dlnetwork
net = dlnetwork(net);
Initialized: true
Inputs:
1 'netObsIn1' Sequence input with 4 dimensions
2 'netObsIn2' Sequence input with 1 dimensions
Create the critic with rlVectorQValueFunction, using the network and the observation and action
specification objects.
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
To return the value of the actions as a function of the current observation, use getValue or
evaluate.
val = evaluate(critic, ...
    {rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)});
When you use evaluate, the result is a single-element cell array, containing a vector with the values
of all the possible actions, given the observation.
val{1}
-0.0054
-0.0943
0.0177
Calculate the gradients of the sum of the outputs with respect to the inputs, given a random
observation.
gro = gradient(critic,"output-input", ...
    {rand(obsInfo(1).Dimension) , ...
     rand(obsInfo(2).Dimension)} );
The result is a cell array with as many elements as the number of input channels. Each element
contains the derivative of the sum of the outputs with respect to each component of the input
channel. Display the gradient with respect to the element of the second channel.
gro{2}
ans = single
-0.0396
Obtain the gradients for a batch of five independent sequences, each one made of nine sequential
observations.
gro_batch = gradient(critic,"output-input", ...
    {rand([obsInfo(1).Dimension 5 9]) , ...
     rand([obsInfo(2).Dimension 5 9])} );
Display the derivative of the sum of the outputs with respect to the third observation element of the
first input channel, after the seventh sequential observation in the fourth independent batch.
gro_batch{1}(3,4,7)
ans = single
0.0443
critic = accelerate(critic,true);
Calculate the gradients of the sum of the outputs with respect to the parameters, given a random
observation.
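The gradient call itself is not shown; a sketch using a random observation for each channel (the variable name grp is illustrative) is:
grp = gradient(critic,"output-parameters", ...
    {rand(obsInfo(1).Dimension), ...
     rand(obsInfo(2).Dimension)} );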
Each array within a cell contains the gradient of the sum of the outputs with respect to a group of
parameters.
If you use a batch of inputs, gradient uses the whole input sequence (in this case nine steps), and
all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.
Input Arguments
fcnAppx — Function approximator object
function approximator object
• rlValueFunction,
• rlQValueFunction,
• rlVectorQValueFunction,
• rlDiscreteCategoricalActor,
• rlContinuousDeterministicActor,
• rlContinuousGaussianActor,
• rlContinuousDeterministicTransitionFunction,
• rlContinuousGaussianTransitionFunction,
• rlContinuousDeterministicRewardFunction,
• rlContinuousGaussianRewardFunction,
• rlIsDoneFunction.
inData — Input data for function approximator
cell array
Input data for the function approximator, specified as a cell array with as many elements as the
number of input channels of fcnAppx. In the following section, the number of observation channels is
indicated by NO.
• If fcnAppx is an rlQValueFunction, an
rlContinuousDeterministicTransitionFunction or an
rlContinuousGaussianTransitionFunction object, then each of the first NO elements of
inData must be a matrix representing the current observation from the corresponding
observation channel. They must be followed by a final matrix representing the action.
• If fcnAppx is a function approximator object representing an actor or critic (but not an
rlQValueFunction object), inData must contain NO elements, each one being a matrix
representing the current observation from the corresponding observation channel.
• If fcnAppx is an rlContinuousDeterministicRewardFunction, an
rlContinuousGaussianRewardFunction, or an rlIsDoneFunction object, then each of the
first NO elements of inData must be a matrix representing the current observation from the
corresponding observation channel. They must be followed by a matrix representing the action,
and finally by NO elements, each one being a matrix representing the next observation from the
corresponding observation channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Example: {rand(8,3,64,1),rand(4,1,64,1),rand(2,1,64,1)}
lossFcn — Loss function
function handle
Loss function, specified as a function handle to a user-defined function. The user-defined function can
either be an anonymous function or a function on the MATLAB path. The function's first input
parameter must be a cell array like the one returned from the evaluation of fcnAppx. For more
information, see the description of outData in evaluate. The second, optional, input argument of
lossFcn contains additional data that might be needed for the gradient calculation, as described
below in fcnData. For an example of the signature that this function must have, see “Train
Reinforcement Learning Policy Using Custom Training Loop”.
fcnData — Additional data for loss function
any MATLAB data type
Additional data for the loss function, specified as any MATLAB data type, typically a structure or cell
array. For an example, see “Train Reinforcement Learning Policy Using Custom Training Loop”.
Output Arguments
grad — Value of the gradient
cell array
When the type of gradient is from the sum of the outputs with respect to the inputs of fcnAppx, then
grad is a cell array in which each element contains the gradient of the sum of all the outputs with
respect to the corresponding input channel.
When the type of gradient is from the output with respect to the parameters of fcnAppx, then grad
is a cell array in which each element contains the gradient of the sum of outputs belonging to an
output channel with respect to the corresponding group of parameters. The gradient is calculated
using the whole history of LS inputs, and all the LB gradients with respect to the independent input
sequences are added together in grad. Therefore, grad always has the same size as the result from
getLearnableParameters.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Version History
Introduced in R2022a
See Also
evaluate | accelerate | getLearnableParameters | rlValueFunction | rlQValueFunction
| rlVectorQValueFunction | rlContinuousDeterministicActor |
rlDiscreteCategoricalActor | rlContinuousGaussianActor |
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction
Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
hyperbolicPenalty
Hyperbolic penalty value for a point with respect to a bounded region
Syntax
p = hyperbolicPenalty(x,xmin,xmax)
p = hyperbolicPenalty( ___ ,lambda,tau)
Description
p = hyperbolicPenalty(x,xmin,xmax) calculates the nonnegative (hyperbolic) penalty vector p
for the point x with respect to the region bounded by xmin and xmax. p has the same dimension as x.
This syntax uses the default values of 1 and 0.1 for the lambda and tau parameters of the
hyperbolic function, respectively.
p = hyperbolicPenalty( ___ ,lambda,tau) specifies both the lambda and tau parameters of
the hyperbolic function. If lambda is an empty matrix, its default value is used. Likewise, if tau is an
empty matrix or is omitted, its default value is used.
Examples
This example shows how to use the hyperbolicPenalty function to calculate the hyperbolic
penalty for a given point with respect to a bounded region.
Calculate the penalty value for the point 0.1 within the interval [-2,2], using default values for the
lambda and tau parameters.
hyperbolicPenalty(0.1,-2,2)
ans = 0.0050
Calculate the penalty value for the point 4 outside the interval [-2,2].
hyperbolicPenalty(4,-2,2)
ans = 4.0033
Calculate the penalty value for the point 0.1 within the interval [-2,2], using a lambda parameter of
5.
hyperbolicPenalty(0.1,-2,2,5)
ans = 0.0010
Calculate the penalty value for the point 4 outside the interval [-2,2], using a lambda parameter of 5.
hyperbolicPenalty(4,-2,2,5)
ans = 20.0007
Calculate the penalty value for the point 4 outside the interval [-2,2], using a lambda parameter of 5 and a tau parameter of 0.5.
hyperbolicPenalty(4,-2,2,5,0.5)
ans = 20.0167
Calculate the penalty value for the point [-2,0,4] with respect to the box defined by the intervals
[0,1], [-1,1], and [-2,2] along the x, y, and z dimensions, respectively, using the default value for
lambda and a tau parameter of 0.
hyperbolicPenalty([-2 0 4]',[0 -1 -2]',[1 1 2]',[],0)
ans = 3×1
4
0
4
Create a vector of input points between -5 and 5.
x = -5:0.01:5;
Calculate penalties for all the points in the vector, using default values for the lambda and tau
parameters.
p = hyperbolicPenalty(x,-2,2);
plot(x,p)
grid
xlabel("point position");
ylabel("penalty value");
title("Penalty values over an interval");
Input Arguments
x — Point for which the penalty is calculated
scalar | vector | matrix
Point for which the penalty is calculated, specified as a numeric scalar, vector or matrix.
Example: [0.5; 1.6]
xmin — Lower bounds for x
scalar | vector | matrix
Lower bounds for x, specified as a numeric scalar, vector, or matrix. To use the same minimum value
for all elements in x, specify xmin as a scalar.
Example: -1
xmax — Upper bounds for x
scalar | vector | matrix
Upper bounds for x, specified as a numeric scalar, vector, or matrix. To use the same maximum value
for all elements in x, specify xmax as a scalar.
Example: 2
Output Arguments
p — Penalty value
nonnegative vector
Penalty value, returned as a vector of nonnegative elements. Each element p_i depends on the position
of x_i with respect to the interval specified by xmin_i and xmax_i. The hyperbolic penalty function
returns the value:
$$p(x) = -\lambda\,(x - x_{\min}) + \sqrt{\lambda^{2}(x - x_{\min})^{2} + \tau^{2}} \;-\; \lambda\,(x_{\max} - x) + \sqrt{\lambda^{2}(x_{\max} - x)^{2} + \tau^{2}}$$
Here, λ is the argument lambda, and τ is the argument tau. Note that for positive values of τ the
returned penalty value is always positive, because on the right side of the equation the magnitude of
the second term is always greater than that of the first, and the magnitude of the fourth term is
always greater than that of the third. If τ is zero, then the returned penalty is zero inside the interval
defined by the bounds, and it grows linearly with x outside this interval. If x is multidimensional, then
the calculation is applied independently on each dimension. Penalty functions are typically used to
generate negative rewards when constraints are violated, such as in generateRewardFunction.
Version History
Introduced in R2021b
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
See Also
Functions
generateRewardFunction | exteriorPenalty | barrierPenalty
Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Define Reward Signals”
inspectTrainingResult
Plot training information from a previous training session
Syntax
inspectTrainingResult(trainResults)
inspectTrainingResult(agentResults)
Description
By default, the train function shows the training progress and results in the Episode Manager
during training. If you configure training to not show the Episode Manager or you close the Episode
Manager after training, you can view the training results using the inspectTrainingResult
function, which opens the Episode Manager. You can also use inspectTrainingResult to view the
training results for agents saved during training.
Examples
For this example, assume that you have trained the agent in the “Train Reinforcement Learning
Agent in MDP Environment” example and subsequently closed the Episode Manager.
inspectTrainingResult(trainingStats)
For this example, load the environment and agent for the “Train Reinforcement Learning Agent in
MDP Environment” example.
load mdpAgentAndEnvironment
Specify options for training the agent. Configure the SaveAgentCriteria and SaveAgentValue
options to save all agents after episode 30.
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 50;
trainOpts.MaxEpisodes = 50;
trainOpts.Plots = "none";
trainOpts.SaveAgentCriteria = "EpisodeCount";
trainOpts.SaveAgentValue = 30;
Train the agent. During training, once the episode count reaches 30, a copy of the agent is saved in
the savedAgents folder at the end of each episode.
trainingStats = train(agent,env,trainOpts);
Load the training results for one of the saved agents. This command loads both the agent and a
structure that contains the corresponding training results.
load savedAgents/Agent50
View the training results from the saved agent result structure.
inspectTrainingResult(savedAgentResult)
The Episode Manager shows the training progress up to the episode in which the agent was saved.
Input Arguments
trainResults — Training episode data
structure | structure array
Training episode data, specified as a structure or structure array returned by the train function.
agentResults — Saved agent results
structure
Saved agent results, specified as a structure previously saved by the train function. The train
function saves agents when you specify the SaveAgentCriteria and SaveAgentValue options in
the rlTrainingOptions object used during training.
When you load a saved agent, the agent and its training results are added to the MATLAB workspace
as saved_agent and savedAgentResultStruct, respectively. To plot the training data for this
agent, use the following command.
inspectTrainingResult(savedAgentResultStruct)
For multi-agent training, savedAgentResultStruct contains structure fields with training results
for all the trained agents.
Version History
Introduced in R2021a
See Also
Functions
train
Topics
“Train Reinforcement Learning Agents”
predict
Package: rl.function
Predict next observation, next reward, or episode termination given observation and action input data
Syntax
predNextObs = predict(tsnFcnAppx,obs,act)
predReward = predict(rwdFcnAppx,obs,act,nextObs)
predIsDone = predict(idnFcnAppx,obs,act,nextObs)
Description
predNextObs = predict(tsnFcnAppx,obs,act) evaluates the environment transition function
approximator object tsnFcnAppx and returns the predicted next observation predNextObs, given the
current observation obs and the action act.
predReward = predict(rwdFcnAppx,obs,act,nextObs) evaluates the environment reward function
approximator object rwdFcnAppx and returns the predicted reward predReward, given the current
observation obs, the action act, and the next observation nextObs.
predIsDone = predict(idnFcnAppx,obs,act,nextObs) evaluates the environment is-done function
approximator object idnFcnAppx and returns the predicted termination status predIsDone, given the
current observation obs, the action act, and the next observation nextObs.
Examples
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, two
observation channels carry vectors in a four- and two-dimensional space, respectively. The action is a
continuous three-dimensional vector.
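The specification objects are not shown; a sketch consistent with the description above is:
obsInfo = [rlNumericSpec([4 1]) rlNumericSpec([2 1])];
actInfo = rlNumericSpec([3 1]);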
Create a deep neural network to use as an approximation model for the transition function approximator.
For a continuous Gaussian transition function approximator, the network must have two output layers
for each observation channel (one for the mean values, the other for the standard deviation values).
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel.
inPath1 = [ featureInputLayer( ...
    prod(obsInfo(1).Dimension), ...
    Name="netObsIn1")
    fullyConnectedLayer(5,Name="infc1") ];
net = addLayers(net,sdevPath2);
% Connect layers
net = connectLayers(net,"infc1","concat/in1");
net = connectLayers(net,"infc2","concat/in2");
net = connectLayers(net,"infc3","concat/in3");
net = connectLayers(net,"jntfc","tanhMean1/in");
net = connectLayers(net,"jntfc","tanhStdv1/in");
net = connectLayers(net,"jntfc","tanhMean2/in");
net = connectLayers(net,"jntfc","tanhStdv2/in");
% Plot network
plot(net)
% Convert to dlnetwork
net=dlnetwork(net);
Initialized: true
Inputs:
1 'netObsIn1' 4 features
2 'netObsIn2' 2 features
3 'netActIn' 3 features
Create a continuous Gaussian transition function approximator object, specifying the names of all the
input and output layers.
tsnFcnAppx = rlContinuousGaussianTransitionFunction(...
net,obsInfo,actInfo,...
ObservationInputNames=["netObsIn1","netObsIn2"], ...
ActionInputNames="netActIn", ...
NextObservationMeanOutputNames=["scale1","scale2"], ...
NextObservationStandardDeviationOutputNames=["splus1","splus2"] );
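The predict call itself is not shown; a sketch using random inputs (the variable name predObs is illustrative) is:
predObs = predict(tsnFcnAppx, ...
    {rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)}, ...
    {rand(actInfo(1).Dimension)});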
Each element of the resulting cell array represents the prediction for the corresponding observation
channel.
To display the mean values and standard deviations of the Gaussian probability distribution for the
predicted observations, use evaluate.
predDst = evaluate(tsnFcnAppx, ...
{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension), ...
rand(actInfo(1).Dimension)})
The result is a cell array in which the first and second element represent the mean values for the
predicted observations in the first and second channel, respectively. The third and fourth element
represent the standard deviations for the predicted observations in the first and second channel,
respectively.
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the reward function, create a deep neural network. For this example, the network has
two input channels, one for the current action and one for the next observations. The single output
channel contains a scalar, which represents the value of the predicted reward.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specifications, and specify a name for the input layers, so you can
later explicitly associate them with the appropriate environment channel.
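The next-observation and action input paths are not shown. A sketch consistent with the layer names used in the connectLayers calls below is:
nextStatePath = featureInputLayer(prod(obsInfo.Dimension),Name="nextState");
actionPath = featureInputLayer(prod(actInfo.Dimension),Name="action");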
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC2")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu3")
fullyConnectedLayer(1,Name="reward")];
net = layerGraph(nextStatePath);
net = addLayers(net,actionPath);
net = addLayers(net,commonPath);
net = connectLayers(net,"nextState","concat/in1");
net = connectLayers(net,"action","concat/in2");
plot(net)
net = dlnetwork(net);
summary(net);
Initialized: true
Inputs:
1 'nextState' 4 features
2 'action' 1 features
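The reward function approximator construction is not shown. A sketch, assuming an rlContinuousDeterministicRewardFunction object built from this network, might be:
rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,obsInfo,actInfo, ...
    ActionInputNames="action", ...
    NextObservationInputNames="nextState");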
Using this reward function object, you can predict the next reward value based on the current action
and next observation. For example, predict the reward for a random action and next observation.
Since, for this example, only the action and the next observation influence the reward, use an empty
cell array for the current observation.
act = rand(actInfo.Dimension);
nxtobs = rand(obsInfo.Dimension);
reward = predict(rwdFcnAppx,{}, {act}, {nxtobs})
reward = single
0.1034
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the is-done function, use a deep neural network. The network has one input channel
for the next observations. The single output channel is for the predicted termination signal.
commonPath = [featureInputLayer(prod(obsInfo.Dimension),Name="nextState")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(2,Name="isdone0")
softmaxLayer(Name="isdone")];
net = layerGraph(commonPath);
plot(net)
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net);
Initialized: true
Inputs:
1 'nextState' 4 features
isDoneFcnAppx = rlIsDoneFunction(...
net,obsInfo,actInfo,...
NextObservationInputNames="nextState");
Using this is-done function approximator object, you can predict the termination signal based on the
next observation. For example, predict the termination signal for a random next observation. Since
for this example the termination signal only depends on the next observation, use empty cell arrays
for the current action and observation inputs.
nxtobs = rand(obsInfo.Dimension);
predIsDone = predict(isDoneFcnAppx,{},{},{nxtobs})
predIsDone = 0
To also return the probabilities of both termination statuses, use evaluate.
predIsDoneProb = evaluate(isDoneFcnAppx,{nxtobs});
predIsDoneProb{1}
0.5405
0.4595
The first number is the probability of obtaining a 0 (no termination predicted), the second one is the
probability of obtaining a 1 (termination predicted).
Input Arguments
tsnFcnAppx — Environment transition function approximator object
rlContinuousDeterministicTransitionFunction object |
rlContinuousGaussianTransitionFunction object
Environment transition function approximator object, specified as one of the following:
• rlContinuousDeterministicTransitionFunction object
• rlContinuousGaussianTransitionFunction object
rwdFcnAppx — Environment reward function approximator object
rlContinuousDeterministicRewardFunction object | rlContinuousGaussianRewardFunction object | function handle
Environment reward function approximator object, specified as one of the following:
• rlContinuousDeterministicRewardFunction object
• rlContinuousGaussianRewardFunction object
• Function handle object. For more information about function handle objects, see “What Is a
Function Handle?”.
obs — Observations
cell array
Observations, specified as a cell array with as many elements as there are observation input
channels. Each element of obs contains an array of observations for a single observation input
channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
act — Action
single-element cell array
Action, specified as a single-element cell array that contains an array of action values.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
nextObs — Next observations
cell array
Next observations, that is, the observation following the action act from the observation obs,
specified as a cell array of the same dimension as obs.
Output Arguments
predNextObs — Predicted next observation
cell array
Predicted next observation, that is, the observation predicted by the transition function approximator
tsnFcnAppx given the current observation obs and the action act, returned as a cell array of the
same dimension as obs.
predReward — Predicted reward
single
Predicted reward, that is, the reward predicted by the reward function approximator rwdFcnAppx
given the current observation obs, the action act, and the following observation nextObs, returned
as a single.
predIsDone — Predicted episode termination status
double
Predicted is-done episode status, that is, the episode termination status predicted by the is-done
function approximator idnFcnAppx given the current observation obs, the action act, and the
following observation nextObs, returned as a double.
Version History
Introduced in R2022a
See Also
Objects
rlNeuralNetworkEnvironment | rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | evaluate | accelerate | gradient
Topics
“Model-Based Policy Optimization Agents”
reset
Package: rl.policy
Syntax
initialObs = reset(env)
reset(agent)
agent = reset(agent)
resetPolicy = reset(policy)
reset(buffer)
Description
initialObs = reset(env) resets the specified MATLAB environment to an initial state and
returns the resulting initial observation value.
Do not use reset for Simulink environments, which are implicitly reset when running a new
simulation. Instead, customize the reset behavior using the ResetFcn property of the environment.
reset(agent) resets the specified agent. Resetting a built-in agent performs the following actions,
if applicable.
• Empty the experience buffer.
• Set the recurrent neural network states of the actor and critic networks to zero.
resetPolicy = reset(policy) returns the policy object resetPolicy in which any recurrent
neural network states are set to zero and any noise model states are set to their initial conditions.
This syntax has no effect if the policy object does not use a recurrent neural network and does not
have a noise model with state.
reset(buffer) resets the specified replay memory buffer by removing all the experiences.
Examples
Reset Environment
Create a reinforcement learning environment. For this example, create a cart-pole environment with a
continuous action space.
env = rlPredefinedEnv("CartPole-Continuous");
initialObs = reset(env)
initialObs = 4×1
0
0
0.0315
0
Reset Agent
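The observation and action specifications used by this example are not shown. As an assumption, specifications with finite limits (the dimensions and limit values below are illustrative) keep the limit-scaled random experiences used later in this example bounded:
obsInfo = rlNumericSpec([4 1],LowerLimit=-10,UpperLimit=10);   % assumed observation specification
actInfo = rlNumericSpec([1 1],LowerLimit=-1,UpperLimit=1);     % assumed continuous action specification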
initOptions = rlAgentInitializationOptions(UseRNN=true);
agent = rlDDPGAgent(obsInfo,actInfo,initOptions);
agent = reset(agent);
Reset Experience Buffer
Create a replay memory experience buffer.
buffer = rlReplayMemory(obsInfo,actInfo,10000);
Add experiences to the buffer. For this example, add 20 random experiences.
for i = 1:20
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(4,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(1,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(4,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(20).IsDone = 1;
append(buffer,expBatch);
reset(buffer)
Reset Policy
To approximate the Q-value function within the critic, use a deep neural network. Create each
network path as an array of layer objects.
% Create Paths
obsPath = [featureInputLayer(4)
fullyConnectedLayer(1,Name="obsout")];
actPath = [featureInputLayer(1)
fullyConnectedLayer(1,Name="actout")];
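% comPath (referenced below) is not shown here; a minimal definition
% consistent with the 9 learnable parameters reported below is:
comPath = [additionLayer(2,Name="add")
           fullyConnectedLayer(1)];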
% Add Layers
net = layerGraph;
net = addLayers(net,obsPath);
net = addLayers(net,actPath);
net = addLayers(net,comPath);
net = connectLayers(net,"obsout","add/in1");
net = connectLayers(net,"actout","add/in2");
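% Convert to a dlnetwork object and display a summary (assumed step that
% produces the output shown below).
net = dlnetwork(net);
summary(net)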
Initialized: true
Number of learnables: 9
Inputs:
1 'input' 4 features
2 'input_1' 1 features
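The specification objects and the critic and policy construction are not shown. A sketch consistent with the network above and with the rlEpsilonGreedyPolicy display below (the discrete action values are assumptions) might be:
obsInfo = rlNumericSpec([4 1]);       % four observation features
actInfo = rlFiniteSetSpec([-1 1]);    % scalar discrete action (assumed set)
critic = rlQValueFunction(net,obsInfo,actInfo, ...
    ObservationInputNames="input",ActionInputNames="input_1");
policy = rlEpsilonGreedyPolicy(critic)
You can then reset the policy with policy = reset(policy).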
policy =
rlEpsilonGreedyPolicy with properties:
Input Arguments
env — Reinforcement learning environment
environment object | ...
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• rlMBPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.
Note agent is a handle object, so it is reset whether it is returned as an output argument or not. For
more information about handle objects, see “Handle Object Behavior”.
• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy
Output Arguments
initialObs — Initial environment observation
array | cell array
• Array with dimensions matching the observation specification for an environment with a single
observation channel.
• Cell array with length equal to the number of observation channels for an environment with
multiple observation channels. Each element of the cell array contains an array with dimensions
matching the corresponding element of the environment observation specifications.
resetPolicy — Reset policy
policy object
Reset policy, returned as a policy object of the same type as the input policy, but with its recurrent
neural network states set to zero and any noise model states set to their initial conditions.
Reset agent, returned as an agent object. Note that agent is a handle object. Therefore, if it contains
any recurrent neural network, its state is reset whether agent is returned as an output argument or
not. For more information about handle objects, see “Handle Object Behavior”.
Version History
Introduced in R2022a
See Also
runEpisode | setup | cleanup
resize
Package: rl.replay
Syntax
resize(buffer,maxLength)
Description
resize(buffer,maxLength) resizes experience buffer buffer to have a maximum length of
maxLength.
• If maxLength is greater than or equal to the number of experiences stored in the buffer, then
buffer retains its stored experiences.
• If maxLength is less than the number of experiences stored in the buffer, then buffer retains
only the maxLength most recent experiences.
Examples
Create an environment for training the agent. For this example, load a predefined environment.
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
agent = rlDQNAgent(obsInfo,actInfo);
By default, the agent uses an experience buffer with a maximum size of 10,000.
agent.ExperienceBuffer
ans =
rlReplayMemory with properties:
MaxLength: 10000
Length: 0
Increase the maximum length of the experience buffer to 20,000.
resize(agent.ExperienceBuffer,20000)
agent.ExperienceBuffer
ans =
rlReplayMemory with properties:
MaxLength: 20000
Length: 0
Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object
Experience buffer, specified as an rlReplayMemory or rlPrioritizedReplayMemory object.
Version History
Introduced in R2022b
See Also
rlReplayMemory | rlPrioritizedReplayMemory
rlCreateEnvTemplate
Create custom reinforcement learning environment template
Syntax
rlCreateEnvTemplate(className)
Description
rlCreateEnvTemplate(className) creates and opens a MATLAB script that contains a template
class representing a reinforcement learning environment. The template class contains an
implementation of a simple cart-pole balancing environment. To define your custom environment,
modify this template class. For more information, see “Create Custom MATLAB Environment from
Template”.
Examples
This example shows how to create and open a template file for a reinforcement learning environment.
rlCreateEnvTemplate("myEnvClass")
This function opens a MATLAB® script that contains the class. By default, this template class
describes a simple cart-pole environment.
Input Arguments
className — Name of environment class
string | character vector
Name of environment class, specified as a string or character vector. This name defines the name of
the class and the name of the MATLAB script.
Version History
Introduced in R2019a
See Also
Topics
“Create MATLAB Reinforcement Learning Environments”
rlDataLogger
Creates either a file logger object or a monitor logger object to log training data
Syntax
fileLgr = rlDataLogger()
monLgr = rlDataLogger(tpm)
Description
fileLgr = rlDataLogger() creates the FileLogger object fileLgr for logging training data to
disk.
monLgr = rlDataLogger(tpm) creates the MonitorLogger object monLgr for logging training
data to the TrainingProgressMonitor object tpm, and its associated window.
Examples
This example shows how to log data to disk when using train.
logger = rlDataLogger();
logger.LoggingOptions.LoggingDirectory = "myDataLog";
Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object. For a related example, see “Log
Training Data To Disk”.
logger.EpisodeFinishedFcn = @myEpisodeFinishedFcn;
logger.AgentStepFinishedFcn = @myAgentStepFinishedFcn;
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;
To train the agent, you can now call train, passing logger as an argument such as in the following
command.
trainResult = train(agent, env, trainOpts, Logger=logger);
While the training progresses, data will be logged to the specified directory, according to the rule
specified in the FileNameRule property of logger.LoggingOptions.
logger.LoggingOptions.FileNameRule
ans =
"loggedData<id>"
This example shows how to log and visualize data to the window of a trainingProgressMonitor
object when using train.
Create a trainingProgressMonitor object. Creating the object also opens a window associated
with the object.
monitor = trainingProgressMonitor();
logger = rlDataLogger(monitor);
Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object.
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;
To train the agent, you can now call train, passing logger as an argument such as in the following
command.
trainResult = train(agent, env, trainOpts, Logger=logger);
While the training progresses, data will be logged to the training monitor object, and visualized in the
associated window.
Note that only scalar data can be logged with a monitor logger object.
Define a logging function that logs data periodically at the completion of the learning subroutine.
function dataToLog = myAgentLearnFinishedFcn(data)
if mod(data.AgentLearnCount, 2) == 0
dataToLog.ActorLoss = data.ActorLoss;
dataToLog.CriticLoss = data.CriticLoss;
else
dataToLog = [];
end
end
This example shows how to log data to disk when training an agent using a custom training loop.
Create a FileLogger object using rlDataLogger.
flgr = rlDataLogger();
Set up the logger object. This operation initializes the object, performing setup tasks such as, for
example, creating the directory to save the data files.
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10
    % Store random data in the logger memory under two context names
    store(flgr,"Context1",rand,iter);
    store(flgr,"Context2",rand(2,1),iter);
    % Write the data stored for this iteration to a MAT file
    write(flgr);
end
Clean up the logger object. This operation performs clean up tasks like for example writing to file any
data still in memory.
cleanup(flgr);
Input Arguments
tpm — Training progress monitor object
trainingProgressMonitor object
Output Arguments
fileLgr — File logger object
FileLogger object
Limitations
• Only scalar data is supported when logging data with a MonitorLogger object. The structure
returned by the callback functions must contain fields with scalar data.
• Resuming of training from a previous training result is not supported when logging data with a
MonitorLogger object.
• Logging data using the AgentStepFinishedFcn callback is not supported when training agents
in parallel with the train function.
Version History
Introduced in R2022b
See Also
Functions
train
Objects
FileLogger | MonitorLogger | trainingProgressMonitor
Topics
“Log Training Data To Disk”
“Monitor Custom Training Loop Progress”
rlOptimizer
Creates an optimizer object for actors and critics
Syntax
algobj = rlOptimizer
algobj = rlOptimizer(algOptions)
Description
Create an optimizer object that updates the learnable parameters of an actor or critic in a custom
training loop.
algobj = rlOptimizer creates a default optimizer object. You can modify the object properties
using dot notation.
algobj = rlOptimizer(algOptions) creates an optimizer object with the type and properties
specified by the optimizer options object algOptions.
Examples
Use rlOptimizer to create a default optimizer algorithm object to use for the training of an actor or
critic in a custom training loop.
myAlg = rlOptimizer
myAlg =
rlADAMOptimizer with properties:
GradientDecayFactor: 0.9000
SquaredGradientDecayFactor: 0.9990
Epsilon: 1.0000e-08
LearnRate: 0.0100
L2RegularizationFactor: 1.0000e-04
GradientThreshold: Inf
GradientThresholdMethod: "l2norm"
By default, the function returns an rlADAMOptimizer object with default options. You can use dot
notation to change some parameters.
myAlg.LearnRate = 0.1;
You can now create a structure and set its CriticOptimizer or ActorOptimizer field to myAlg.
When you call runEpisode, pass the structure as an input parameter. The runEpisode function can
then use the update method of myAlg to update the learnable parameters of your actor or critic.
Use rlOptimizer to create an optimizer algorithm object to use for the training of an actor or critic
in a custom training loop. Specify the optimizer option set myOptions as an input argument.
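The option set myOptions is not shown. A sketch consistent with the rlRMSPropOptimizer display below (the learning rate is taken from that display) is:
myOptions = rlOptimizerOptions( ...
    Algorithm="rmsprop", ...
    LearnRate=0.2);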
myAlg=rlOptimizer(myOptions)
myAlg =
rlRMSPropOptimizer with properties:
SquaredGradientDecayFactor: 0.9990
Epsilon: 1.0000e-08
LearnRate: 0.2000
L2RegularizationFactor: 1.0000e-04
GradientThreshold: Inf
GradientThresholdMethod: "l2norm"
The function returns an rlRMSPropOptimizer object configured with the specified options. You can use dot notation
to change some parameters.
myAlg.GradientThreshold = 2;
You can now create a structure and set its CriticOptimizer or ActorOptimizer field to myAlg.
When you call runEpisode, pass the structure as an input parameter. The runEpisode function can
then use the update method of myAlg to update the learnable parameters of your actor or critic.
Input Arguments
algOptions — Algorithm options object
default Adam option set (default) | rlOptimizerOptions object
Output Arguments
algobj — Algorithm optimizer object
rlADAMOptimizer object | rlSGDMOptimizer object | rlRMSPropOptimizer object
Version History
Introduced in R2022a
See Also
Functions
rlOptimizerOptions
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlPredefinedEnv
Create a predefined reinforcement learning environment
Syntax
env = rlPredefinedEnv(keyword)
Description
env = rlPredefinedEnv(keyword) takes a predefined keyword keyword representing the
environment name to create a MATLAB or Simulink reinforcement learning environment env. The
environment env models the dynamics with which the agent interacts, generating rewards and
observations in response to agent actions.
Examples
Use the predefined 'BasicGridWorld' keyword to create a basic grid world reinforcement learning
environment.
env = rlPredefinedEnv('BasicGridWorld')
env =
rlMDPEnv with properties:
Use the 'DoubleIntegrator-Continuous' keyword to create a continuous double integrator
environment.
env = rlPredefinedEnv('DoubleIntegrator-Continuous')
env =
DoubleIntegratorContinuousAction with properties:
Gain: 1
Ts: 0.1000
MaxDistance: 5
GoalThreshold: 0.0100
Q: [2x2 double]
R: 0.0100
MaxForce: Inf
State: [2x1 double]
You can visualize the environment using the plot function and interact with it using the reset and
step functions.
plot(env)
observation = reset(env)
observation = 2×1
4
0
[observation,reward,isDone] = step(env,16)
observation = 2×1
4.0800
1.6000
reward = -16.5559
isDone = logical
0
Use a Simulink keyword, such as 'SimplePendulumModel-Discrete', to create an environment
based on a Simulink model.
env = rlPredefinedEnv('SimplePendulumModel-Discrete')
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
Input Arguments
keyword — Predefined keyword representing the environment name
'BasicGridWorld' | 'CartPole-Discrete' | 'DoubleIntegrator-Continuous' |
'SimplePendulumWithImage-Discrete' | 'SimplePendulumModel-Discrete' |
'SimplePendulumModel-Continuous' | 'CartPoleSimscapeModel-Continuous' | ...
Predefined keyword representing the environment name, specified as one of the following:
MATLAB Environment
• 'BasicGridWorld'
• 'CartPole-Discrete'
• 'CartPole-Continuous'
• 'DoubleIntegrator-Discrete'
• 'DoubleIntegrator-Continuous'
• 'SimplePendulumWithImage-Discrete'
• 'SimplePendulumWithImage-Continuous'
• 'WaterFallGridWorld-Stochastic'
• 'WaterFallGridWorld-Deterministic'
Simulink Environment
• 'SimplePendulumModel-Discrete'
• 'SimplePendulumModel-Continuous'
• 'CartPoleSimscapeModel-Discrete'
• 'CartPoleSimscapeModel-Continuous'
Output Arguments
env — MATLAB or Simulink environment object
rlMDPEnv object | CartPoleDiscreteAction object | CartPoleContinuousAction object |
DoubleIntegratorDiscreteAction object | DoubleIntegratorContinuousAction object |
SimplePendlumWithImageDiscreteAction object |
SimplePendlumWithImageContinuousAction object | SimulinkEnvWithAgent object
• rlMDPEnv object, when you use one of the following keywords.
  • 'BasicGridWorld'
  • 'WaterFallGridWorld-Stochastic'
  • 'WaterFallGridWorld-Deterministic'
• CartPoleDiscreteAction object, when you use the 'CartPole-Discrete' keyword.
• CartPoleContinuousAction object, when you use the 'CartPole-Continuous' keyword.
• DoubleIntegratorDiscreteAction object, when you use the 'DoubleIntegrator-
Discrete' keyword.
• SimulinkEnvWithAgent object, when you use one of the following keywords.
  • 'SimplePendulumModel-Discrete'
  • 'SimplePendulumModel-Continuous'
  • 'CartPoleSimscapeModel-Discrete'
  • 'CartPoleSimscapeModel-Continuous'
Version History
Introduced in R2019a
See Also
Topics
“Create MATLAB Reinforcement Learning Environments”
“Create Simulink Reinforcement Learning Environments”
“Load Predefined Control System Environments”
“Load Predefined Simulink Environments”
rlRepresentation
(Not recommended) Model representation for reinforcement learning agents
Syntax
rep = rlRepresentation(net,obsInfo,'Observation',obsNames)
rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',
actNames)
tableCritic = rlRepresentation(tab)
critic = rlRepresentation(basisFcn,W0,obsInfo)
critic = rlRepresentation(basisFcn,W0,oaInfo)
actor = rlRepresentation(basisFcn,W0,obsInfo,actInfo)
Description
Use rlRepresentation to create a function approximator representation for the actor or critic of a
reinforcement learning agent. To do so, you specify the observation and action signals for the training
environment and options that affect the training of an agent that uses the representation. For more
information on creating representations, see “Create Policies and Value Functions”.
rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',
actNames) creates a representation with action signals specified by the names actNames and
specification actInfo. Use this syntax to create a representation for any actor, or for a critic that
takes both observation and action as input, such as a critic for an rlDQNAgent or rlDDPGAgent
agent.
rep = rlRepresentation(net,obsInfo,'Observation',obsNames) creates a representation with
observation signals specified by the names obsNames and specification obsInfo. Use this syntax to
create a representation for a critic that does not require action inputs, such as a critic for an
rlACAgent or rlPGAgent agent.
rep = rlRepresentation( ___ ,repOpts) creates a representation using additional options that
specify learning parameters for the representation when you train an agent. Available options include
the optimizer used for training and the learning rate. Use rlRepresentationOptions to create the
options set repOpts. You can use this syntax with any of the previous input-argument combinations.
Examples
Create an actor representation and a critic representation that you can use to define a reinforcement
learning agent such as an Actor Critic (AC) agent.
For this example, create actor and critic representations for an agent that can be trained against the
cart-pole environment described in “Train AC Agent to Balance Cart-Pole System”. First, create the
environment. Then, extract the observation and action specifications from the environment. You need
these specifications to define the agent and critic representations.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For a state-value-function critic such as those used for AC or PG agents, the inputs are the
observations and the output should be a scalar value, the state value. For this example, create the
critic representation using a deep neural network with one output, and with observation signals
corresponding to x,xdot,theta,thetadot as described in “Train AC Agent to Balance Cart-Pole
System”. You can obtain the number of observations from the obsInfo specification. Name the
network layer input 'observation'.
numObservation = obsInfo.Dimension(1);
criticNetwork = [
imageInputLayer([numObservation 1 1],'Normalization','none','Name','observation')
fullyConnectedLayer(1,'Name','CriticFC')];
Specify options for the critic representation using rlRepresentationOptions. These options
control parameters of critic network learning, when you train an agent that incorporates the critic
representation. For this example, set the learning rate to 0.05 and the gradient threshold to 1.
repOpts = rlRepresentationOptions('LearnRate',5e-2,'GradientThreshold',1);
Create the critic representation using the specified neural network and options. Also, specify the
action and observation information for the critic. Set the observation name to 'observation',
which is the name you used when you created the network input layer for criticNetwork.
critic = rlRepresentation(criticNetwork,obsInfo,'Observation',{'observation'},repOpts)
critic =
rlValueRepresentation with properties:
Similarly, create a network for the actor. An AC agent decides which action to take given observations
using an actor representation. For an actor, the inputs are the observations, and the output depends
on whether the action space is discrete or continuous. For the actor of this example, there are two
possible discrete actions, –10 or 10. Thus, to create the actor, use a deep neural network with the
same observation input as the critic, that can output these two values. You can obtain the number of
actions from the actInfo specification. Name the output 'action'.
numAction = numel(actInfo.Elements);
actorNetwork = [
imageInputLayer([4 1 1], 'Normalization','none','Name','observation')
fullyConnectedLayer(numAction,'Name','action')];
Create the actor representation using the observation name and specification and the action name
and specification. Use the same representation options.
actor = rlRepresentation(actorNetwork,obsInfo,actInfo,...
'Observation',{'observation'},'Action',{'action'},repOpts)
actor =
rlStochasticActorRepresentation with properties:
You can now use the actor and critic representations to create an AC agent.
agentOpts = rlACAgentOptions(...
'NumStepsToLookAhead',32,...
'DiscountFactor',0.99);
agent = rlACAgent(actor,critic,agentOpts)
agent =
rlACAgent with properties:
Create a Q table using the action and observation specifications from the environment.
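The environment itself is not shown; assuming, for example, the predefined basic grid world environment:
env = rlPredefinedEnv("BasicGridWorld");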
qTable = rlTable(getObservationInfo(env),getActionInfo(env));
tableRep = rlRepresentation(qTable);
This example shows how to create a linear basis function critic representation.
Assume that you have an environment, env. For this example, load the environment used in the
“Train Custom LQR Agent” example.
load myLQREnv.mat
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a custom basis function. In this case, use the quadratic basis function from “Train Custom
LQR Agent”.
Set the dimensions and parameters required for your basis function.
n = 6;
w0 = 0.1*ones(0.5*(n+1)*n,1);
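The critic construction itself is not shown above. Following the rlRepresentation(basisFcn,W0,oaInfo) syntax for a critic that takes both observation and action inputs, a sketch is:
critic = rlRepresentation(@(x,u) computeQuadraticBasis(x,u,n), ...
    w0,{obsInfo,actInfo});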
Function to compute the quadratic basis from “Train Custom LQR Agent”.
function B = computeQuadraticBasis(x,u,n)
z = cat(1,x,u);
idx = 1;
for r = 1:n
for c = r:n
if idx == 1
B = z(r)*z(c);
else
B = cat(1,B,z(r)*z(c));
end
idx = idx + 1;
end
end
end
Input Arguments
net — Deep neural network for actor or critic
array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object
Deep neural network for actor or critic, specified as one of the following:
• Array of Layer objects
• layerGraph object
• DAGNetwork object
• SeriesNetwork object
For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Observation names, specified as a cell array of character vectors. The observation names are the
network input layer names you specify when you create net. The names in obsNames must be in the
same order as the observation specifications in obsInfo.
Example: {'observation'}
Action name, specified as a single-element cell array that contains a character vector. The action
name is the network layer name you specify when you create net. For critic networks, this layer is
the first layer of the action input path. For actors, this layer is the last layer of the action output path.
Example: {'action'}
Action specification, specified as a reinforcement learning spec object. You can extract actInfo from
an existing environment using getActionInfo. Or, you can construct the spec manually using a
spec command such as rlFiniteSetSpec or rlNumericSpec. This specification defines such
information about the action as the dimensions and name of the action signal.
For linear basis function representations, the action signal must be a scalar, a column vector, or a
discrete action.
Value table or Q table for critic, specified as an rlTable object. The learnable parameters of a table
representation are the elements of tab.
Custom basis function, specified as a function handle to a user-defined function. For a linear basis
function representation, the output of the representation is f = W'B, where W is a weight array and B
is the column vector returned by the custom basis function. The learnable parameters of a linear
basis function representation are the elements of W.
When creating:
• A critic representation with observation inputs only, your basis function must have the following
signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here obs1 to obsN are observations in the same order and with the same data type and
dimensions as the observation specifications in obsInfo.
• A critic representation with observation and action inputs, your basis function must have the
following signature.
B = myBasisFunction(obs1,obs2,...,obsN,act)
Here obs1 to obsN are observations in the same order and with the same data type and
dimensions as the observation specifications in the first element of oaInfo, and act has the same
data type and dimensions as the action specification in the second element of oaInfo.
• An actor representation, your basis function must have the following signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are observations in the same order and with the same data type and
dimensions as the observation specifications in obsInfo. The data types and dimensions of the
action specification in actInfo affect the data type and dimensions of f.
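For example, the following sketch is a valid basis function for a critic with one observation channel and one action channel (the function name and the choice of features are hypothetical):
function B = myBasisFunction(obs,act)
% Return a column vector of features built from the observation and action.
B = [obs(:); act(:); obs(:).^2; act(:).^2];
end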
Initial value for linear basis function weight array, W, specified as one of the following:
Observation and action specifications for creating linear basis function critic representations,
specified as the cell array {obsInfo,actInfo}.
Representation options, specified as an option set that you create with rlRepresentationOptions.
Available options include the optimizer used for training and the learning rate. See
rlRepresentationOptions for details.
Output Arguments
rep — Deep neural network representation
rlLayerRepresentation object
Version History
Introduced in R2019a
The following table shows some typical uses of the rlRepresentation function to create neural
network-based critics and actors, and how to update your code with one of the new objects instead.
The following table shows some typical uses of the rlRepresentation objects to express table-
based critics with discrete observation and action spaces, and how to update your code with one of
the new objects instead.
The following table shows some typical uses of the rlRepresentation function to create critics and
actors which use a custom basis function, and how to update your code with one of the new objects
instead. In the recommended function calls, the first input argument is a two-element cell array containing
both the handle to the custom basis function and the initial weight vector or matrix.
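As a hedged sketch only (the exact recommended call depends on the representation type you need), the quadratic-basis critic defined above might be recreated by passing such a two-element cell array as the first argument:
critic = rlQValueRepresentation({@(x,u) computeQuadraticBasis(x,u,n),w0}, ...
    obsInfo,actInfo);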
See Also
Functions
rlValueRepresentation | rlQValueRepresentation |
rlDeterministicActorRepresentation | rlStochasticActorRepresentation |
rlRepresentationOptions | getActionInfo | getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlSimulinkEnv
Create reinforcement learning environment using dynamic model implemented in Simulink
Syntax
env = rlSimulinkEnv(mdl,agentBlocks)
env = rlSimulinkEnv(mdl,agentBlocks,obsInfo,actInfo)
env = rlSimulinkEnv( ___ ,'UseFastRestart',fastRestartToggle)
Description
The rlSimulinkEnv function creates a reinforcement learning environment object from a Simulink
model. The environment object acts as an interface so that when you call sim or train, these functions
in turn call the Simulink model to generate experiences for the agents.
Examples
Create a Simulink environment using the trained agent and corresponding Simulink model from the
“Create Simulink Environment and Train Agent” example.
Create an environment for the rlwatertank model, which contains an RL Agent block. Since the
agent used by the block is already in the workspace, you do not need to pass the observation and
action specifications to create the environment.
env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent')
env =
SimulinkEnvWithAgent with properties:
Model : rlwatertank
Validate the environment by performing a short simulation for two sample times.
validateEnvironment(env)
You can now train and simulate the agent within the environment by using train and sim,
respectively.
For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.
mdl = 'rlSimplePendulumModel';
open_system(mdl)
Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.
The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.
obsInfo = rlNumericSpec([3 1])
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: [0×0 string]
Description: [0×0 string]
Dimension: [3 1]
DataType: "double"
The action is a scalar expressing the torque and can be one of three possible values, -2 Nm, 0 Nm and
2 Nm.
actInfo = rlFiniteSetSpec([-2 0 2])
actInfo =
rlFiniteSetSpec with properties:
You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';
Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
env.ResetFcn = @(in) setVariable(in,'theta0',randn,'Workspace',mdl)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on
Create an environment for the Simulink model from the example “Train Multiple Agents to Perform
Collaborative Task”.
Create an environment for the rlCollaborativeTask model, which has two agent blocks. Since the
agents used by the two blocks (agentA and agentB) are already in the workspace, you do not need
to pass their observation and action specifications to create the environment.
env = rlSimulinkEnv( ...
'rlCollaborativeTask', ...
["rlCollaborativeTask/Agent A","rlCollaborativeTask/Agent B"])
env =
SimulinkEnvWithAgent with properties:
Model : rlCollaborativeTask
AgentBlock : [
rlCollaborativeTask/Agent A
rlCollaborativeTask/Agent B
]
ResetFcn : []
UseFastRestart : on
You can now simulate or train the agents within the environment using sim or train, respectively.
Input Arguments
mdl — Simulink model name
string | character vector
Simulink model name, specified as a string or character vector. The model must contain at least one
RL Agent block.
If mdl contains a single RL Agent block, specify agentBlocks as a string or character vector
containing the block path.
If mdl contains multiple RL Agent blocks, specify agentBlocks as a string array, where each
element contains the path of one agent block.
mdl can contain RL Agent blocks whose path is not included in agentBlocks. Such agent blocks
behave as part of the environment, selecting actions based on their current policies. When you call
sim or train, the experiences of these agents are not returned and their policies are not updated.
The agent blocks can be inside of a model reference. For more information on configuring an agent
block for reinforcement learning, see RL Agent.
If mdl contains multiple agent blocks, specify obsInfo as a cell array, where each cell contains a
specification object or array of specification objects for the corresponding block in agentBlocks.
If mdl contains multiple agent blocks, specify actInfo as a cell array, where each cell contains a
specification object for the corresponding block in agentBlocks.
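For example, for a hypothetical model myModel with two agent blocks and per-agent specification objects obsInfoA, obsInfoB, actInfoA, and actInfoB:
blks = ["myModel/Agent A","myModel/Agent B"];
env = rlSimulinkEnv("myModel",blks,{obsInfoA,obsInfoB},{actInfoA,actInfoB});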
Option to toggle fast restart, specified as either 'on' or 'off'. Fast restart allows you to perform
iterative simulations without compiling a model or terminating the simulation each time.
For more information on fast restart, see “How Fast Restart Improves Iterative Simulations”
(Simulink).
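For example, to create the pendulum environment from the previous example with fast restart disabled:
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo,'UseFastRestart','off');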
Output Arguments
env — Reinforcement learning environment
SimulinkEnvWithAgent object
For more information on reinforcement learning environments, see “Create Simulink Reinforcement
Learning Environments”.
Version History
Introduced in R2019a
See Also
Functions
train | sim | getObservationInfo | getActionInfo | rlNumericSpec | rlFiniteSetSpec
Blocks
RL Agent
Topics
“Train DDPG Agent to Control Double Integrator System”
“Train DDPG Agent to Swing Up and Balance Pendulum”
“Train DDPG Agent to Swing Up and Balance Cart-Pole System”
“Train DDPG Agent to Swing Up and Balance Pendulum with Bus Signal”
“Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation”
“Train DDPG Agent for Adaptive Cruise Control”
“How Fast Restart Improves Iterative Simulations” (Simulink)
runEpisode
Package: rl.env
Syntax
output = runEpisode(env,policy)
output = runEpisode(env,agent)
output = runEpisode( ___ ,Name=Value)
Description
output = runEpisode(env,policy) runs a single simulation of the environment env against the
policy policy.
output = runEpisode(env,agent) runs a single simulation of the environment env against the
agent agent.
output = runEpisode( ___ ,Name=Value) specifies nondefault simulation options using one or
more name-value arguments.
Examples
Create a reinforcement learning environment and extract its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the Q-value function within the critic, use a neural network. Create a network as an
array of layer objects.
net = [...
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(2)
softmaxLayer];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'input' 4 features
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);
act = getAction(actor,{rand(obsInfo.Dimension)})
policy = rlStochasticActorPolicy(actor);
buffer = rlReplayMemory(obsInfo,actInfo);
Set up the environment for running multiple simulations. For this example, configure the training to
log any errors rather than send them to the command window.
setup(env,StopOnError="off")
Simulate multiple episodes using the environment and policy. After each episode, append the
experiences to the buffer. For this example, run 100 episodes.
for i = 1:100
output = runEpisode(env,policy,MaxSteps=300);
append(buffer,output.AgentData.Experiences)
end
cleanup(env)
Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.
batch = sample(buffer,10);
You can then learn from the sampled experiences and update the policy and actor.
Input Arguments
env — Reinforcement learning environment
environment object | ...
policy — Policy
policy object | array of policy objects
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlEpsilonGreedyPolicy
• rlMaxQPolicy
• rlStochasticActorPolicy
If env is a Simulink environment configured for multi-agent training, specify policy as an array of
policy objects. The order of the policies in the array must match the agent order used to create env.
For more information on a policy object, at the MATLAB command line, type help followed by the
policy object name.
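For example, to display help for the epsilon-greedy policy object:
help rlEpsilonGreedyPolicy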
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• rlMBPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.
If env is a Simulink environment configured for multi-agent training, specify agent as an array of
agent objects. The order of the agents in the array must match the agent order used to create env.
Function for processing experiences and updating the policy or agent based on each experience as it
occurs during the simulation, specified as a function handle with the following signature.
[updatedPolicy,updatedData] = myFcn(experience,episodeInfo,policy,data)
Here:
• experience is a structure that contains a single experience. For more information on the
structure fields, see output.Experiences.
• episodeInfo contains data about the current episode and corresponds to
output.EpisodeInfo.
• policy is the policy or agent object being simulated.
• data contains experience processing data. For more information, see ProcessExperienceData.
• updatedPolicy is the updated policy or agent.
• updatedData is the updated experience processing data, which is used as the data input when
processing the next experience.
Experience processing data, specified as any MATLAB data, such as an array or structure. Use this
data to pass additional parameters or information to the experience processing function.
You can also update this data within the experience processing function to use different parameters
when processing the next experience. The data values that you specify when you call runEpisode
are used to process the first experience in the simulation.
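As a minimal sketch, assuming the corresponding name-value argument of runEpisode is named ProcessExperienceFcn (the data argument, ProcessExperienceData, is described above), a processing function that appends each experience to a replay memory carried in data might look like the following. The function and field names are hypothetical.
function [policy,data] = myProcessExpFcn(experience,episodeInfo,policy,data)
% Append the experience to a replay memory stored in data and count steps.
append(data.Buffer,experience);
data.StepCount = data.StepCount + 1;
end
You would then pass the function and its initial data when running an episode:
out = runEpisode(env,policy, ...
    ProcessExperienceFcn=@myProcessExpFcn, ...
    ProcessExperienceData=struct("Buffer",buffer,"StepCount",0));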
Option to clean up the environment after the simulation, specified as true or false. When
CleanupPostSim is true, runEpisode calls cleanup(env) when the simulation ends.
To run multiple episodes without cleaning up the environment, set CleanupPostSim to false. You
can then call cleanup(env) after running your simulations.
If env is a SimulinkEnvWithAgent object and the associated Simulink model is configured to use
fast restart, then the model remains in a compiled state between simulations when CleanUpPostSim
is false.
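For example, the following sketch runs several episodes against a configured environment and cleans up only once at the end (policy and buffer as in the earlier example):
setup(env)
for i = 1:20
    out = runEpisode(env,policy,MaxSteps=200,CleanupPostSim=false);
    append(buffer,out.AgentData.Experiences)
end
cleanup(env)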
Option to log experiences for each policy or agent, specified as true or false. When
LogExperiences is true, the experiences of the policy or agent are logged in
output.Experiences.
Output Arguments
output — Simulation output
structure | structure array | rl.env.Future object
Simulation output, returned as a structure with the fields AgentData and SimulationInfo. When
you simulate multiple policies or agents, output is returned as a structure array.
Field Description
Experiences Logged experience of the policy or agent, returned as a
structure array. Each experience contains the following fields.
• Observation — Observation
• Action — Action taken
• NextObservation — Resulting next observation
• Reward — Corresponding reward
• IsDone — Termination signal
Time Simulation times of experiences, returned as a vector.
EpisodeInfo Episode information, returned as a structure with the following
fields.
• For MATLAB environments — Structure containing the field SimulationError. This structure
contains any errors that occurred during simulation.
• For Simulink environments — Simulink.SimulationOutput object containing simulation data.
Recorded data includes any signals and states that the model is configured to log, simulation
metadata, and any errors that occurred.
Tips
• You can speed up episode simulation by using parallel computing. To do so, use the setup
function and set the UseParallel argument to true.
setup(env,UseParallel=true)
Version History
Introduced in R2022a
See Also
setup | cleanup | reset
Topics
“Custom Training Loop with Simulink Action Noise”
sample
Package: rl.replay
Syntax
experience = sample(buffer,batchSize)
experience = sample(buffer,batchSize,Name=Value)
[experience,Mask] = sample(buffer,batchSize,Name=Value)
Description
experience = sample(buffer,batchSize) returns a mini-batch of N experiences from the
replay memory buffer, where N is specified using batchSize.
Examples
Define observation specifications for the environment. For this example, assume that the environment
has a single observation channel with three continuous signals in specified ranges.
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with two continuous signals in specified ranges.
buffer = rlReplayMemory(obsInfo,actInfo,20000);
Append a single experience to the buffer using a structure. Each experience contains the following
elements: current observation, action, next observation, reward, and is-done.
For this example, create an experience with random observation, action, and reward values. Indicate
that this experience is not a terminal condition by setting the IsDone value to 0.
exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;
append(buffer,exp);
You can also append a batch of experiences to the experience buffer using a structure array. For this
example, append a sequence of 100 random experiences, with the final experience representing a
terminal condition.
for i = 1:100
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;
append(buffer,expBatch);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.
miniBatch = sample(buffer,50);
You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive
experiences with a discount factor of 0.95.
horizonSample = sample(buffer,1,...
NStepHorizon=10,...
DiscountFactor=0.95);
• Observation and Action are the observation and action from the first experience in the
horizon.
• NextObservation and IsDone are the next observation and termination signal from the final
experience in the horizon.
• Reward is the cumulative reward across the horizon using the specified discount factor.
You can also sample a sequence of consecutive experiences. In this case, the structure fields contain
arrays with values for all sampled experiences.
sequenceSample = sample(buffer,1,...
SequenceLength=20);
Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.
buffer = rlReplayMemory(obsInfo,actInfo,5000);
for i = 1:50
exp(i).Observation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
exp(i).NextObservation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Reward = 10*rand(1);
exp(i).IsDone = 0;
end
append(buffer,exp);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.
miniBatch = sample(buffer,10);
Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object
If batchSize is greater than the current length of the buffer, then sample returns no experiences.
Sequence length, specified as a positive integer. For each batch element, sample up to
SequenceLength consecutive experiences. If a sampled experience has a nonzero IsDone value,
stop the sequence at that experience.
N-step horizon length, specified as a positive integer. For each batch element, sample up to
NStepHorizon consecutive experiences. If a sampled experience has a nonzero IsDone value, stop
the horizon at that experience. Return the following experience information based on the sampled
horizon.
• Observation and Action values from the first experience in the horizon
• NextObservation and IsDone values from the final experience in the horizon.
• Cumulative reward across the horizon using the specified discount factor, DiscountFactor.
Discount factor, specified as a nonnegative scalar less than or equal to one. When you sample a
horizon of experiences (NStepHorizon > 1), sample returns the cumulative reward R computed as
follows.
R = Σ_{i=1}^{N} γ^i R_i
Here, γ is the specified discount factor, R_i is the reward of the ith experience in the sampled horizon, and N is the number of experiences in the horizon.
Output Arguments
experience — Experiences sampled from the buffer
structure
Experiences sampled from the buffer, returned as a structure with the following fields.
Observation — Observation
cell array
Observation, returned as a cell array with length equal to the number of observation specifications
specified when creating the buffer. Each element of Observation contains a DO-by-batchSize-by-
SequenceLength array, where DO is the dimension of the corresponding observation specification.
Agent action, returned as a cell array with length equal to the number of action specifications
specified when creating the buffer. Each element of Action contains a DA-by-batchSize-by-
SequenceLength array, where DA is the dimension of the corresponding action specification.
Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-
SequenceLength array.
Next observation reached by taking the specified action from the observation, returned as a cell array
with the same format as Observation.
Sequence padding mask, returned as a logical array with length equal to SequenceLength. When
the sampled sequence length is less than SequenceLength, the data returned in experience is
padded. Each element of Mask is true for a real experience and false for a padded experience.
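For example, to discard padded steps when summing rewards over a sampled sequence (a sketch using the buffer from the example above):
[seq,mask] = sample(buffer,1,SequenceLength=20);
realRewards = seq.Reward(:,:,mask);   % keep rewards from real (non-padded) steps
seqReturn = sum(realRewards,"all");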
Version History
Introduced in R2022a
See Also
rlReplayMemory | append
setActor
Package: rl.agent
Syntax
agent = setActor(agent,actor)
Description
agent = setActor(agent,actor) updates the reinforcement learning agent, agent, to use the
specified actor object, actor.
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
actor = getActor(agent);
params = getLearnableParameters(actor)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
Set the parameter values of the actor to the new modified values.
actor = setLearnableParameters(actor,modifiedParams);
setActor(agent,actor);
getLearnableParameters(getActor(agent))
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
plot(layerGraph(modifiedActorNet))
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the actor and critic objects.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
agent — Reinforcement learning agent
rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent | rlSACAgent | rlPPOAgent |
rlTRPOAgent
Reinforcement learning agent that contains an actor, specified as one of the following:
• rlPGAgent object
• rlDDPGAgent object
• rlTD3Agent object
• rlACAgent object
• rlSACAgent object
• rlPPOAgent object
• rlTRPOAgent object
Note: agent is a handle object. Therefore, setActor updates agent whether or not you return agent as an output argument. For more information about handle objects, see “Handle Object Behavior”.
actor — Actor
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
The inputs and outputs of the approximation model in the actor (typically, a neural network) must
match the observation and action specifications of the original agent.
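For example, one way to obtain a compatible actor is to rebuild it from the agent's own model and the environment specification objects (a sketch assuming a DDPG agent and that obsInfo and actInfo are in the workspace):
oldActor = getActor(agent);
newActor = rlContinuousDeterministicActor(getModel(oldActor),obsInfo,actInfo);
agent = setActor(agent,newActor);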
Output Arguments
agent — Updated reinforcement learning agent
rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent | rlSACAgent | rlPPOAgent |
rlTRPOAgent
Updated agent, returned as an agent object. Note that agent is a handle object, so setActor updates its actor whether or not you return agent as an output argument. For more information about handle objects, see “Handle Object Behavior”.
Version History
Introduced in R2019a
See Also
getActor | getCritic | setCritic | getModel | setModel | getLearnableParameters |
setLearnableParameters
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
setCritic
Package: rl.agent
Syntax
agent = setCritic(agent,critic)
Description
agent = setCritic(agent,critic) updates the reinforcement learning agent, agent, to use the
specified critic object, critic.
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
critic = getCritic(agent);
params = getLearnableParameters(critic)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
Set the parameter values of the critic to the new modified values.
critic = setLearnableParameters(critic,modifiedParams);
setCritic(agent,critic);
getLearnableParameters(getCritic(agent))
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
plot(layerGraph(modifiedActorNet))
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the actor and critic objects.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
agent — Reinforcement learning agent
rlQAgent | rlSARSAAgent | rlDQNAgent | rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent
| rlSACAgent | rlPPOAgent | rlTRPOAgent
Reinforcement learning agent that contains a critic, specified as one of the following:
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent (when using a critic to estimate a baseline value function)
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
Note: agent is a handle object. Therefore, setCritic updates agent whether or not you return agent as an output argument. For more information about handle objects, see “Handle Object Behavior”.
critic — Critic
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object | two-
element row vector of rlQValueFunction objects
Output Arguments
agent — Updated reinforcement learning agent
rlQAgent | rlSARSAAgent | rlDQNAgent | rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent
| rlSACAgent | rlPPOAgent | rlTRPOAgent
Updated agent, returned as an agent object. Note that agent is a handle object, so setCritic updates its critic whether or not you return agent as an output argument. For more information about handle objects, see “Handle Object Behavior”.
Version History
Introduced in R2019a
See Also
getActor | getCritic | setActor | getModel | setModel | getLearnableParameters |
setLearnableParameters
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
setLearnableParameters
Package: rl.policy
Syntax
setLearnableParameters(agent,pars)
agent = setLearnableParameters(agent,pars)
newFcn = setLearnableParameters(oldFcn,pars)
newPol = setLearnableParameters(oldPol,pars)
Description
Agent
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
critic = getCritic(agent);
params = getLearnableParameters(critic)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
Set the parameter values of the critic to the new modified values.
critic = setLearnableParameters(critic,modifiedParams);
setCritic(agent,critic);
getLearnableParameters(getCritic(agent))
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
actor = getActor(agent);
params = getLearnableParameters(actor)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
Set the parameter values of the actor to the new modified values.
actor = setLearnableParameters(actor,modifiedParams);
setActor(agent,actor);
getLearnableParameters(getActor(agent))
Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.
Reinforcement learning policy, specified as one of the following policy objects.
• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy
Learnable parameter values, specified as a cell array. The parameters in
pars must be compatible with the structure and parameterization of the agent, function
approximator, or policy object passed as a first argument.
To obtain a cell array of learnable parameter values from an existing agent, function approximator, or
policy object , which you can then modify, use the getLearnableParameters function.
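For example, to scale every learnable parameter of a critic by 0.5 (a sketch using the critic from the earlier example):
vals = getLearnableParameters(critic);
vals = cellfun(@(w) 0.5*w,vals,"UniformOutput",false);  % modify each parameter array
critic = setLearnableParameters(critic,vals);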
Output Arguments
newFcn — New actor or critic object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
New actor or critic object, returned as a function object of the same type as oldFcn. Apart from the
learnable parameter values, newFcn is the same as oldFcn.
New reinforcement learning policy, returned as a policy object of the same type as oldPol. Apart
from the learnable parameter values, newPol is the same as oldPol.
Updated agent, returned as an agent object. Note that agent is a handle object, so setLearnableParameters updates its parameters whether or not you return agent as an output argument. For more information about handle objects, see “Handle Object Behavior”.
Version History
Introduced in R2019a
Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, setLearnableParameters now uses function approximator
objects instead.
See Also
getLearnableParameters | getActor | getCritic | setActor | setCritic
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
setModel
Package: rl.function
Syntax
newFcnAppx = setModel(oldFcnAppx,model)
Description
newFcnAppx = setModel(oldFcnAppx,model) returns a new actor or critic function object,
newFcnAppx, with the same configuration as the original function object, oldFcnAppx, and the
computational model specified in model.
Examples
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the actor and critic objects.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
oldFcnAppx — Original actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
Computational model, specified as one of the following.
• Deep neural network defined as an array of Layer objects, a layerGraph object, a DAGNetwork
object, or a dlnetwork object. The input and output layers of model must have the same names
and dimensions as the network returned by getModel for the same function object. Here, the
output layer is the layer immediately before the output loss layer.
• rlTable object with the same dimensions as the table model defined in oldFcnAppx.
• 1-by-2 cell array that contains the function handle for a custom basis function and the basis
function parameters.
When specifying a new model, you must use the same type of model as the one already defined in oldFcnAppx.
Note For agents with more than one critic, such as TD3 and SAC agents, you must call setModel for
each critic individually, rather than calling setModel for the array returned by
getCritic.
critics = getCritic(myTD3Agent);
% Modify critic networks.
critics(1) = setModel(critics(1),criticNet1);
critics(2) = setModel(critics(2),criticNet2);
myTD3Agent = setCritic(myTD3Agent,critics);
Output Arguments
newFcnAppx — New actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
New actor or critic function object, returned as a function object of the same type as oldFcnAppx.
Apart from the new computational model, newFcnAppx is the same as oldFcnAppx.
Version History
Introduced in R2020b
Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, setModel now uses function approximator objects instead.
See Also
getActor | setActor | getCritic | setCritic | getModel
Topics
“Create Policies and Value Functions”
setup
Package: rl.env
Syntax
setup(env)
setup(env,Name=Value)
setup(lgr)
Description
When you define a custom training loop for reinforcement learning, you can simulate an agent or
policy against an environment using the runEpisode function. Use the setup function to configure
the environment for running simulations using multiple calls to runEpisode.
Also use setup to initialize a FileLogger or MonitorLogger object before logging data within a
custom training loop.
Environment Objects
setup(env) sets up the specified reinforcement learning environment for running multiple
simulations using runEpisode.
setup(lgr) sets up the specified data logger object. Setup tasks may include setting up a
visualization, or creating directories for logging to file.
Examples
Create a reinforcement learning environment and extract its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the Q-value function within the critic, use a neural network. Create a network as an
array of layer objects.
net = [...
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(2)
softmaxLayer];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'input' 4 features
Create a discrete categorical actor, a stochastic policy based on the actor, and a replay memory buffer in which to store the experiences.
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);
policy = rlStochasticActorPolicy(actor);
buffer = rlReplayMemory(obsInfo,actInfo);
Set up the environment for running multiple simulations. For this example, configure the training to
log any errors rather than send them to the command window.
setup(env,StopOnError="off")
Simulate multiple episodes using the environment and policy. After each episode, append the
experiences to the buffer. For this example, run 100 episodes.
for i = 1:100
output = runEpisode(env,policy,MaxSteps=300);
append(buffer,output.AgentData.Experiences)
end
Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.
batch = sample(buffer,10);
You can then learn from the sampled experiences and update the policy and actor.
This example shows how to log data to disk when training an agent using a custom training loop.
Create a file logger object.
flgr = rlDataLogger();
Set up the logger object. This operation initializes the object, performing setup tasks such as, for
example, creating the directory to save the data files.
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10
    % Store random numbers in memory, grouped in two context variables.
    store(flgr,"Context1",rand,iter);
    store(flgr,"Context2",rand,iter);
    % Write the data for this iteration to a MAT file.
    write(flgr);
end
Clean up the logger object. This operation performs cleanup tasks, such as writing to file any data still in memory.
cleanup(flgr);
Input Arguments
env — Reinforcement learning environment
rlFunctionEnv object | SimulinkEnvWithAgent object | rlNeuralNetworkEnvironment object
| rlMDPEnv object | ...
Option to stop an episode when an error occurs, specified as one of the following:
• "on" — Stop the episode when an error occurs and generate an error message in the MATLAB
command window.
• "off" — Log errors in the SimulationInfo output of runEpisode.
Option for using parallel simulations, specified as a logical value. Using parallel computing allows
the usage of multiple cores, processors, computer clusters, or cloud resources to speed up simulation.
When you set UseParallel to true, the output of a subsequent call to runEpisode is an
rl.env.Future object, which supports deferred evaluation of the simulation.
Function to run on each worker before running an episode, specified as a handle to a function
with no input arguments. Use this function to perform any preprocessing required before running an
episode.
Function to run on each worker when cleaning up the environment, specified as a handle to a
function with no input arguments. Use this function to clean up the workspace or perform other
processing after calling runEpisode.
Option to send model and workspace variables to parallel workers, specified as "on" or "off". When
the option is "on", the client sends variables used in models and defined in the base MATLAB
workspace to the workers.
Additional files to attach to the parallel pool before running an episode, specified as a string or string
array.
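Putting these options together, a parallel setup call might look like the following sketch; the option names SetupFcn, TransferBaseWorkspaceVariables, and AttachedFiles are assumed here, and the helper values are hypothetical.
setup(env, ...
    UseParallel=true, ...
    SetupFcn=@() rng(0,"twister"), ...          % hypothetical per-worker setup
    TransferBaseWorkspaceVariables="on", ...
    AttachedFiles="myHelperFunctions.m")        % hypothetical attached file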
Version History
Introduced in R2022a
See Also
Functions
runEpisode | cleanup | reset | store | write
Objects
rlFunctionEnv | rlMDPEnv | SimulinkEnvWithAgent | rlNeuralNetworkEnvironment |
FileLogger | MonitorLogger
Topics
“Custom Training Loop with Simulink Action Noise”
sim
Package: rl.env
Syntax
experience = sim(env,agents)
experience = sim(agents,env)
experience = sim( ___ ,simOpts)
Description
experience = sim(env,agents) simulates one or more reinforcement learning agents within an
environment, using default simulation options.
experience = sim( ___ ,simOpts) uses the simulation options object simOpts. Use simulation options to
specify parameters such as the number of steps per simulation or the number of simulations to run.
Use this syntax after any of the input arguments in the previous syntaxes.
Examples
Simulate a reinforcement learning environment with an agent configured for that environment. For
this example, load an environment and agent that are already configured. The environment is a
discrete cart-pole environment created with rlPredefinedEnv. The agent is a policy gradient
(rlPGAgent) agent. For more information about the environment and agent used in this example, see
“Train PG Agent to Balance Cart-Pole System”.
env =
CartPoleDiscreteAction with properties:
Gravity: 9.8000
MassCart: 1
MassPole: 0.1000
Length: 0.5000
MaxForce: 10
Ts: 0.0200
ThetaThresholdRadians: 0.2094
XThreshold: 2.4000
RewardForNotFalling: 1
PenaltyForFalling: -5
agent
agent =
rlPGAgent with properties:
Typically, you train the agent using train and simulate the environment to test the performance of
the trained agent. For this example, simulate the environment using the agent you loaded. Configure
simulation options, specifying that the simulation run for 100 steps.
simOpts = rlSimulationOptions('MaxSteps',100);
For the predefined cart-pole environment used in this example, you can use plot to generate a
visualization of the cart-pole system. When you simulate the environment, this plot updates
automatically so that you can watch the system evolve during the simulation.
plot(env)
experience = sim(env,agent,simOpts)
The output structure experience records the observations collected from the environment, the
action and reward, and other data collected during the simulation. Each field contains a timeseries
object or a structure of timeseries data objects. For instance, experience.Action is a
timeseries containing the action imposed on the cart-pole system by the agent at each step of the
simulation.
experience.Action
Simulate an environment created for the Simulink® model used in the example “Train Multiple
Agents to Perform Collaborative Task”, using the agents trained in that example.
Create an environment for the rlCollaborativeTask Simulink® model, which has two agent
blocks. Since the agents used by the two blocks (agentA and agentB) are already in the workspace,
you do not need to pass their observation and action specifications to create the environment.
env = rlSimulinkEnv( ...
'rlCollaborativeTask', ...
["rlCollaborativeTask/Agent A","rlCollaborativeTask/Agent B"]);
Load the parameters that are needed by the rlCollaborativeTask Simulink® model to run.
rlCollaborativeTaskParams
Simulate the agents against the environment, saving the experiences in xpr.
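Assuming the trained agents agentA and agentB are in the workspace, as in the referenced example, simulate them with sim:
xpr = sim(env,[agentA agentB]);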
subplot(2,1,1); plot(xpr(1).Action.forces)
subplot(2,1,2); plot(xpr(2).Action.forces)
Input Arguments
env — Environment
reinforcement learning environment object
Environment in which the agents act, specified as one of the following kinds of reinforcement
learning environment object:
• A custom Simulink environment you create using rlSimulinkEnv. This kind of environment
supports training multiple agents at the same time.
When env is a Simulink environment, calling sim compiles and simulates the model associated with
the environment.
agents — Agents
reinforcement learning agent object | array of agent objects
If env is a multi-agent environment created with rlSimulinkEnv, specify agents as an array. The
order of the agents in the array must match the agent order used to create env. Multi-agent
simulation is not supported for MATLAB environments.
For more information about how to create and configure agents for reinforcement learning, see
“Reinforcement Learning Agents”.
Output Arguments
experience — Simulation results
structure | structure array
Simulation results, returned as a structure or structure array. The number of rows in the array is
equal to the number of simulations specified by the NumSimulations option of
rlSimulationOptions. The number of columns in the array is the number of agents. The fields of
each experience structure are as follows.
Observation — Observations
structure
Observations collected from the environment, returned as a structure with fields corresponding to the
observations specified in the environment. Each field contains a timeseries of length N + 1, where
N is the number of simulation steps.
To obtain the current observation and the next observation for a given simulation step, use code such
as the following, assuming one of the fields of Observation is obs1.
Obs = getSamples(experience.Observation.obs1,1:N);
NextObs = getSamples(experience.Observation.obs1,2:N+1);
These values can be useful if you are writing your own training algorithm using sim to generate
experiences for training.
Action — Actions
structure
Actions computed by the agent, returned as a structure with fields corresponding to the action
signals specified in the environment. Each field contains a timeseries of length N, where N is the
number of simulation steps.
Reward — Rewards
timeseries
Reward at each step in the simulation, returned as a timeseries of length N, where N is the
number of simulation steps.
Flag indicating termination of the episode, returned as a timeseries of a scalar logical signal. This
flag is set at each step by the environment, according to conditions you specify for episode
termination when you configure the environment. When the environment sets this flag to 1,
simulation terminates.
• For MATLAB environments, a structure containing the field SimulationError. This structure
contains any errors that occurred during simulation.
• For Simulink environments, a Simulink.SimulationOutput object containing simulation data.
Recorded data includes any signals and states that the model is configured to log, simulation
metadata, and any errors that occurred.
Version History
Introduced in R2019a
See Also
train | rlSimulationOptions
Topics
“Train Reinforcement Learning Agents”
store
Package: rl.logging
Syntax
store(lgr,context,data,iter)
Description
store(lgr,context,data,iter) stores data into lgr internal memory, grouped in a variable
named context and associated with iteration iter.
Examples
This example shows how to log data to disk when training an agent using a custom training loop.
flgr = rlDataLogger();
Set up the logger object. This operation initializes the object, performing setup tasks such as, for
example, creating the directory to save the data files.
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10
    % Store random numbers in memory, grouped in two context variables.
    store(flgr,"Context1",rand,iter);
    store(flgr,"Context2",rand,iter);
    % Write the data for this iteration to a MAT file.
    write(flgr);
end
Clean up the logger object. This operation performs cleanup tasks, such as writing to file any data still in memory.
cleanup(flgr);
Input Arguments
lgr — Data logger object
FileLogger object | MonitorLogger object | ...
Name of the saved variable, specified as either a string or character array. All data associated with
iteration iter and the name context is vertically concatenated in a single MATLAB variable named
context. This variable is then written to the logger target (either a MAT file or a
trainingProgressMonitor object) when write is invoked for lgr.
Data to be saved, specified as any fundamental MATLAB datatype. Data associated with the same
iteration and the same context name is grouped in a single variable.
Iteration number, specified as a positive integer. When store is executed multiple times with the
same iteration number, data is appended to the memory entry for that iteration. This memory entry,
which can contain several context variables, is then written to the logger target as a single unit (for
example as a single MAT file) when write is invoked for lgr.
Version History
Introduced in R2022b
See Also
Functions
rlDataLogger | train | write | setup | cleanup
Objects
FileLogger | MonitorLogger
Topics
“Log Training Data To Disk”
“Function Handles”
“Handle Object Behavior”
train
Package: rl.agent
Syntax
trainStats = train(env,agents)
trainStats = train(agents,env)
trainStats = train(agents,env,prevTrainStats)
trainStats = train( ___ ,trainOpts)
Description
trainStats = train(env,agents) trains one or more reinforcement learning agents within a
specified environment, using default training options. Although agents is an input argument, after
each training episode, train updates the parameters of each agent specified in agents to maximize
their expected long-term reward from the environment. This is possible because each agent is a
handle object. When training terminates, agents reflects the state of each agent at the end of the
final training episode.
trainStats = train( ___ ,trainOpts) trains agents within env, using the training options
object trainOpts. Use training options to specify training parameters such as the criteria for
terminating training, when to save agents, the maximum number of episodes to train, and the
maximum number of steps per episode. Use this syntax after any of the input arguments in the
previous syntaxes.
Examples
Train the agent configured in the “Train PG Agent to Balance Cart-Pole System” example, within the
corresponding environment. The observation from the environment is a vector containing the position
and velocity of a cart, as well as the angular position and velocity of the pole. The action is a scalar
with two possible elements (a force of either -10 or 10 Newtons applied to a cart).
Load the file containing the environment and a PG agent already configured for it.
load RLTrainExample.mat
Specify some training parameters using rlTrainingOptions. These parameters include the
maximum number of episodes to train, the maximum steps per episode, and the conditions for
terminating training. For this example, use a maximum of 1000 episodes and 500 steps per episode.
Instruct the training to stop when the average reward over the previous five episodes reaches 500.
Create a default options set and use dot notation to change some of the parameter values.
trainOpts = rlTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 500;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 500;
trainOpts.ScoreAveragingWindowLength = 5;
During training, the train command can save candidate agents that give good results. Further
configure the training options to save an agent when the episode reward exceeds 500. Save the agent
to a folder called savedAgents.
trainOpts.SaveAgentCriteria = "EpisodeReward";
trainOpts.SaveAgentValue = 500;
trainOpts.SaveAgentDirectory = "savedAgents";
Finally, turn off the command-line display. Turn on the Reinforcement Learning Episode Manager so
you can observe the training progress visually.
trainOpts.Verbose = false;
trainOpts.Plots = "training-progress";
You are now ready to train the PG agent. For the predefined cart-pole environment used in this
example, you can use plot to generate a visualization of the cart-pole system.
plot(env)
When you run this example, both this visualization and the Reinforcement Learning Episode Manager
update with each training episode. Place them side by side on your screen to observe the progress,
and train the agent. (This computation can take 20 minutes or more.)
trainingInfo = train(agent,env,trainOpts);
Episode Manager shows that the training successfully reaches the termination condition of a reward
of 500 averaged over the previous five episodes. At each training episode, train updates agent with
the parameters learned in the previous episode. When training terminates, you can simulate the
environment with the trained agent to evaluate its performance. The environment plot updates during
simulation as it did during training.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
During training, train saves to disk any agents that meet the condition specified with trainOpts.SaveAgentCriteria and trainOpts.SaveAgentValue. To test the performance of any of those agents, you can load the data from the data files in the folder you specified using trainOpts.SaveAgentDirectory, and simulate the environment with that agent.
This example shows how to set up a multi-agent training session on a Simulink® environment. In the
example, you train two agents to collaboratively perform the task of moving an object.
The environment in this example is a frictionless two-dimensional surface containing elements represented by circles. A target object C is represented by the blue circle with a radius of 2 m, and robots A (red) and B (green) are represented by smaller circles with radii of 1 m each. The robots attempt to move object C outside a circular ring of radius 8 m by applying forces through collision.
All elements within the environment have mass and obey Newton's laws of motion. In addition,
contact forces between the elements and the environment boundaries are modeled as spring and
mass damper systems. The elements can move on the surface through the application of externally
applied forces in the X and Y directions. There is no motion in the third dimension and the total
energy of the system is conserved.
Set the random seed and create the set of parameters required for this example.
rng(10)
rlCollaborativeTaskParams
mdl = "rlCollaborativeTask";
open_system(mdl)
• The 2-dimensional space is bounded from –12 m to 12 m in both the X and Y directions.
• The contact spring stiffness and damping values are 100 N/m and 0.1 N/m/s, respectively.
• The agents share the same observations, which include the positions and velocities of A, B, and C, and the action values from the last time step.
• The simulation terminates when object C moves outside the circular ring.
• At each time step, the agents receive the following reward:
r_A = r_{global} + r_{local,A}
r_B = r_{global} + r_{local,B}
r_{global} = 0.001 d_C
r_{local,A} = -0.005 d_{AC} - 0.008 u_A^2
r_{local,B} = -0.005 d_{BC} - 0.008 u_B^2
Here, d_C is the distance of object C from the ring center, d_AC and d_BC are the distances between object C and robots A and B, respectively, and u_A and u_B are the action values of robots A and B.
Environment
To create a multi-agent environment, specify the block paths of the agents using a string array. Also,
specify the observation and action specification objects using cell arrays. The order of the
specification objects in the cell array must match the order specified in the block path array. When
agents are available in the MATLAB workspace at the time of environment creation, the observation
and action specification arrays are optional. For more information on creating multi-agent
environments, see rlSimulinkEnv.
Create the I/O specifications for the environment. In this example, the agents are homogeneous and
have the same I/O specifications.
% Number of observations
numObs = 16;
% Number of actions
numAct = 2;
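A minimal sketch of how the specifications and the multi-agent environment can be created from these counts follows; the action limits and the agent block paths are assumptions shown only for illustration.

% Observation and action specifications (action limits are assumptions)
obsInfo = rlNumericSpec([numObs 1]);
actInfo = rlNumericSpec([numAct 1],LowerLimit=-1,UpperLimit=1);

% Block paths of the two RL Agent blocks (paths are assumptions)
blks = mdl + ["/Agent A","/Agent B"];

% Create the multi-agent Simulink environment
env = rlSimulinkEnv(mdl,blks,{obsInfo,obsInfo},{actInfo,actInfo});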
Specify a reset function for the environment. The reset function resetRobots ensures that the
robots start from random initial positions at the beginning of each episode.
Agents
This example uses two Proximal Policy Optimization (PPO) agents with continuous action spaces. The
agents apply external forces on the robots that result in motion. To learn more about PPO agents, see
“Proximal Policy Optimization Agents”.
The agents collect experiences until the experience horizon (600 steps) is reached. After trajectory
completion, the agents learn from mini-batches of 300 experiences. An objective function clip factor
of 0.2 is used to improve training stability and a discount factor of 0.99 is used to encourage long-
term rewards.
agentOptions = rlPPOAgentOptions(...
"ExperienceHorizon",600,...
"ClipFactor",0.2,...
"EntropyLossWeight",0.01,...
"MiniBatchSize",300,...
"NumEpoch",4,...
"AdvantageEstimateMethod","gae",...
"GAEFactor",0.95,...
"SampleTime",Ts,...
"DiscountFactor",0.99);
agentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
agentOptions.CriticOptimizerOptions.LearnRate = 1e-4;
Create the agents using the default agent creation syntax. For more information see rlPPOAgent.
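A minimal sketch of this step is shown below; it assumes the obsInfo and actInfo specification objects and the agentOptions object defined above, and the variable names agentA and agentB match the training commands used later.

% Create two PPO agents with default networks built from the specifications
agentA = rlPPOAgent(obsInfo,actInfo,agentOptions);
agentB = rlPPOAgent(obsInfo,actInfo,agentOptions);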
Training
To train multiple agents, you can pass an array of agents to the train function. The order of agents
in the array must match the order of agent block paths specified during environment creation. Doing
so ensures that the agent objects are linked to their appropriate I/O interfaces in the environment.
You can train multiple agents in a decentralized or centralized manner. In decentralized training,
agents collect their own set of experiences during the episodes and learn independently from those
experiences. In centralized training, the agents share the collected experiences and learn from them
together. The actor and critic functions are synchronized between the agents after trajectory
completion.
To configure a multi-agent training, you can create agent groups and specify a learning strategy for
each group through the rlMultiAgentTrainingOptions object. Each agent group may contain
unique agent indices, and the learning strategy can be "centralized" or "decentralized". For example, you can use the command shown below to configure training for three agent groups with different learning strategies, where the agents with indices [1,2] and [3,4] learn in a centralized manner while agent 5 learns in a decentralized manner.
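A configuration consistent with that description might look like the following sketch; the exact group composition is illustrative.

trainOpts = rlMultiAgentTrainingOptions(...
    "AgentGroups",{[1,2],[3,4],5},...
    "LearningStrategy",["centralized","centralized","decentralized"]);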
You can perform decentralized or centralized training by running one of the following sections using
the Run Section button.
1. Decentralized Training
• Automatically assign agent groups using the AgentGroups=auto option. This allocates each
agent in a separate group.
• Specify the "decentralized" learning strategy.
• Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.
• Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or
more.
trainOpts = rlMultiAgentTrainingOptions(...
"AgentGroups","auto",...
"LearningStrategy","decentralized",...
"MaxEpisodes",1000,...
"MaxStepsPerEpisode",600,...
"ScoreAveragingWindowLength",30,...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",-10);
Train the agents using the train function. Training can take several hours to complete depending on
the available computational power. To save time, load the MAT file decentralizedAgents.mat
which contains a set of pretrained agents. To train the agents yourself, set doTraining to true.
doTraining = false;
if doTraining
decentralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
load("decentralizedAgents.mat");
end
The following figure shows a snapshot of decentralized training progress. You can expect different
results due to randomness in the training process.
2. Centralized Training
• Allocate both agents (with indices 1 and 2) in a single group. You can do this by specifying the
agent indices in the "AgentGroups" option.
• Specify the "centralized" learning strategy.
• Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.
• Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or
more.
trainOpts = rlMultiAgentTrainingOptions(...
"AgentGroups",{[1,2]},...
"LearningStrategy","centralized",...
"MaxEpisodes",1000,...
"MaxStepsPerEpisode",600,...
"ScoreAveragingWindowLength",30,...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",-10);
Train the agents using the train function. Training can take several hours to complete depending on
the available computational power. To save time, load the MAT file centralizedAgents.mat which
contains a set of pretrained agents. To train the agents yourself, set doTraining to true.
doTraining = false;
if doTraining
centralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
load("centralizedAgents.mat");
end
The following figure shows a snapshot of centralized training progress. You can expect different
results due to randomness in the training process.
Simulation
Once the training is finished, simulate the trained agents with the environment.
simOptions = rlSimulationOptions("MaxSteps",300);
exp = sim(env,[agentA agentB],simOptions);
This example shows how to resume training of a Q-learning agent using existing training data. For more information on these agents, see “Q-Learning Agents” and “SARSA Agents”.
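A minimal sketch of creating such a grid world environment, assuming the predefined basic grid world (an assumption for illustration):

% Create the predefined grid world environment (assumed environment)
env = rlPredefinedEnv("BasicGridWorld");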
To specify that the initial state of the agent is selected at random from the available states at the beginning of each episode, create a reset function that returns a random initial state number.
x0 = [1:12 15:17 19:22 24];
env.ResetFcn = @() x0(randi(numel(x0)));
To create a Q-learning agent, first create a Q table using the observation and action specifications from the grid world environment, then use the table to create a Q-value function critic.
qTable = rlTable(getObservationInfo(env),getActionInfo(env));
qVf = rlQValueFunction(qTable,getObservationInfo(env),getActionInfo(env));
Next, create a Q-learning agent using this critic and configure the epsilon-greedy exploration, the critic learning rate, and the discount factor. For more information on creating Q-learning agents, see rlQAgent and rlQAgentOptions.
agentOpts = rlQAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon = 0.2;
agentOpts.CriticOptimizerOptions.LearnRate = 0.2;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-3;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 1e-3;
agentOpts.DiscountFactor = 1;
qAgent = rlQAgent(qVf,agentOpts);
To train the agent, first specify the training options. For more information, see rlTrainingOptions.
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 200;
trainOpts.MaxEpisodes = 1e6;
trainOpts.Plots = "none";
trainOpts.Verbose = false;
trainOpts.StopTrainingCriteria = "EpisodeCount";
trainOpts.StopTrainingValue = 100;
trainOpts.ScoreAveragingWindowLength = 30;
Train the Q-learning agent using the train function. Training can take several minutes to complete.
trainingStats = train(qAgent,env,trainOpts);
trainingStats.EpisodeIndex(end)
ans = 100
To continue training past the original stopping criterion, increase the episode count at which training stops by modifying the training options stored in the result object.

trainingStats.TrainingOptions.StopTrainingValue = 300;
Resume the training using the training data that exists in trainingStats.
trainingStats = train(qAgent,env,trainingStats);
trainingStats.EpisodeIndex(end)
ans = 300
figure()
plot(trainingStats.EpisodeIndex,trainingStats.EpisodeReward)
title('Episode Reward')
xlabel('EpisodeIndex')
ylabel('EpisodeReward')
Display the learned Q-value table from the agent critic.

qAgentFinalQ = getLearnableParameters(getCritic(qAgent));
qAgentFinalQ{1}
To validate the training results, simulate the agent in the training environment.
Before running the simulation, visualize the environment and configure the visualization to maintain a
trace of the agent states.
plot(env)
env.ResetFcn = @() 2;
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
sim(qAgent,env)
Input Arguments
agents — Agents
agent object | array of agent objects
Agents to train, specified as a reinforcement learning agent object or as an array of agent objects.

If env is a multi-agent environment created with rlSimulinkEnv, specify agents as an array. The order of the agents in the array must match the agent order used to create env. Multi-agent training is not supported for MATLAB environments.
Note train updates the agents as training progresses. This is possible because each agent is a handle object. To preserve the original agent parameters for later use, save the agent to a MAT-file (if you copy the agent into a new variable, the new variable will also always point to the most recent agent version with updated parameters). For more information about handle objects, see “Handle Object Behavior”.
Note When training terminates, agents reflects the state of each agent at the end of the final
training episode. The rewards obtained by the final agents are not necessarily the highest achieved
during the training process, due to continuous exploration. To save agents during training, create an
rlTrainingOptions object specifying the SaveAgentCriteria and SaveAgentValue properties and
pass it to train as a trainOpts argument.
For more information about how to create and configure agents for reinforcement learning, see
“Reinforcement Learning Agents”.
env — Environment
reinforcement learning environment object
Environment in which the agents act, specified as a reinforcement learning environment object.
When env is a Simulink environment, calling train compiles and simulates the model associated
with the environment.
prevTrainStats — Training episode data from previous training session
rlTrainingResult object | array of rlTrainingResult objects

Use this argument to resume training from the exact point at which it stopped. This starts the training from the last values of the agent parameters and training results object obtained after the previous train function call. prevTrainStats contains, as one of its properties, the rlTrainingOptions object or the rlMultiAgentTrainingOptions object specifying the training option set. Therefore, to restart the training with updated training options, first change the training options in prevTrainStats using dot notation. If the maximum number of episodes was already reached in the previous training session, you must increase the maximum number of episodes.

For details about the rlTrainingResult object properties, see the trainStats output argument.
Output Arguments
trainStats — Training episode data
rlTrainingResult object | array of rlTrainingResult objects
Training episode data, returned as an rlTrainingResult object or an array of rlTrainingResult objects (one per agent), with the following properties.

EpisodeIndex — Episode numbers, returned as the column vector [1;2;…;N], where N is the number of episodes in the training run. This vector is useful if you want to plot the evolution of other quantities from episode to episode.

EpisodeReward — Reward for each episode, returned in a column vector of length N. Each entry contains the reward for the corresponding episode.

EpisodeSteps — Number of steps in each episode, returned in a column vector of length N. Each entry contains the number of steps in the corresponding episode.

AverageReward — Average reward over the averaging window specified in trainOpts, returned as a column vector of length N. Each entry contains the average reward computed at the end of the corresponding episode.
TotalAgentSteps — Total number of agent steps in training, returned as a column vector of length N. Each entry contains the cumulative sum of the entries in EpisodeSteps up to that point.

EpisodeQ0 — Critic estimate of the long-term reward using the current agent and the environment initial conditions, returned as a column vector of length N. Each entry is the critic estimate (Q0) for the agent of the corresponding episode. This field is present only for agents that have critics, such as rlDDPGAgent and rlDQNAgent.

SimulationInfo — Information collected during the simulations performed for training, returned as:

• For training in MATLAB environments, a structure containing the field SimulationError. This field is a column vector with one entry per episode. When the StopOnError option of rlTrainingOptions is "off", each entry contains any errors that occurred during the corresponding episode.
• For training in Simulink environments, a vector of Simulink.SimulationOutput objects containing simulation data recorded during the corresponding episode. Recorded data for an episode includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred during the corresponding episode.
TrainingOptions — Training options used for the training session, returned as:

• For a single agent, an rlTrainingOptions object. For more information, see the rlTrainingOptions reference page.
• For multiple agents, an rlMultiAgentTrainingOptions object. For more information, see the rlMultiAgentTrainingOptions reference page.
Tips
• train updates the agents as training progresses. To preserve the original agent parameters for
later use, save the agents to a MAT-file.
• By default, calling train opens the Reinforcement Learning Episode Manager, which lets you
visualize the progress of the training. The Episode Manager plot shows the reward for each
episode, a running average reward value, and the critic estimate Q0 (for agents that have critics).
The Episode Manager also displays various episode and training statistics. To turn off the
Reinforcement Learning Episode Manager, set the Plots option of trainOpts to "none".
• If you use a predefined environment for which there is a visualization, you can use plot(env) to
visualize the environment. If you call plot(env) before training, then the visualization updates
during training to allow you to visualize the progress of each episode. (For custom environments,
you must implement your own plot method.)
• Training terminates when the conditions specified in trainOpts are satisfied. To terminate
training in progress, in the Reinforcement Learning Episode Manager, click Stop Training.
Because train updates the agent at each episode, you can resume training by calling
train(agent,env,trainOpts) again, without losing the trained parameters learned during the
first call to train.
• During training, you can save candidate agents that meet conditions you specify with trainOpts.
For instance, you can save any agent whose episode reward exceeds a certain value, even if the
overall condition for terminating training is not yet satisfied. train stores saved agents in a MAT-
file in the folder you specify with trainOpts. Saved agents can be useful, for instance, to allow
you to test candidate agents generated during a long-running training process. For details about
saving criteria and saving location, see rlTrainingOptions.
Algorithms
In general, train performs the following iterative steps:

1 Initialize the agent.
2 For each episode:
  a Reset the environment and obtain the initial observation s.
  b Compute the initial action a = μ(s), and set the current observation and action accordingly.
  c While the episode is not finished:
    i Step the environment with action a to obtain the next observation s' and the reward r.
    ii Learn from the experience set (s,a,r,s').
    iii Compute the next action a' = μ(s').
    iv Update the current action with the next action (a←a') and update the current observation with the next observation (s←s').
    v Break if the episode termination conditions defined in the environment are met.
3 If the training termination condition defined by trainOpts is met, terminate training. Otherwise, begin the next episode.

The specifics of how train performs these computations depend on your configuration of the agent and environment. For instance, resetting the environment at the start of each episode can include randomizing initial state values, if you configure your environment to do so.
Version History
Introduced in R2019a
Starting in R2022a, train returns an object or an array of objects instead of a structure. The properties of the object match the fields of the structure returned in previous versions, so code that relies on dot notation continues to work. For example, consider the following training command.
trainStats = train(agent,env,trainOptions);
When training terminates, either because a termination condition is reached or because you click
Stop Training in the Reinforcement Learning Episode Manager, trainStats is returned as an
rlTrainingResult object.
The rlTrainingResult object contains the same training statistics previously returned in a structure, along with data to correctly recreate the training scenario and update the Episode Manager. You can use trainStats as the third argument for another train call, which (when executed with the same agents and environment) causes training to resume from the exact point at which it stopped.
Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.
To train in parallel, set the UseParallel and ParallelizationOptions options in the option set
trainOpts. Parallel training is not supported for multi-agent environments. For more information,
see rlTrainingOptions.
See Also
rlTrainingOptions | sim | rlMultiAgentTrainingOptions
Topics
“Train Reinforcement Learning Agents”
validateEnvironment
Package: rl.env
Syntax
validateEnvironment(env)
Description
validateEnvironment(env) validates a reinforcement learning environment. This function is
useful when:
• You are using a custom environment for which you supplied your own step and reset functions,
such as an environment created using rlCreateEnvTemplate.
• You are using an environment you created from a Simulink model using rlSimulinkEnv.
validateEnvironment resets the environment, generates an initial observation and action, and
simulates the environment for one or two steps (see “Algorithms”). If there are no
errors during these operations, validation is successful, and validateEnvironment returns no
result. If errors occur, these errors appear in the MATLAB command window. Use the errors to
determine what to change in your observation specification, action specification, custom functions, or
Simulink model.
Examples
Create and validate an environment for the rlwatertank model, which represents a control system containing a reinforcement learning agent. (For details about this model, see “Create Simulink Environment and Train Agent”.)
open_system('rlwatertank')
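The commands that create the environment object validated below do not appear here; a minimal sketch follows, in which the specification dimensions are assumptions.

% Observation and action specifications (dimensions are assumptions)
obsInfo = rlNumericSpec([3 1]);
actInfo = rlNumericSpec([1 1]);

% Create the Simulink environment interface for the RL Agent block
env = rlSimulinkEnv("rlwatertank","rlwatertank/RL Agent",obsInfo,actInfo);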
Now use validateEnvironment to check whether the model is configured correctly.
validateEnvironment(env)
validateEnvironment attempts to compile the model, initialize the environment and the agent,
and simulate the model. In this case, the RL Agent block is configured to use an agent called agent,
but no such variable exists in the MATLAB® workspace. Thus, the function returns an error
indicating the problem.
Create an appropriate agent for this system using the commands detailed in the “Create Simulink
Environment and Train Agent” example. In this case, load the agent from the
rlWaterTankDDPGAgent.mat file.
load rlWaterTankDDPGAgent
validateEnvironment(env)
Input Arguments
env — Environment to validate
environment object

Environment to validate, specified as a reinforcement learning environment object, such as an environment created using rlCreateEnvTemplate or rlSimulinkEnv.
Algorithms
validateEnvironment works by running a brief simulation of the environment and making sure
that the generated signals match the observation and action specifications you provided when you
created the environment.
MATLAB Environments
1 Reset the environment using the reset function associated with the environment.
2 Obtain the first observation and check whether it is consistent with the dimension, data type, and
range of values in the observation specification.
3 Generate a test action based on the dimension, data type, and range of values in the action
specification.
4 Simulate the environment for one step using the generated action and the step function
associated with the environment.
5 Obtain the new observation signal and check whether it is consistent with the dimension, data
type, and range of values in the observation specification.
Simulink Environments
For Simulink environments, validateEnvironment performs analogous checks by compiling and briefly simulating the model, without dirtying the model, and leaves all model parameters in the state they were in when you called the function.
Version History
Introduced in R2019a
See Also
rlCreateEnvTemplate | rlSimulinkEnv | rlFunctionEnv
Topics
“Create Simulink Environment and Train Agent”
“Create Custom MATLAB Environment from Template”
write
Package: rl.logging
Transfer stored data from the internal logger memory to the logging target
Syntax
write(lgr)
Description
write(lgr) transfers all stored logging contexts into the logger's target (either a MAT file or a
trainingProgressMonitor object).
Examples
This example shows how to log data to disk when training an agent using a custom training loop.

Create a FileLogger object using rlDataLogger.

flgr = rlDataLogger();
Set up the logger object. This operation initializes the object, performing setup tasks such as creating the directory in which to save the data files.
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
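A minimal sketch of such a loop follows; the stored values and grouping are illustrative.

for iter = 1:10
    % Store sample data in the logger memory under two context names,
    % associated with the current iteration number (illustrative values)
    store(flgr,"Context1",rand,iter);
    store(flgr,"Context2",rand,iter);
    % Transfer the stored data from memory to a MAT file
    write(flgr);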
end
Clean up the logger object. This operation performs cleanup tasks such as writing to file any data still in memory.
cleanup(flgr);
Input Arguments
lgr — Data logger object
FileLogger object | MonitorLogger object | ...
Version History
Introduced in R2022b
See Also
Functions
rlDataLogger | train | setup | store | cleanup
Objects
FileLogger | MonitorLogger
Topics
“Log Training Data To Disk”
“Function Handles”
“Handle Object Behavior”
3
Objects
quadraticLayer
Quadratic layer for actor or critic network
Description
A quadratic layer takes an input vector and outputs a vector of quadratic monomials constructed from the input elements. This layer is useful when you need a layer whose output is a quadratic function of its inputs, for example, to recreate the structure of quadratic value functions such as those used in LQR controller design.
For example, consider an input vector U = [u1 u2 u3]. For this input, a quadratic layer gives the
output Y = [u1*u1 u1*u2 u2*u2 u1*u3 u2*u3 u3*u3]. For an example that uses a
QuadraticLayer, see “Train DDPG Agent to Control Double Integrator System”.
Note The QuadraticLayer layer does not support inputs coming directly or indirectly from a
featureInputLayer or sequenceInputLayer.
Creation
Syntax
qLayer = quadraticLayer
qLayer = quadraticLayer(Name,Value)
Description

qLayer = quadraticLayer creates a quadratic layer with default property values.

qLayer = quadraticLayer(Name,Value) sets properties using name-value pairs. For example, quadraticLayer('Name','quadlayer') creates a quadratic layer and sets its Name property.
Properties
Name — Name of layer
'quadratic' (default) | character vector
Name of layer, specified as a character vector. To include a layer in a layer graph, you must specify a
nonempty unique layer name. If you train a series network with this layer and Name is set to '', then
the software automatically assigns a name to the layer at training time.
Description — Description of layer
character vector

Description of layer, specified as a character vector. When you create the quadratic layer, you can use this property to give it a description that helps you identify its purpose.
Examples
Create a quadratic layer that converts an input vector U into a vector of quadratic monomials
constructed from binary combinations of the elements of U.
qLayer = quadraticLayer
qLayer =
QuadraticLayer with properties:
Name: 'quadratic'
Learnable Parameters
No properties.
State Parameters
No properties.
Confirm that the layer produces the expected output. For instance, for U = [u1 u2 u3], the expected output is [u1*u1 u1*u2 u2*u2 u1*u3 u2*u3 u3*u3].

predict(qLayer,[1 2 3])

ans = 1×6

     1     2     4     3     6     9
You can incorporate qLayer into an actor network or critic network for reinforcement learning.
Version History
Introduced in R2019a
See Also
scalingLayer | softplusLayer
Topics
“Train DDPG Agent to Control Double Integrator System”
“Create Policies and Value Functions”
FileLogger
Log reinforcement learning training data to MAT files
Description
Use a FileLogger object to log data to MAT files, within the train function or inside a custom
training loop. To log data when using the train function, specify appropriate callback functions in
FileLogger, as shown in the examples. These callbacks are executed at different stages of training,
for example, EpisodeFinishedFcn is executed after the completion of an episode. The output of a
callback function is a structure containing the data to log at that stage of training.
Note FileLogger is a handle object. If you assign an existing FileLogger object to a new
FileLogger object, both the new object and the original one refer to the same underlying object in
memory. To preserve the original object parameters for later use, save the object to a MAT-file. For
more information about handle objects, see “Handle Object Behavior”.
Creation
Create a FileLogger object using rlDataLogger without any input arguments.
Properties
LoggingOptions — Object containing a set of logging options
MATFileLoggingOptions object (default)
Name or fully qualified path of the logging directory, specified as a string or a character array. This is
the name of the directory where the MAT files containing the logged data are saved. As a default, a
subdirectory called logs is created in the current folder during setup and files are saved there
during training.
Rule to name the MAT files, specified as a string or a character array. For example, the naming rule
"episode<id>" results in the file names episode001.mat, episode002.mat and so on.
MAT file version used to save the data, specified as a string or character array. The default is "-v7". For more information, see “MAT-File Versions”.
Option to use compression when saving data to a MAT file, specified as a logical variable. The default
is true. For more information, see “MAT-File Versions”.
Frequency for writing data to disk, specified as a positive integer. It is the number of episodes after
which data is saved to disk. The default is 1.
Maximum number of episodes, specified as a positive integer. When using train, the value is
automatically initialized. Set this value when using the logger object in a custom training loop. The
default is 500.
Callback to log data after episode completion, specified as a function handle object. The specified
function must return a structure containing the data to log, such as experiences, simulation
information, or initial observations.
Example: @myEpisodeLoggingFcn
Callback to log data after training step completion within an episode, specified as a function handle object. The specified function must return a structure containing the data to log, such as the state of the agent's exploration policy.

For multi-agent training, AgentStepFinishedFcn can be a cell array of function handles with as many elements as the number of agent groups.
Note Logging data using the AgentStepFinishedFcn callback is not supported when training
agents in parallel with the train function.
Example: @myAgentStepLoggingFcn
Callback to log data after completion of the learn subroutine, specified as a function handle object.
The specified function must return a structure containing the data to log, such as the training losses
of the actor and critic networks, or, for a model-based agent, the environment model training losses.
For multi-agent training, AgentLearnFinishedFcn can be a cell array of function handles with as many elements as the number of agent groups.
Example: @myLearnLoggingFcn
Object Functions
setup Set up reinforcement learning environment or initialize data logger object
cleanup Clean up reinforcement learning environment or data logger object
Examples
This example shows how to log data to disk when using train.
logger = rlDataLogger();
logger.LoggingOptions.LoggingDirectory = "myDataLog";
Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object. For a related example, see “Log
Training Data To Disk”.
logger.EpisodeFinishedFcn = @myEpisodeFinishedFcn;
logger.AgentStepFinishedFcn = @myAgentStepFinishedFcn;
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;
To train the agent, you can now call train, passing logger as an argument such as in the following
command.
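A sketch of such a command is shown below; it assumes an agent, an environment, and a training options object already exist in the workspace.

% Pass the logger object to train (agent, env, and trainOpts are assumed to exist)
trainResult = train(agent,env,trainOpts,Logger=logger);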
While the training progresses, data will be logged to the specified directory, according to the rule
specified in the FileNameRule property of logger.LoggingOptions.
logger.LoggingOptions.FileNameRule
ans =
"loggedData<id>"
This example shows how to log data to disk when training an agent using a custom training loop.
Create a FileLogger object using rlDataLogger, then set up the logger object. The setup operation initializes the object, performing setup tasks such as creating the directory in which to save the data files.

flgr = rlDataLogger();
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10
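    % Illustrative loop body (reconstruction sketch): store sample data
    % under two context names for the current iteration, then transfer
    % it to a MAT file
    store(flgr,"Context1",rand,iter);
    store(flgr,"Context2",rand,iter);
    write(flgr);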
end
Clean up the logger object. This operation performs cleanup tasks such as writing to file any data still in memory.
cleanup(flgr);
Limitations
• Logging data using the AgentStepFinishedFcn callback is not supported when training agents
in parallel with the train function.
Version History
Introduced in R2022b
See Also
Functions
rlDataLogger | train | setup | store | write | cleanup
Objects
MonitorLogger
Topics
“Log Training Data To Disk”
“Function Handles”
“Handle Object Behavior”
MonitorLogger
Log reinforcement learning training data to monitor window
Description
Use a MonitorLogger object to log data to a monitor window, within the train function or inside a
custom training loop. To log data when using the train function, specify appropriate callback
functions in MonitorLogger, as shown in the examples. These callbacks are executed at different
stages of training, for example, EpisodeFinishedFcn is executed after the completion of an
episode. The output of a callback function is a structure containing the data to log at that stage of
training.
Note MonitorLogger is a handle object. If you assign an existing MonitorLogger object to a new
MonitorLogger object, both the new object and the original one refer to the same underlying object
in memory. To preserve the original object parameters for later use, save the object to a MAT-file. For
more information about handle objects, see “Handle Object Behavior”.
Creation
Create a MonitorLogger object using rlDataLogger specifying a trainingProgressMonitor
object as input argument.
Properties
LoggingOptions — Object containing a set of logging options
MonitorLoggingOptions object (default)
Frequency for writing data to the monitor window, specified as a positive integer. It is the number of
episodes after which data is transmitted to the trainingProgressMonitor object. The default is 1.
Maximum number of episodes, specified as a positive integer. When using train, the value is
automatically initialized. Set this value when using the logger object in a custom training loop. The
default is 500.
Callback to log data after episode completion, specified as a function handle object. The specified
function must return a structure containing the data to log, such as experiences, simulation
information, or initial observations.
Example: @myEpisodeLoggingFcn
Callback to log data after training step completion within an episode, specified as a function handle object. The specified function must return a structure containing the data to log, such as the state of the agent's exploration policy.

For multi-agent training, AgentStepFinishedFcn can be a cell array of function handles with as many elements as the number of agent groups.
Note Logging data using the AgentStepFinishedFcn callback is not supported when training
agents in parallel with the train function.
Example: @myAgentStepLoggingFcn
Callback to log data after completion of the learn subroutine, specified as a function handle object.
The specified function must return a structure containing the data to log, such as the training losses
of the actor and critic networks, or, for a model-based agent, the environment model training losses.
For multi-agent training, AgentLearnFinishedFcn can be a cell array of function handles with as many elements as the number of agent groups.
Example: @myLearnLoggingFcn
Object Functions
setup Set up reinforcement learning environment or initialize data logger object
cleanup Clean up reinforcement learning environment or data logger object
Examples
This example shows how to log and visualize data to the window of a trainingProgressMonitor
object when using train.
Create a trainingProgressMonitor object. Creating the object also opens a window associated
with the object.
monitor = trainingProgressMonitor();
logger = rlDataLogger(monitor);
Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object.
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;
To train the agent, you can now call train, passing logger as an argument such as in the following
command.
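A sketch of such a command is shown below; it assumes an agent, an environment, and a training options object already exist in the workspace.

% Pass the logger object to train (agent, env, and trainOpts are assumed to exist)
trainResult = train(agent,env,trainOpts,Logger=logger);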
While the training progresses, data will be logged to the training monitor object, and visualized in the
associated window.
Note that only scalar data can be logged with a monitor logger object.
Define a logging function that logs data periodically at the completion of the learning subroutine.

function dataToLog = myAgentLearnFinishedFcn(data)

    if mod(data.AgentLearnCount, 2) == 0
        dataToLog.ActorLoss = data.ActorLoss;
        dataToLog.CriticLoss = data.CriticLoss;
    else
        dataToLog = [];
    end
end
Limitations
• Only scalar data is supported when logging data with a MonitorLogger object. The structure
returned by the callback functions must contain fields with scalar data.
• Resuming of training from a previous training result is not supported when logging data with a
MonitorLogger object.
Version History
Introduced in R2022b
See Also
Functions
rlDataLogger | train | setup | store | write | cleanup
Objects
FileLogger | trainingProgressMonitor
Topics
“Log Training Data To Disk”
“Monitor Custom Training Loop Progress”
“Function Handles”
“Handle Object Behavior”
rlACAgent
Actor-critic reinforcement learning agent
Description
Actor-critic (AC) agents implement actor-critic algorithms such as A2C and A3C, which are model-
free, online, on-policy reinforcement learning methods. The actor-critic agent optimizes the policy
(actor) directly and uses a critic to estimate the return or future rewards. The action space can be
either discrete or continuous.
For more information, see “Actor-Critic Agents”. For more information on the different types of
reinforcement learning agents, see “Reinforcement Learning Agents”.
Creation
Syntax
agent = rlACAgent(observationInfo,actionInfo)
agent = rlACAgent(observationInfo,actionInfo,initOpts)
agent = rlACAgent(actor,critic)
agent = rlACAgent( ___ ,agentOptions)
Description

agent = rlACAgent(observationInfo,actionInfo) creates an actor-critic agent for an environment with the given observation and action specifications, using default initialization options. The actor and critic in the agent use default deep neural networks built from the specifications.

agent = rlACAgent(observationInfo,actionInfo,initOpts) creates an actor-critic agent using the specified agent initialization options object, initOpts.

agent = rlACAgent(actor,critic) creates an actor-critic agent with the specified actor and critic, using the default options for the agent.
agent = rlACAgent( ___ ,agentOptions) creates an actor-critic agent and sets the
AgentOptions property to the agentOptions input argument. Use this syntax after any of the input
arguments in the previous syntaxes.
Input Arguments
actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object

Actor that implements the policy, specified as an rlDiscreteCategoricalActor object (for a discrete action space) or an rlContinuousGaussianActor object (for a continuous action space). For more information on creating actor approximators, see “Create Policies and Value Functions”.
critic — Critic
rlValueFunction object
Critic that estimates the discounted long-term reward, specified as an rlValueFunction object. For
more information on creating critic approximators, see “Create Policies and Value Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
ActionInfo — Action specifications
rlFiniteSetSpec object | rlNumericSpec object

For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.
For a continuous action space, you must specify actionInfo as an rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
Option to use exploration policy when selecting actions, specified as one of the following logical values.
• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.
Note This option affects only simulation and deployment; it does not affect training.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create an actor-critic agent from the environment observation and action specifications.
agent = rlACAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from random observations.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
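The commands that load the environment and extract its specifications are sketched below, assuming the corresponding predefined environment keyword.

% Create the predefined environment (keyword is an assumption)
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);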
Create an agent initialization option object, specifying that each hidden fully connected layer in the network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.
rng(0)
Create an actor-critic agent from the environment observation and action specifications.
agent = rlACAgent(obsInfo,actInfo,initOpts);
Extract the deep neural networks from both the agent actor and critic.
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
Display the layers of the critic network, and verify that each hidden fully connected layer has 128 neurons.
criticNet.Layers
ans =
11x1 Layer array with layers:
plot(layerGraph(actorNet))
plot(layerGraph(criticNet))
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a discrete action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DQN Agent to
Balance Cart-Pole System”. This environment has a four-dimensional observation vector (cart position
and velocity, pole angle, and pole angle derivative), and a scalar action with two possible elements (a
force of either -10 or +10 N applied on the cart).
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env)

obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "CartPole States"
Description: "x, dx, theta, dtheta"
Dimension: [4 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.
rng(0)
For actor-critic agents, the critic estimates a value function, therefore it must take the observation
signal as input and return a scalar value.
To approximate the value function within the critic, use a deep neural network. Define the network as
an array of layer objects. Get the dimensions of the observation space from the environment
specification objects.
cnet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object, and display the number of weights.
cnet = dlnetwork(cnet);
summary(cnet)
Initialized: true
Inputs:
1 'input' 4 features
Create the critic. Actor-critic agents use an rlValueFunction object to implement the critic.
critic = rlValueFunction(cnet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.1411
Create a deep neural network to be used as approximation model within the actor. For actor-critic
agents, the actor executes a stochastic policy, which for discrete action spaces is implemented by a
discrete categorical actor. In this case the network must take the observation signal as input and
return a probability for each action. Therefore the output layer must have as many elements as the
number of possible actions.
anet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
Convert the network to a dlnetwork object, and display the number of weights.
anet = dlnetwork(anet);
summary(anet)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor. Actor-critic agents use an rlDiscreteCategoricalActor object to implement the actor for discrete action spaces.

actor = rlDiscreteCategoricalActor(anet,obsInfo,actInfo);

Create the AC agent using the actor and the critic.

agent = rlACAgent(actor,critic)

agent =
  rlACAgent with properties:
Specify some options for the agent, including training options for the actor and critic.
agent.AgentOptions.NumStepsToLookAhead=32;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold=1;
To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Store these limits so you can easily access them later.
% Make sure action space upper and lower limits are finite
actInfo.LowerLimit=-2;
actInfo.UpperLimit=2;
The actor and critic networks are initialized randomly. You can ensure reproducibility by fixing the
seed of the random generator.
rng(0)
For actor-critic agents, the critic estimates a value function, therefore it must take the observation
signal as input and return a scalar value. To approximate the value function within the critic, use a
deep neural network.
Define the network as an array of layer objects, and get the dimensions of the observation space from
the environment specification object.
cNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object and display the number of weights.
cNet = dlnetwork(cNet);
summary(cNet)
Initialized: true
Inputs:
1 'input' 2 features
Create the critic using cNet. Actor-critic agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(cNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.0969
To approximate the policy within the actor, use a deep neural network. For actor-critic agents, the
actor executes a stochastic policy, which for continuous action spaces is implemented by a continuous
Gaussian actor. In this case the network must take the observation signal as input and return both a
mean value and a standard deviation value for each action. Therefore it must have two output layers
(one for the mean values, the other for the standard deviation values), each having as many elements
as the dimension of the action space.
Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input and output
layers, so you can later explicitly associate them with the appropriate channel.
% Input path
inPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netObsIn")
fullyConnectedLayer(prod(actInfo.Dimension),Name="infc")
];
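% Mean and standard-deviation output paths and layer graph assembly
% (sketch: the layer names match the connectLayers calls below; other
% choices, such as scaling to the action range, are illustrative
% assumptions)
meanPath = [
    tanhLayer(Name="tanhMean")
    fullyConnectedLayer(prod(actInfo.Dimension))
    scalingLayer(Name="scale",Scale=actInfo.UpperLimit)
    ];
sdevPath = [
    tanhLayer(Name="tanhStdv")
    fullyConnectedLayer(prod(actInfo.Dimension))
    softplusLayer(Name="splus")
    ];
aNet = layerGraph(inPath);
aNet = addLayers(aNet,meanPath);
aNet = addLayers(aNet,sdevPath);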
% Connect layers
aNet = connectLayers(aNet,"infc","tanhMean/in");
aNet = connectLayers(aNet,"infc","tanhStdv/in");
% Plot network
plot(aNet)
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
aNet = dlnetwork(aNet);
summary(aNet)
Initialized: true
Inputs:
1 'netObsIn' 2 features
Create the actor. Actor-critic agents use an rlContinuousGaussianActor object to implement the
actor for continuous action spaces.
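A sketch of this step follows, using the input and output layer names from the network sketch above.

% Create the Gaussian actor (layer names follow the sketch above)
actor = rlContinuousGaussianActor(aNet,obsInfo,actInfo, ...
    ActionMeanOutputNames="scale", ...
    ActionStandardDeviationOutputNames="splus", ...
    ObservationInputNames="netObsIn");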
getAction(actor,{rand(obsInfo.Dimension)})
Create the AC agent using the actor and the critic.

agent = rlACAgent(actor,critic);
Specify agent options, including training options for its actor and critic.
agent.AgentOptions.NumStepsToLookAhead = 32;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold=1;
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
For this example, load the predefined environment used for the “Train DQN Agent to Balance Cart-
Pole System” example. This environment has a four-dimensional observation vector (cart position and
velocity, pole angle, and pole angle derivative), and a scalar action with two possible elements (a
force of either -10 or +10 N applied on the cart).
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
For actor-critic agents, the critic estimates a value function, therefore it must take the observation
signal as input and return a scalar value.
To approximate the value function within the critic, use a recurrent deep neural network. Define the
network as an array of layer objects, and get the dimensions of the observation space from the
environment specification object. To create a recurrent network, use a sequenceInputLayer as the
input layer and include an lstmLayer as one of the other network layers.
cNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(10)
reluLayer
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
cNet = dlnetwork(cNet);
summary(cNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
Create the critic using cNet. Actor-critic agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(cNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.0344
Since the critic has a recurrent network, the actor must also use a recurrent network. For actor-critic agents, the actor executes a stochastic policy, which for discrete action spaces is implemented
by a discrete categorical actor. In this case the network must take the observation signal as input and
return a probability for each action. Therefore the output layer must have as many elements as the
number of possible actions.
aNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(20)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
Convert the network to a dlnetwork object and display the number of weights.
aNet = dlnetwork(aNet);
summary(aNet)
Initialized: true
Number of learnables: 2k
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
Create the actor using aNet. Actor-critic agents use an rlDiscreteCategoricalActor object to
implement the actor for discrete action spaces.
actor = rlDiscreteCategoricalActor(aNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options, and create an AC agent using the actor, the critic, and the agent options
object. Since the agent uses recurrent neural networks, NumStepsToLookAhead is treated as the
training trajectory length.
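A sketch of this step follows; the specific NumStepsToLookAhead value is illustrative.

agentOpts = rlACAgentOptions(NumStepsToLookAhead=32);
agent = rlACAgent(actor,critic,agentOpts);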
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
To train an agent using the asynchronous advantage actor-critic (A3C) method, you must set the
agent and parallel training options appropriately.
When creating the AC agent, set the NumStepsToLookAhead value to be greater than 1. Common
values are 64 and 128.
agentOpts = rlACAgentOptions(NumStepsToLookAhead=64);
Use agentOpts when creating your agent. Alternatively, create your agent first and then modify its
options, including the actor and critic options later using dot notation.
Create a training options object with parallel training enabled, and set the parallelization mode to asynchronous.

trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.Mode = "async";
Configure the workers to return gradient data to the host. Also, set the number of steps before the
workers send data back to the host to match the number of steps to look ahead.
trainOpts.ParallelizationOptions.DataToSendFromWorkers = ...
"gradients";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = ...
agentOpts.NumStepsToLookAhead;
For an example on asynchronous advantage actor-critic agent training, see “Train AC Agent to
Balance Cart-Pole System Using Parallel Computing”.
Tips
• For continuous action spaces, the rlACAgent object does not enforce the constraints set by the
action specification, so you must enforce action space constraints within the environment.
Version History
Introduced in R2019a
See Also
rlAgentInitializationOptions | rlACAgentOptions | rlValueFunction |
rlDiscreteCategoricalActor | rlContinuousGaussianActor | Deep Network Designer
Topics
“Actor-Critic Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlACAgentOptions
Options for AC agent
Description
Use an rlACAgentOptions object to specify options for creating actor-critic (AC) agents. To create
an actor-critic agent, use rlACAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlACAgentOptions
opt = rlACAgentOptions(Name,Value)
Description
opt = rlACAgentOptions creates a default option set for an AC agent. You can modify the object
properties using dot notation.

opt = rlACAgentOptions(Name,Value) creates the option set opt and sets its properties using
one or more name-value arguments.
Properties
NumStepsToLookAhead — Number of steps ahead
32 (default) | positive integer
Number of steps the agent interacts with the environment before learning from its experience,
specified as a positive integer. When the agent uses a recurrent neural network,
NumStepsToLookAhead is treated as the training trajectory length.
Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.
When gradients are computed during training, an additional gradient component is computed for
minimizing this loss function.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlACAgent Actor-critic reinforcement learning agent
Examples
Create an AC agent options object, specifying the discount factor.

opt = rlACAgentOptions('DiscountFactor',0.95)
opt =
rlACAgentOptions with properties:
NumStepsToLookAhead: 32
EntropyLossWeight: 0
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Version History
Introduced in R2019a
• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.AgentOptions.UseDeterministicExploitation = true;

• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.AgentOptions.UseDeterministicExploitation = false;

You can obtain the same behaviors using the UseExplorationPolicy property of the agent instead.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.UseExplorationPolicy = false;

• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.UseExplorationPolicy = true;

A value of 32 for the NumStepsToLookAhead property should work better than 1 for most
environments. If you have MATLAB R2020b or a later version and you want to reproduce how
rlACAgent behaved in versions prior to R2020b, set this value to 1.
See Also
Topics
“Actor-Critic Agents”
rlAdditiveNoisePolicy
Policy object to generate continuous noisy actions for custom training loops
Description
This object implements an additive noise policy, which returns continuous deterministic actions with
added noise, given an input observation. You can create an rlAdditiveNoisePolicy object from an
rlContinuousDeterministicActor or extract it from an rlDDPGAgent or rlTD3Agent. You can
then train the policy object using a custom training loop. If UseNoisyAction is set to 0 the policy
does not explore. This object is not compatible with generatePolicyBlock and
generatePolicyFunction. For more information on policies and value functions, see “Create
Policies and Value Functions”.
Creation
Syntax
policy = rlAdditiveNoisePolicy(actor)
policy = rlAdditiveNoisePolicy(actor,NoiseType=noiseType)
Description
policy = rlAdditiveNoisePolicy(actor) creates the additive noise policy object policy from
the continuous deterministic actor actor. It also sets the Actor property of policy to the input
argument actor.

policy = rlAdditiveNoisePolicy(actor,NoiseType=noiseType) also specifies the type of
noise distribution for the policy.
Properties
Actor — Continuous deterministic actor
rlContinuousDeterministicActor object
Noise type, specified as either "gaussian" (default, Gaussian noise) or "ou" (Ornstein-Uhlenbeck
noise). For more information on noise models, see “Noise Models” on page 3-365.
Example: "ou"
Option to enable noise decay, specified as a logical value: either true (default, enabling noise decay)
or false (disabling noise decay).
Example: false
Option to enable noisy actions, specified as a logical value: either true (default, adding noise to
actions, which helps exploration) or false (no noise is added to the actions). When noise is not
added to the actions the policy is deterministic and therefore it does not explore.
Example: false
Action specifications, specified as an rlNumericSpec object. This object defines the properties of the
environment action channel, such as its dimensions, data type, and name. Note that the name of the
action channel specified in actionInfo (if any) is not used.
Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.
Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2
Object Functions
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
Examples
Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.
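A minimal sketch of these specification objects, consistent with the four- and two-element dimensions used in the rest of this example:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);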
Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.
To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation and action spaces from
the environment specification objects.
layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
To check your actor, use getAction to return the action from a random observation, using the
current network weights.

act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
0.4013
0.0578
Create an additive noise policy object from actor.

policy = rlAdditiveNoisePolicy(actor)
policy =
rlAdditiveNoisePolicy with properties:
You can access the policy options using dot notation. For example, change the upper and lower limits
of the distribution.
policy.NoiseOptions.LowerLimit = -3;
policy.NoiseOptions.UpperLimit = 3;
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
0.1878
-0.1645
You can now train the policy with a custom training loop and then deploy it to your application.
Create observation and action specification objects. For this example, define the observation and
action spaces as continuous three- and one-dimensional spaces, respectively.
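A minimal sketch of these specification objects, consistent with the three- and one-element dimensions used in the rest of this example:

obsInfo = rlNumericSpec([3 1]);
actInfo = rlNumericSpec([1 1]);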
Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.
To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation and action spaces from
the environment specification objects.
layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(9)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Number of learnables: 46
Inputs:
1 'input' 3 features
Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
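This extract omits the actor check that produces the following output; a sketch of that call, using a random observation:

act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}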
ans = single
-0.2535
Create a policy object from actor, specifying an Ornstein-Uhlenbeck probability distribution for the
noise.
policy = rlAdditiveNoisePolicy(actor,NoiseType="ou")
policy =
rlAdditiveNoisePolicy with properties:
You can access the policy options using dot notation. For example, change the standard deviation of
the distribution.
policy.NoiseOptions.StandardDeviation = 0.6;
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = -0.1625
You can now train the policy with a custom training loop and then deploy it to your application.
Version History
Introduced in R2022a
See Also
Functions
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlStochasticActorPolicy | rlTD3Agent | rlDDPGAgent
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”
rlAgentInitializationOptions
Options for initializing reinforcement learning agents
Description
Use the rlAgentInitializationOptions object to specify initialization options for an agent. To
create an agent, use the specific agent creation function, such as rlACAgent.
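For instance, a minimal sketch of creating a default agent from environment specifications using an initialization options object (the environment env and the option value are assumptions for illustration):

% Get the observation and action specifications from an existing environment.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Initialize all agent networks with 128 units in each hidden layer.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
agent = rlACAgent(obsInfo,actInfo,initOpts);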
Creation
Syntax
initOpts = rlAgentInitializationOptions
initOpts = rlAgentInitializationOptions(Name,Value)
Description

initOpts = rlAgentInitializationOptions returns a default options object for initializing a
reinforcement learning agent. You can modify the object properties using dot notation.

initOpts = rlAgentInitializationOptions(Name,Value) creates the initialization options
object initOpts and sets its properties using one or more name-value arguments.
Properties
NumHiddenUnit — Number of units in each hidden fully connected layer
256 (default) | positive integer
Number of units in each hidden fully connected layer of the agent networks, except for the fully
connected layer just before the network output, specified as a positive integer. The value you set also
applies to any LSTM layers.
Example: 'NumHiddenUnit',64
If you set UseRNN to true, during agent creation the software inserts a recurrent LSTM layer with
the output mode set to sequence in the output path of the agent networks. Policy gradient and actor-
critic agents do not support recurrent neural networks. For more information on LSTM, see “Long
Short-Term Memory Networks”.
Example: 'UseRNN',true
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
rlTRPOAgent Trust region policy optimization reinforcement learning agent
Examples
Create an agent initialization options object, specifying the number of hidden neurons and use of a
recurrent neural network.
initOpts = rlAgentInitializationOptions('NumHiddenUnit',64,'UseRNN',true)
initOpts =
rlAgentInitializationOptions with properties:
NumHiddenUnit: 64
UseRNN: 1
You can modify the options using dot notation. For example, set the number of hidden units to 128.
initOpts.NumHiddenUnit = 128
initOpts =
rlAgentInitializationOptions with properties:
NumHiddenUnit: 128
UseRNN: 1
Version History
Introduced in R2020b
See Also
getActionInfo | getObservationInfo
Topics
“Reinforcement Learning Agents”
rlContinuousDeterministicActor
Deterministic actor with a continuous action space for reinforcement learning agents
Description
This object implements a function approximator to be used as a deterministic actor within a
reinforcement learning agent with a continuous action space. A continuous deterministic actor takes
an environment state as input and returns as output the action that maximizes the expected
discounted cumulative long-term reward, thereby implementing a deterministic policy. After you
create an rlContinuousDeterministicActor object, use it to create a suitable agent, such as
rlDDPGAgent. For more information on creating representations, see “Create Policies and Value
Functions”.
Creation
Syntax
actor = rlContinuousDeterministicActor(net,observationInfo,actionInfo)
actor = rlContinuousDeterministicActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames)
actor = rlContinuousDeterministicActor({basisFcn,W0},observationInfo,
actionInfo)
Description

actor = rlContinuousDeterministicActor(net,observationInfo,actionInfo) creates a
continuous deterministic actor object using the deep neural network net as underlying approximator,
and sets the ObservationInfo and ActionInfo properties of actor to the observationInfo and
actionInfo input arguments, respectively.

Note actor does not enforce constraints set by the action specification; therefore, when using this
actor, you must enforce action space constraints within the environment.
actor = rlContinuousDeterministicActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames) specifies the names of the network input
layers to be associated with the environment observation channels. The function assigns, in
sequential order, each environment observation channel specified in observationInfo to the layer
specified by the corresponding name in the string array netObsNames. Therefore, the network input
layers, ordered as the names in netObsNames, must have the same data type and dimensions as the
observation specifications, as ordered in observationInfo.
actor = rlContinuousDeterministicActor({basisFcn,W0},observationInfo,
actionInfo) creates a continuous deterministic actor object using a custom basis function as
underlying approximator. The first input argument is a two-element cell array whose first element is
the handle basisFcn to a custom basis function and whose second element is the initial weight
vector W0. This function sets the ObservationInfo and ActionInfo properties of actor to the
observationInfo and actionInfo input arguments, respectively.
Input Arguments
Deep neural network used as the underlying approximator within the actor, specified as a dlnetwork
object or as another Deep Learning Toolbox neural network object (which is converted internally, as
described in the following note).
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.
The network must have the environment observation channels as inputs and a single output layer
representing the action.
The learnable parameters of the actor are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Network input layer names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. When you use the ObservationInputNames
name-value argument with netObsNames, the function assigns, in sequential order, each environment
observation channel specified in observationInfo to the network input layer specified by the
corresponding name in the string array netObsNames.
Therefore, the network input layers, ordered as the names in netObsNames, must have the same
data type and dimensions as the observation specifications, as ordered in observationInfo.
Note Of the information specified in observationInfo, the function only uses the data type and
dimension of each channel, but not its (optional) name or description.
Example: {"NetInput1_airspeed","NetInput2_altitude"}
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The action
to be taken based on the current observation, which is the output of the actor, is the vector a =
W'*B, where W is a weight matrix containing the learnable parameters and B is the column vector
returned by the custom basis function.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo.
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]
Initial value of the basis function weights W, specified as a matrix having as many rows as the length
of the vector returned by the basis function and as many columns as the dimension of the action
space.
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
Action specifications for a continuous action space, specified as an rlNumericSpec object defining
properties such as dimensions, data type and name of the action signals.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually.
For custom basis function representations, the action signal must be a scalar, a column vector, or a
discrete action.
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox™ software and a CUDA® enabled
NVIDIA® GPU. For more information on supported GPUs see “GPU Computing Requirements”
(Parallel Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning
agent
getAction Obtain action from agent, actor, or policy object given environment
observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
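A minimal sketch of these specification objects, matching the dimensions used in the rest of this example:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);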
To approximate the policy within the actor, use a deep neural network. The input of the network must
accept a four-element vector (the observation vector just defined by obsInfo), and its output must
be the action and be a two-element vector, as defined by actInfo.
net = [featureInputLayer(4)
fullyConnectedLayer(2)];
Convert the network to a dlnetwork object and display the number of learnable parameters.
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 10
Inputs:
1 'input' 4 features
Create the actor object with rlContinuousDeterministicActor, using the network and the
observation and action specification objects as input arguments. The network input layer is
automatically associated with the environment observation channel according to the dimension
specifications in obsInfo.
actor = rlContinuousDeterministicActor(net,obsInfo,actInfo)

actor =
rlContinuousDeterministicActor with properties:
To check your actor, use getAction to return the action from a random observation, using the
current network weights.
act = getAction(actor, ...
{rand(obsInfo.Dimension)});
act{1}
-0.5054
1.5390
You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
actInfo = rlNumericSpec([2 1]);
To approximate the policy within the actor, use a deep neural network. The input of the network (here
called myobs) must accept a four-element vector (the observation vector just defined by obsInfo),
and its output must be the action (here called myact) and be a two-element vector, as defined by
actInfo.
Create the network as an array of layer objects. Name the network input layer netObsIn so you can
later explicitly associate it to the observation input channel.
net = [
featureInputLayer(4,Name="netObsIn")
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(2)];
Convert the network to a dlnetwork object, and display the number of learnable parameters.
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'netObsIn' 4 features
Create the actor object with rlContinuousDeterministicActor, using the network, the
observation and action specification objects, and the name of the network input layer to be associated
with the environment observation channel.
actor = rlContinuousDeterministicActor(net,obsInfo,actInfo, ...
ObservationInputNames="netObsIn")

actor =
rlContinuousDeterministicActor with properties:
To check your actor, use getAction to return the action from a random observation, using the
current network weights.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
0.4013
0.0578
You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels, the first is a two-by-two continuous matrix and the second is a scalar that
can assume only two values, 0 and 1.
Create a continuous action space specification object (or alternatively use getActionInfo to extract
the specification object from an environment). For this example, define the action space as a
continuous three-dimensional space, so that a single action is a column vector containing three
doubles.
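A minimal sketch of these specification objects, matching the channel dimensions and discrete values described above:

obsInfo = [rlNumericSpec([2 2]) rlFiniteSetSpec([0 1])];
actInfo = rlNumericSpec([3 1]);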
Create a custom basis function with two input arguments in which each output element is a function
of the observations defined by obsInfo.
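The original basis function is not shown in this extract. An illustrative example that returns a four-element column vector (matching the four rows of the weight matrix W0 defined below); the specific elements are arbitrary:

myBasisFcn = @(obsA,obsB) [obsA(1,1); obsA(2,1)*obsA(1,2); obsA(2,2)^2; obsB];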
The output of the actor is the vector W'*myBasisFcn(obsA,obsB), which is the action taken as a
result of the given observation. The weight matrix W contains the learnable parameters and must have
as many rows as the length of the basis function output and as many columns as the dimension of the
action space.
W0 = rand(4,3);
Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight matrix. The second and third arguments are, respectively, the
observation and action specification objects.
actor = rlContinuousDeterministicActor({myBasisFcn,W0},obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
To check your actor, use the getAction function to return the action from a given observation, using
the current parameter matrix.
a = getAction(actor,{rand(2,2),0})
a{1}
ans = 3×1
1.3192
0.8420
1.5053
Note that the actor does not enforce the set constraint for the discrete set elements.
a = getAction(actor,{rand(2,2),-1});
a{1}
ans = 3×1
2.7890
1.8375
3.0855
You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).
Create observation and action information. You can also obtain these specifications from an
environment. For this example, define the observation space as a continuous four-dimensional space,
so that a single observation is a column vector containing four doubles, and the action space as a
continuous two-dimensional space, so that a single action is a column vector containing two doubles.
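A minimal sketch of these specification objects:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);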
To approximate the policy within the actor, use a recurrent deep neural network. You can obtain the
dimension of the observation and action spaces from the environment specification objects.
Create a neural network as an array of layer objects. Since this network is recurrent, use a
sequenceInputLayer as the input layer and at least one lstmLayer.
net = [sequenceInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(10)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(20)
fullyConnectedLayer(actInfo.Dimension(1))
tanhLayer];
Convert the network to a dlnetwork object and display the number of learnable parameters.
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
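The actor creation call is not shown in this extract; a sketch consistent with the surrounding text:

actor = rlContinuousDeterministicActor(net,obsInfo,actInfo);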
To check your actor, use getAction to return the action from a random observation, given the
current network weights.
a = getAction(actor, ...
{rand(obsInfo.Dimension)});
a{1}
-0.0742
0.0158
You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).
Version History
Introduced in R2022a
See Also
Functions
rlDiscreteCategoricalActor | rlContinuousGaussianActor | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlContinuousDeterministicRewardFunction
Deterministic reward function approximator object for neural network-based environment
Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify the reward function approximator using an
rlContinuousDeterministicRewardFunction object. Do so when you do not know a ground-
truth reward signal for your environment but you expect the reward signal to be deterministic.
The reward function approximator object uses a deep neural network as internal approximation
model to predict the reward signal for the environment, given some combination of the current
observations, current actions, and next observations (depending on which input names you specify).
Creation
Syntax
rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,observationInfo,
actionInfo,Name=Value)
Description
rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,observationInfo,
actionInfo,Name=Value) creates the deterministic reward function approximator object
rwdFcnAppx using the deep neural network net and sets the ObservationInfo and ActionInfo
properties.
When creating a reward function you must specify the names of the deep neural network inputs using
a combination of the ObservationInputNames, ActionInputNames, and
NextObservationInputNames name-value arguments, as described in the input argument
descriptions.

You can also specify the UseDevice property using an optional name-value argument. For
example, to use a GPU for prediction, specify UseDevice="gpu".
Input Arguments
Deep neural network with a scalar output value, specified as a dlnetwork object.
The input layer names for this network must match the input names specified using the
ObservationInputNames, ActionInputNames, and NextObservationInputNames. The
dimensions of the input layers must match the dimensions of the corresponding observation and
action specifications in ObservationInfo and ActionInfo, respectively.
The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.
Action input layer names, specified as a string or string array. Specify ActionInputNames when you
expect the reward signal to depend on the current action value.
The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.
Next observation input layer names, specified as a string or string array. Specify
NextObservationInputNames when you expect the reward signal to depend on the next
environment observation.
The number of next observation input names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Training or simulating a network on a GPU involves device-specific numerical round-off errors. These
errors can produce different results compared to performing the same operations using a CPU.
Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the reward function, create a deep neural network. For this example, the network has
two input channels, one for the current action and one for the next observations. The single output
channel contains a scalar, which represents the value of the predicted reward.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specifications, and specify a name for the input layers, so you can
later explicitly associate them with the appropriate environment channel.
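The definitions of the next-observation and action input paths are not shown in this extract. A sketch that uses the layer names expected by the connectLayers calls below (the layer sizes are taken from the specification objects):

% Input path for the next observation
nextStatePath = featureInputLayer(obsInfo.Dimension(1),Name="nextState");

% Input path for the action
actionPath = featureInputLayer(actInfo.Dimension(1),Name="action");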
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC2")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu3")
fullyConnectedLayer(1,Name="reward")];
net = layerGraph(nextStatePath);
net = addLayers(net,actionPath);
net = addLayers(net,commonPath);
net = connectLayers(net,"nextState","concat/in1");
net = connectLayers(net,"action","concat/in2");
plot(net)
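This extract omits the conversion to a dlnetwork object whose summary appears below; a sketch of those calls:

net = dlnetwork(net);
summary(net)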
Initialized: true
Inputs:
1 'nextState' 4 features
2 'action' 1 features
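The creation of the reward function approximator object is not shown in this extract. A sketch using the input layer names defined above (no observation input names are specified, since the reward here depends only on the action and the next observation):

rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,obsInfo,actInfo, ...
    ActionInputNames="action", ...
    NextObservationInputNames="nextState");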
Using this reward function object, you can predict the next reward value based on the current action
and next observation. For example, predict the reward for a random action and next observation.
Since, for this example, only the action and the next observation influence the reward, use an empty
cell array for the current observation.
act = rand(actInfo.Dimension);
nxtobs = rand(obsInfo.Dimension);
reward = predict(rwdFcnAppx,{}, {act}, {nxtobs})
reward = single
0.1034
Version History
Introduced in R2022a
See Also
Objects
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction | rlContinuousGaussianRewardFunction |
rlNeuralNetworkEnvironment | rlIsDoneFunction | evaluate | gradient | accelerate
Topics
“Model-Based Policy Optimization Agents”
rlContinuousDeterministicTransitionFunction
Deterministic transition function approximator object for neural network-based environment
Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify deterministic transition function approximators using
rlContinuousDeterministicTransitionFunction objects.
A transition function approximator object uses a deep neural network to predict the next observations
based on the current observations and actions.
Creation
Syntax
tsnFcnAppx = rlContinuousDeterministicTransitionFunction(net,observationInfo,
actionInfo,Name=Value)
Description
tsnFcnAppx = rlContinuousDeterministicTransitionFunction(net,observationInfo,
actionInfo,Name=Value) creates a deterministic transition function approximator object using
the deep neural network net and sets the ObservationInfo and ActionInfo properties.
When creating a deterministic transition function approximator you must specify the names of the
deep neural network inputs and outputs using the ObservationInputNames, ActionInputNames,
and NextObservationOutputNames name-value pair arguments.
You can also specify the PredictDiff and UseDevice properties using optional name-value pair
arguments. For example, to use a GPU for prediction, specify UseDevice="gpu".
Input Arguments
The input layer names for this network must match the input names specified using
ObservationInputNames and ActionInputNames. The dimensions of the input layers must match
the dimensions of the corresponding observation and action specifications in ObservationInfo and
ActionInfo, respectively.
The output layer names for this network must match the output names specified using
NextObservationOutputNames. The dimensions of the input layers must match the dimensions of
the corresponding observation specifications in ObservationInfo.
The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.
The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.
The number of next observation output names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
PredictDiff — Option to predict the difference between the current observation and the
next observation
false (default) | true
Option to predict the difference between the current observation and the next observation, specified
as one of the following logical values.
• false — Select this option if net outputs the value of the next observation.
• true — Select this option if net outputs the difference between the next observation and the
current observation. In this case, the predict function computes the next observation by adding
the current observation to the output of net.
Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a deep neural network. The network has two input channels, one for the current observations
and one for the current actions. The single output channel is for the predicted next observation.
statePath = featureInputLayer(obsInfo.Dimension(1),...
Normalization="none",Name="state");
actionPath = featureInputLayer(actInfo.Dimension(1),...
Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64, Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObservation")];
tsnNet = layerGraph(statePath);
tsnNet = addLayers(tsnNet,actionPath);
tsnNet = addLayers(tsnNet,commonPath);
tsnNet = connectLayers(tsnNet,"state","concat/in1");
tsnNet = connectLayers(tsnNet,"action","concat/in2");
plot(tsnNet)
Convert the network to a dlnetwork object, then create the deterministic transition function object
using the layer names defined above.

tsnNet = dlnetwork(tsnNet);

tsnFcnAppx = rlContinuousDeterministicTransitionFunction(tsnNet,obsInfo,actInfo, ...
ObservationInputNames="state", ...
ActionInputNames="action", ...
NextObservationOutputNames="nextObservation");
Using this transition function object, you can predict the next observation based on the current
observation and action. For example, predict the next observation for a random observation and
action.
obs = rand(obsInfo.Dimension);
act = rand(actInfo.Dimension);
nextObsP = predict(tsnFcnAppx,{obs},{act})
nextObsP{1}
-0.1172
0.1168
0.0493
-0.0155
nextObsE = evaluate(tsnFcnAppx,{obs,act})
nextObsE{1}
-0.1172
0.1168
0.0493
-0.0155
Version History
Introduced in R2022a
See Also
Objects
rlNeuralNetworkEnvironment | rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | evaluate | gradient | accelerate
Topics
“Model-Based Policy Optimization Agents”
rlContinuousGaussianActor
Stochastic Gaussian actor with a continuous action space for reinforcement learning agents
Description
This object implements a function approximator to be used as a stochastic actor within a
reinforcement learning agent with a continuous action space. A continuous Gaussian actor takes an
environment state as input and returns as output a random action sampled from a parametrized
Gaussian probability distribution, thereby implementing a stochastic policy. After you create an
rlContinuousGaussianActor object, use it to create a suitable agent, such as an rlACAgent or
rlPGAgent agent. For more information on creating representations, see “Create Policies and Value
Functions”.
Creation
Syntax
actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=netStdvActName)
actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=
netStdActName,ObservationInputNames=netObsNames)
Description
actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=netStdvActName) creates a
Gaussian stochastic actor with a continuous action space using the deep neural network net as
function approximator. Here, net must have two differently named output layers, each with as many
elements as the number of dimensions of the action space, as specified in actionInfo. The two
output layers calculate the mean and standard deviation of each component of the action. The actor
uses these layers, according to the names specified in the strings netMeanActName and
netStdActName, to represent the Gaussian probability distribution from which the action is
sampled. The function sets the ObservationInfo and ActionInfo properties of actor to the input
arguments observationInfo and actionInfo, respectively.
Note actor does not enforce constraints set by the action specification, therefore, when using this
actor, you must enforce action space constraints within the environment.
actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=
netStdActName,ObservationInputNames=netObsNames) specifies the names of the network
input layers to be associated with the environment observation channels. The function assigns, in
sequential order, each environment observation channel specified in observationInfo to the layer
specified by the corresponding name in the string array netObsNames. Therefore, the network input
layers, ordered as the names in netObsNames, must have the same data type and dimensions as the
observation specifications, as ordered in observationInfo.
Input Arguments
Deep neural network used as the underlying approximator within the actor. The network must have
two differently named output layers each with as many elements as the number of dimensions of the
action space, as specified in actionInfo. The two output layers calculate the mean and standard
deviation of each component of the action. The actor uses these layers, according to the names
specified in the strings netMeanActName and netStdActName, to represent the Gaussian
probability distribution from which the action is sampled.
Note Standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore, the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, and the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
neural network object from the Deep Learning Toolbox. The resulting dlnet is the dlnetwork object
that you use for your critic or actor. This practice allows a greater level of insight and control for
cases in which the conversion is not straightforward and might require additional specifications.
The learnable parameters of the actor are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
netMeanActName — Names of the network output layers corresponding to the mean values
of the action channel
string | character vector
Names of the network output layers corresponding to the mean values of the action channel,
specified as a string or character vector. The actor uses this name to select the network output layer
that returns the mean values of each elements of the action channel. Therefore, this network output
layer must be named as indicated in netMeanActName. Furthermore, it must be a scaling layer that
scales the returned mean values to the desired action range.
Note Of the information specified in actionInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: "myNetOut_Force_Mean_Values"
Names of the network output layers corresponding to the standard deviations of the action channel,
specified as a string or character vector. The actor uses this name to select the network output layer
that returns the standard deviations of each elements of the action channel. Therefore, this network
output layer must be named as indicated in netStdvActName. Furthermore, it must be a softplus or
ReLU layer, to enforce nonnegativity of the returned standard deviations.
Note Of the information specified in actionInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: "myNetOut_Force_Standard_Deviations"
Network input layer names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. When you use the ObservationInputNames
name-value argument with netObsNames, the function assigns, in sequential order, each environment
observation channel specified in observationInfo to the network input layer specified by the
corresponding name in the string array netObsNames. Therefore, the network input layers, ordered
as the names in netObsNames, must have the same data type and dimensions as the observation
specifications, as ordered in observationInfo.
Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: {"NetInput1_airspeed","NetInput2_altitude"}
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
Action specifications, specified as an rlNumericSpec object. This object defines the properties of the
environment action channel, such as its dimensions, data type, and name. Note that the function does
not use the name of the action channel specified in actionInfo.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: 'UseDevice',"gpu"
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment
observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous five-dimensional space, so that a single observation is a column vector containing five
doubles.
obsInfo = rlNumericSpec([5 1]);
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
three-dimensional space, so that a single action is a column vector containing three doubles, each
between -10 and 10.
actInfo = rlNumericSpec([3 1], ...
LowerLimit=-10, ...
UpperLimit=10);
To approximate the policy within the actor, use a deep neural network.
For a continuous Gaussian actor, the network must take the observation signal as input and return
both a mean value and a standard deviation value for each action. Therefore it must have two output
layers (one for the mean values, the other for the standard deviation values), each having as many
elements as the dimension of the action space. You can obtain the dimensions of the observation and
action spaces from the environment specification objects (for example, prod(obsInfo.Dimension)
returns the number of observation elements regardless of whether the observation space is a column
vector, row vector, or matrix).
Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Create each network path as an array of layer objects. Specify a name for the input and output layers,
so you can later explicitly associate them with the correct channels.
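The path definitions themselves are not shown in this extract. A sketch consistent with the layer names used by the connectLayers calls and the actor creation below (the layer sizes are assumptions):

% Input path: observation input followed by a common fully connected layer
inPath = [ featureInputLayer(prod(obsInfo.Dimension),Name="netOin")
           fullyConnectedLayer(prod(actInfo.Dimension),Name="infc") ];

% Path for the mean values: scale the output to the action range
meanPath = [ tanhLayer(Name="tanhMean")
             fullyConnectedLayer(prod(actInfo.Dimension),Name="meanFC")
             scalingLayer(Name="scale",Scale=actInfo.UpperLimit) ];

% Path for the standard deviations: softplus enforces nonnegativity
stdPath = [ tanhLayer(Name="tanhStdv")
            fullyConnectedLayer(prod(actInfo.Dimension),Name="stdFC")
            softplusLayer(Name="splus") ];

% Assemble the layer graph
net = layerGraph(inPath);
net = addLayers(net,meanPath);
net = addLayers(net,stdPath);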
% Connect layers
net = connectLayers(net,"infc","tanhMean/in");
net = connectLayers(net,"infc","tanhStdv/in");
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 42
Inputs:
1 'netOin' 5 features
Create the actor with rlContinuousGaussianActor, using the network, the observation and action
specification objects, and the names of the network input and output layers.
actor = rlContinuousGaussianActor(net, obsInfo, actInfo, ...
ActionMeanOutputNames="scale",...
ActionStandardDeviationOutputNames="splus",...
ObservationInputNames="netOin");
To check your actor, use getAction to return an action from a random observation vector, using the
current network weights. Each of the three elements of the action vector is a random sample from the
Gaussian distribution with mean and standard deviation calculated, as a function of the current
observation, by the neural network.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
-12.0285
1.7628
10.8733
To return the Gaussian distribution of the action, given an observation, use evaluate.
dist = evaluate(actor,{rand(obsInfo.Dimension)});
dist{1}
-5.6127
3.9449
9.6213
dist{2}
0.8516
0.8366
0.7004
You can now use the actor to create a suitable agent (such as rlACAgent, rlPGAgent,
rlSACAgent, rlPPOAgent, or rlTRPOAgent).
Version History
Introduced in R2022a
See Also
Functions
rlContinuousDeterministicActor | rlDiscreteCategoricalActor | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlContinuousGaussianRewardFunction
Stochastic Gaussian reward function approximator object for neural network-based environment
Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify the reward function approximator using an rlContinuousGaussianRewardFunction
object. Do so when you do not know a ground-truth reward signal for your environment and you
expect the reward signal to be stochastic.

The reward function object uses a deep neural network as internal approximation model to predict
the reward signal for the environment, given some combination of the current observations, current
actions, and next observations (depending on which input names you specify).
Creation
Syntax
rwdFcnAppx = rlContinuousGaussianRewardFunction(net,observationInfo,
actionInfo,Name=Value)
Description
rwdFcnAppx = rlContinuousGaussianRewardFunction(net,observationInfo,
actionInfo,Name=Value) creates a stochastic reward function using the deep neural network net
and sets the ObservationInfo and ActionInfo properties.
When creating a reward function you must specify the names of the deep neural network inputs using
a combination of the ObservationInputNames, ActionInputNames, and
NextObservationInputNames name-value arguments, as described in the input argument
descriptions.
You must also specify the names of the deep neural network outputs using the
RewardMeanOutputName and RewardStandardDeviationOutputName name-value pair
arguments.
You can also specify the UseDevice property using an optional name-value pair argument. For
example, to use a GPU for prediction, specify UseDevice="gpu".
Input Arguments
Deep neural network with a scalar output value, specified as a dlnetwork object.
The input layer names for this network must match the input names specified using the
ObservationInputNames, ActionInputNames, and NextObservationInputNames. The
dimensions of the input layers must match the dimensions of the corresponding observation and
action specifications in ObservationInfo and ActionInfo, respectively.
The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.
Action input layer names, specified as a string or string array. Specify ActionInputNames when you
expect the reward signal to depend on the current action value.
The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.
Next observation input layer names, specified as a string or string array. Specify
NextObservationInputNames when you expect the reward signal to depend on the next
environment observation.
The number of next observation input names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a deep neural network. The network has three input channels: one for the current observations,
one for the current action, and one for the next observations. The single output channel is for the
predicted reward value.
statePath = featureInputLayer(obsInfo.Dimension(1),Name="obs");
actionPath = featureInputLayer(actInfo.Dimension(1),Name="action");
nextStatePath = featureInputLayer(obsInfo.Dimension(1),Name="nextObs");
commonPath = [concatenationLayer(1,3,Name="concat")
fullyConnectedLayer(32,Name="fc")
reluLayer(Name="relu1")
fullyConnectedLayer(32,Name="fc2")];
meanPath = [reluLayer(Name="rewardMeanRelu")
fullyConnectedLayer(1,Name="rewardMean")];
stdPath = [reluLayer(Name="rewardStdRelu")
fullyConnectedLayer(1,Name="rewardStdFc")
softplusLayer(Name="rewardStd")];
rwdNet = layerGraph(statePath);
rwdNet = addLayers(rwdNet,actionPath);
rwdNet = addLayers(rwdNet,nextStatePath);
rwdNet = addLayers(rwdNet,commonPath);
rwdNet = addLayers(rwdNet,meanPath);
rwdNet = addLayers(rwdNet,stdPath);
rwdNet = connectLayers(rwdNet,"nextObs","concat/in1");
rwdNet = connectLayers(rwdNet,"action","concat/in2");
rwdNet = connectLayers(rwdNet,"obs",'concat/in3');
rwdNet = connectLayers(rwdNet,"fc2","rewardMeanRelu");
rwdNet = connectLayers(rwdNet,"fc2","rewardStdRelu");
plot(rwdNet)
rwdNet = dlnetwork(rwdNet);
rwdFncAppx = rlContinuousGaussianRewardFunction(...
rwdNet,obsInfo,actInfo,...
ObservationInputNames="obs",...
ActionInputNames="action", ...
NextObservationInputNames="nextObs", ...
RewardMeanOutputNames="rewardMean", ...
RewardStandardDeviationOutputNames="rewardStd");
Using this reward function object, you can predict the reward value based on the current observation,
action, and next observation. For example, predict the reward for a random observation, action, and next observation. The
reward value is sampled from a Gaussian distribution with the mean and standard deviation output by
the reward network.
obs = rand(obsInfo.Dimension);
act = rand(actInfo.Dimension);
nextObs = rand(obsInfo.Dimension(1),1);
predRwd = predict(rwdFncAppx,{obs},{act},{nextObs})
predRwd = single
-0.1308
You can obtain the mean value and standard deviation of the Gaussian distribution for the predicted
reward using evaluate.
predRwdDist = evaluate(rwdFncAppx,{obs,act,nextObs})
Version History
Introduced in R2022a
See Also
Objects
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlNeuralNetworkEnvironment |
rlIsDoneFunction | evaluate | gradient | accelerate
Topics
“Model-Based Policy Optimization Agents”
rlContinuousGaussianTransitionFunction
Stochastic Gaussian transition function approximator object for neural network-based environment
Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify stochastic transition function approximators using
rlContinuousGaussianTransitionFunction objects.
A transition function approximator object uses a deep neural network as its internal approximation
model to predict the next observations based on the current observations and actions.
Creation
Syntax
tsnFcnAppx = rlContinuousGaussianTransitionFunction(net,observationInfo,
actionInfo,Name=Value)
Description
tsnFcnAppx = rlContinuousGaussianTransitionFunction(net,observationInfo,
actionInfo,Name=Value) creates the stochastic transition function approximator object
tsnFcnAppx using the deep neural network net and sets the ObservationInfo and ActionInfo
properties.
When creating a stochastic transition function approximator, you must specify the names of the deep
neural network inputs and outputs using the ObservationInputNames, ActionInputNames,
NextObservationMeanOutputNames, and NextObservationStandardDeviationOutputNames
name-value pair arguments.
You can also specify the PredictDiff and UseDevice properties using optional name-value pair
arguments. For example, to use a GPU for prediction, specify UseDevice="gpu".
Input Arguments
The input layer names for this network must match the input names specified using
ObservationInputNames and ActionInputNames. The dimensions of the input layers must match
the dimensions of the corresponding observation and action specifications in ObservationInfo and
ActionInfo, respectively.
The output layer names for this network must match the output names specified using
NextObservationMeanOutputNames and NextObservationStandardDeviationOutputNames.
The dimensions of the output layers must match the dimensions of the corresponding observation
specifications in ObservationInfo.
Name-Value Pair Arguments
The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.
The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.
Next observation mean output layer names, specified as a string or string array.
The number of next observation mean output names must match the length of ObservationInfo
and the order of the names must match the order of the specifications in ObservationInfo.
Next observation standard deviation output layer names, specified as a string or string array.
The number of next observation standard deviation output names must match the length of
ObservationInfo and the order of the names must match the order of the specifications in
ObservationInfo.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
PredictDiff — Option to predict the difference between the current observation and the
next observation
false (default) | true
Option to predict the difference between the current observation and the next observation, specified
as one of the following logical values.
• false — Select this option if net outputs the value of the next observation.
• true — Select this option if net outputs the difference between the next observation and the
current observation. In this case, the predict function computes the next observation by adding
the current observation to the output of net.
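As an informal sketch of the difference between the two settings (this is not the toolbox implementation; predDiff, obs, and dlNetOut are hypothetical names standing for the PredictDiff value, the current observation, and the raw network output):
% Sketch only: how PredictDiff changes the post-processing of the network output
if predDiff
    nextObs = obs + dlNetOut;   % network predicts the change in the observation
else
    nextObs = dlNetOut;         % network predicts the next observation directly
end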
Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Define the layers for the deep neural network. The network has two input channels, one for the
current observations and one for the current actions. The output of the network is the predicted
Gaussian distribution for each next observation. The two output channels correspond to the means
and standard deviations of these distributions.
statePath = featureInputLayer(obsInfo.Dimension(1),Name="obs");
actionPath = featureInputLayer(actInfo.Dimension(1),Name="act");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(32,Name="fc")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(32,Name="fc2")];
meanPath = [reluLayer(Name="nextObsMeanRelu")
fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObsMean")];
stdPath = [reluLayer(Name="nextObsStdRelu")
fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObsStdReluFull")
softplusLayer(Name="nextObsStd")];
tsnNet = layerGraph(statePath);
tsnNet = addLayers(tsnNet,actionPath);
tsnNet = addLayers(tsnNet,commonPath);
tsnNet = addLayers(tsnNet,meanPath);
tsnNet = addLayers(tsnNet,stdPath);
tsnNet = connectLayers(tsnNet,"obs","concat/in1");
tsnNet = connectLayers(tsnNet,"act","concat/in2");
tsnNet = connectLayers(tsnNet,"fc2","nextObsMeanRelu");
tsnNet = connectLayers(tsnNet,"fc2","nextObsStdRelu");
plot(tsnNet)
Convert the network to a dlnetwork object and create the transition function approximator object,
specifying the names of the network input and output layers.
tsnNet = dlnetwork(tsnNet);
tsnFcnAppx = rlContinuousGaussianTransitionFunction(...
tsnNet,obsInfo,actInfo,...
ObservationInputNames="obs", ...
ActionInputNames="act", ...
NextObservationMeanOutputNames="nextObsMean", ...
NextObservationStandardDeviationOutputNames="nextObsStd");
Using this transition function object, you can predict the next observation based on the current
observation and action. For example, predict the next observation for a random observation and
action. The next observation values are sampled from Gaussian distributions with the means and
standard deviations output by the transition network.
observation = rand(obsInfo.Dimension);
action = rand(actInfo.Dimension);
nextObs = predict(tsnFcnAppx,{observation},{action})
nextObs{1}
1.2414
0.7307
-0.5588
-0.9567
You can also obtain the mean value and standard deviation of the Gaussian distribution of the
predicted next observation using evaluate.
nextObsDist = evaluate(tsnFcnAppx,{observation,action})
Version History
Introduced in R2022a
See Also
Objects
rlContinuousDeterministicTransitionFunction | rlNeuralNetworkEnvironment
Topics
“Model-Based Policy Optimization Agents”
rlDDPGAgent
Deep deterministic policy gradient (DDPG) reinforcement learning agent
Description
The deep deterministic policy gradient (DDPG) algorithm is an actor-critic, model-free, online, off-
policy reinforcement learning method which computes an optimal policy that maximizes the long-
term reward. The action space can only be continuous.
For more information, see “Deep Deterministic Policy Gradient (DDPG) Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.
Creation
Syntax
agent = rlDDPGAgent(observationInfo,actionInfo)
agent = rlDDPGAgent(observationInfo,actionInfo,initOpts)
agent = rlDDPGAgent(actor,critic,agentOptions)
Description
Create Agent from Observation and Action Specifications
agent = rlDDPGAgent(observationInfo,actionInfo) creates a DDPG agent for an environment
with the given observation and action specifications, using default initialization options. The actor and
critic in the agent use default deep neural networks built from the observation specification
observationInfo and the action specification actionInfo.
agent = rlDDPGAgent( ___ ,agentOptions) creates a DDPG agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
Input Arguments
actor — Actor
rlContinuousDeterministicActor object
Actor that implements the policy, specified as an rlContinuousDeterministicActor object. For
more information on creating actors, see “Create Policies and Value Functions”.
critic — Critic
rlQValueFunction object
Critic, specified as an rlQValueFunction object. For more information on creating critics, see
“Create Policies and Value Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
Since a DDPG agent operates in a continuous action space, you must specify actionInfo as an
rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlNumericSpec.
If you create a DDPG agent with default actor and critic that use recurrent neural networks, the
default value of AgentOptions.SequenceLength is 32.
Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer. Here:
• S is the current observation of the environment.
• A is the action taken by the agent.
• R is the reward for taking action A.
• S' is the next observation after taking action A.
• D is the is-done signal after taking action A.
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force, applied to the mass,
ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a DDPG agent from the environment observation and action specifications.
agent = rlDDPGAgent(obsInfo,actInfo)
agent =
rlDDPGAgent with properties:
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension)})
You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.
rng(0)
Create a DDPG agent from the environment observation and action specifications.
agent = rlDDPGAgent(obsInfo,actInfo,initOpts);
Extract the deep neural networks from both the agent actor and critic.
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons.
criticNet.Layers
ans =
13x1 Layer array with layers:
plot(layerGraph(actorNet))
plot(layerGraph(criticNet))
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force ranging continuously from
-2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The actor and critic networks are initialized randomly. Ensure reproducibility by fixing the seed of the
random generator.
rng(0)
For DDPG agents, the critic estimates a Q-value function, therefore it must take both the observation
and action signals as inputs and return a scalar value.
To approximate the Q-value function within the critic, use a deep neural network. Define each
network path as an array of layer objects. Get the dimensions of the observation and action spaces
from the environment specification objects, and specify a name for the input layers, so you can later
explicitly associate them with the appropriate environment channel.
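The path-definition code itself does not survive here. A minimal sketch that is consistent with the connection commands and the network summary that follow (input layers named "netOin" and "netAin", a concatenation layer named "concat"; the hidden layer size is an arbitrary choice for illustration) is:
% Observation and action input paths (layer names assumed from the connection code below)
obsPath = featureInputLayer(prod(obsInfo.Dimension),Name="netOin");
actPath = featureInputLayer(prod(actInfo.Dimension),Name="netAin");

% Common path: concatenate both inputs and map them to a scalar Q-value
commonPath = [
    concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(32)
    reluLayer
    fullyConnectedLayer(1)
    ];

% Assemble the layer graph
cNet = layerGraph(obsPath);
cNet = addLayers(cNet,actPath);
cNet = addLayers(cNet,commonPath);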
% Connect paths
cNet = connectLayers(cNet,"netOin","concat/in1");
cNet = connectLayers(cNet,"netAin","concat/in2");
Convert the network to a dlnetwork object and summarize its properties.
cNet = dlnetwork(cNet);
summary(cNet)
Initialized: true
Inputs:
1 'netOin' 2 features
2 'netAin' 1 features
Create the critic using cNet, and the names of the input layers. DDPG agents use an
rlQValueFunction object to implement the critic.
critic = rlQValueFunction(cNet,obsInfo,actInfo,...
ObservationInputNames="netOin", ...
ActionInputNames="netAin");
getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.4260
To approximate the policy within the actor, use a neural network. For DDPG agents, the actor
executes a deterministic policy, which is implemented by a continuous deterministic actor. In this case
the network must take the observation signal as input and return an action. Therefore the output
layer must have as many elements as the number of dimensions of the action space.
% create a network to be used as underlying actor approximator
aNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))];
Convert the network to a dlnetwork object and summarize its properties.
aNet = dlnetwork(aNet);
summary(aNet)
Initialized: true
Inputs:
1 'input' 2 features
Create the actor using aNet. DDPG agents use an rlContinuousDeterministicActor object to
implement the actor.
actor = rlContinuousDeterministicActor(aNet,obsInfo,actInfo);
Create the DDPG agent using the actor and the critic.
agent = rlDDPGAgent(actor,critic);
Specify agent options, including training options for the actor and critic, using dot notation.
agent.AgentOptions.SampleTime=env.Ts;
agent.AgentOptions.TargetSmoothFactor=1e-3;
agent.AgentOptions.ExperienceBufferLength=1e6;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.MiniBatchSize=32;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate=1e-4;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold=1;
For this example, load the environment used in the example “Train DDPG Agent to Control Double
Integrator System”. The observation from the environment is a vector containing the position and
velocity of a mass. The action is a scalar representing a force ranging continuously from -2 to 2
Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For DDPG agents, the critic estimates a Q-value function, therefore it must take both the observation
and action signals as inputs and return a scalar value.
To approximate the Q-value function within the critic, use a recurrent neural network. Define each
network path as an array of layer objects. Get the dimensions of the observation spaces from the
environment specification object, and specify a name for the input layers, so you can later explicitly
associate them with the correct environment channel. To create a recurrent neural network, use
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.
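The path-definition code is again omitted here. A minimal sketch consistent with the connection commands and the network summary below (sequence input layers named "netOin" and "netAin", a concatenation layer named "cat"; the LSTM size is an arbitrary choice for illustration) is:
% Observation and action input paths (layer names assumed from the connection code below)
obsPath = sequenceInputLayer(prod(obsInfo.Dimension),Name="netOin");
actPath = sequenceInputLayer(prod(actInfo.Dimension),Name="netAin");

% Common path: concatenate, pass through an LSTM, and return a scalar Q-value
commonPath = [
    concatenationLayer(1,2,Name="cat")
    lstmLayer(16)
    reluLayer
    fullyConnectedLayer(1)
    ];

% Assemble the layer graph
cNet = layerGraph(obsPath);
cNet = addLayers(cNet,actPath);
cNet = addLayers(cNet,commonPath);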
% Connect paths
cNet = connectLayers(cNet,"netOin","cat/in1");
cNet = connectLayers(cNet,"netAin","cat/in2");
Convert the network to a dlnetwork object and summarize its properties.
cNet = dlnetwork(cNet);
summary(cNet)
Initialized: true
Inputs:
1 'netOin' Sequence input with 2 dimensions
2 'netAin' Sequence input with 1 dimensions
Create the critic using cNet, specifying the names of the input layers. DDPG agents use an
rlQValueFunction object to implement the critic.
critic = rlQValueFunction(cNet,obsInfo,actInfo,...
ObservationInputNames="netOin",ActionInputNames="netAin");
To check your critic, use getValue to return the value of a random observation and action, given the
current network weights.
getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.0074
Since the critic has a recurrent network, the actor must have a recurrent network too. For DDPG
agents, the actor executes a deterministic policy, which is implemented by a continuous deterministic
actor. In this case the network must take the observation signal as input and return an action.
Therefore the output layer must have as many elements as the number of dimensions of the action space.
Define the network as an array of layer objects, and get the dimensions of the observation and action
spaces from the environment specification objects.
aNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(10)
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension)) ];
Convert the network to a dlnetwork object and summarize its properties.
aNet = dlnetwork(aNet);
summary(aNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 2 dimensions
Create the actor using aNet. DDPG agents use an rlContinuousDeterministicActor object to
implement the actor.
actor = rlContinuousDeterministicActor(aNet,obsInfo,actInfo);
Specify optimizer options for the actor and the critic.
criticOpts = rlOptimizerOptions('LearnRate',5e-3,'GradientThreshold',1);
actorOpts = rlOptimizerOptions('LearnRate',1e-4,'GradientThreshold',1);
Specify agent options, including the optimizer options. To use a DDPG agent with recurrent neural
networks, you must specify a SequenceLength greater than 1.
agentOpts = rlDDPGAgentOptions(...
'SampleTime',env.Ts,...
'TargetSmoothFactor',1e-3,...
'ExperienceBufferLength',1e6,...
'DiscountFactor',0.99,...
'SequenceLength',20,...
'MiniBatchSize',32, ...
'CriticOptimizerOptions',criticOpts, ...
'ActorOptimizerOptions',actorOpts);
agent = rlDDPGAgent(actor,critic,agentOpts);
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Version History
Introduced in R2019a
See Also
rlAgentInitializationOptions | rlDDPGAgentOptions | rlQValueFunction |
rlContinuousDeterministicActor | Deep Network Designer
Topics
“Deep Deterministic Policy Gradient (DDPG) Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlDDPGAgentOptions
Options for DDPG agent
Description
Use an rlDDPGAgentOptions object to specify options for deep deterministic policy gradient
(DDPG) agents. To create a DDPG agent, use rlDDPGAgent.
For more information, see “Deep Deterministic Policy Gradient (DDPG) Agents”.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlDDPGAgentOptions
opt = rlDDPGAgentOptions(Name,Value)
Description
opt = rlDDPGAgentOptions creates an options object for use as an argument when creating a
DDPG agent using all default options. You can modify the object properties using dot notation.
opt = rlDDPGAgentOptions(Name,Value) creates the options set and sets its properties using
one or more name-value arguments.
Properties
NoiseOptions — Noise model options
OrnsteinUhlenbeckActionNoise object
For an agent with multiple actions, if the actions have different ranges and units, it is likely that each
action requires different noise model parameters. If the actions have similar ranges and units, you
can set the noise parameters for all actions to the same value.
For example, for an agent with two actions, set the standard deviation of each action to a different
value while using the same decay rate for both standard deviations.
opt = rlDDPGAgentOptions;
opt.NoiseOptions.StandardDeviation = [0.1 0.2];
opt.NoiseOptions.StandardDeviationDecayRate = 1e-4;
Smoothing factor for target actor and critic updates, specified as a positive scalar less than or equal
to 1. For more information, see “Target Update Methods”.
Number of steps between target actor and critic updates, specified as a positive integer. For more
information, see “Target Update Methods”.
Option for clearing the experience buffer before training, specified as a logical value.
Maximum batch-training trajectory length when using a recurrent neural network, specified as a
positive integer. This value must be greater than 1 when using a recurrent neural network and 1
otherwise.
Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the critic properties. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.
NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer
Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see [1], Chapter 7.
Note that if parallel training is enabled (that is if an rlTrainingOptions option object in which the
UseParallel property is set to true is passed to train) then NumStepsToLookAhead must be set
to 1, otherwise an error is generated. This guarantees that experiences are stored contiguously.
Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
Examples
Create an rlDDPGAgentOptions object that specifies the mini-batch size.
opt = rlDDPGAgentOptions('MiniBatchSize',48)
opt =
rlDDPGAgentOptions with properties:
TargetUpdateFrequency: 1
ResetExperienceBufferBeforeTraining: 1
SequenceLength: 1
MiniBatchSize: 48
NumStepsToLookAhead: 1
ExperienceBufferLength: 10000
SampleTime: 1
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Algorithms
Noise Model
At each sample time step k, the noise value v(k) is updated using the following formula, where Ts is
the agent sample time, and the initial value v(1) is defined by the InitialAction parameter.
v(k+1) = v(k) + MeanAttractionConstant.*(Mean - v(k)).*Ts + StandardDeviation(k).*randn(size(Mean)).*sqrt(Ts)
At each sample time step, the standard deviation decays as shown in the following code.
decayedStandardDeviation = CurrentStandardDeviation.*(1-StandardDeviationDecayRate);
CurrentStandardDeviation = max(decayedStandardDeviation,StandardDeviationMin);
You can calculate how many samples it will take for the standard deviation to be halved using this
simple formula.
halflife = log(0.5)/log(1-StandardDeviationDecayRate);
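For example, with a StandardDeviationDecayRate of 1e-4 (an illustrative value, not a default), log(0.5)/log(1-1e-4) is approximately 6931, so the standard deviation halves after roughly 6931 samples.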
For continuous action signals, it is important to set the noise standard deviation appropriately to
encourage exploration. It is common to set StandardDeviation*sqrt(Ts) to a value between 1%
and 10% of your action range.
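As a rough illustration with assumed numbers (not taken from a specific example in this page): for an action that ranges from -2 to 2 (a range of 4) and a sample time of 0.1 s, targeting 5% of the range gives
% Illustrative sizing only; assign the result to opt.NoiseOptions.StandardDeviation
sigma = 0.05*4/sqrt(0.1)   % approximately 0.63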
If your agent converges on local optima too quickly, promote agent exploration by increasing the
amount of noise; that is, by increasing the standard deviation. Also, to increase exploration, you can
reduce the StandardDeviationDecayRate.
Version History
Introduced in R2019a
The properties defining the probability distribution of the Ornstein-Uhlenbeck (OU) noise model have
been renamed. DDPG agents use OU noise for exploration.
The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the OU noise model, use the new property
names instead.
Update Code
This table shows how to update your code to use the new property names for rlDDPGAgentOptions
object ddpgopt.
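The table itself does not survive here. A hedged sketch of the kind of substitution it describes, using the relationship StandardDeviation = sqrt(Variance), is:
% Not recommended
% ddpgopt.NoiseOptions.Variance = 0.01;
% Recommended
ddpgopt.NoiseOptions.StandardDeviation = sqrt(0.01);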
Target update method settings for DDPG agents have changed. The following changes require
updates to your code:
• The TargetUpdateMethod option has been removed. Now, DDPG agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.
To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.
The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
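As an illustration of how these two properties combine (the values here are examples, not defaults), a purely periodic update copies the actor and critic parameters to their targets every few learning steps:
opt = rlDDPGAgentOptions;
opt.TargetUpdateFrequency = 4;   % update the targets every 4 learning steps
opt.TargetSmoothFactor = 1;      % copy the parameters directly, with no smoothing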
Update Code
This table shows some typical uses of rlDDPGAgentOptions and how to update your code to use the
new option configuration.
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.
See Also
Topics
“Deep Deterministic Policy Gradient (DDPG) Agents”
rlDeterministicActorPolicy
Policy object to generate continuous deterministic actions for custom training loops and application
deployment
Description
This object implements a deterministic policy, which returns continuous deterministic actions given
an input observation. You can create an rlDeterministicActorPolicy object from an
rlContinuousDeterministicActor or extract it from an rlDDPGAgent or rlTD3Agent. You can
then train the policy object using a custom training loop or deploy it for your application using
generatePolicyBlock or generatePolicyFunction. This policy is always deterministic and
does not perform any exploration. For more information on policies and value functions, see “Create
Policies and Value Functions”.
Creation
Syntax
policy = rlDeterministicActorPolicy(actor)
Description
policy = rlDeterministicActorPolicy(actor) creates the deterministic actor policy object
policy from the continuous deterministic actor actor. It also sets the Actor property of the policy
object to the input argument actor.
Properties
Actor — Continuous deterministic actor
rlContinuousDeterministicActor object
Continuous deterministic actor, specified as an rlContinuousDeterministicActor object.
Action specifications, specified as an rlNumericSpec object. This object defines the properties of the
environment action channel, such as its dimensions, data type, and name. Note that the name of the
action channel specified in actionInfo (if any) is not used.
Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.
Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2
Object Functions
generatePolicyBlock Generate Simulink block that evaluates policy of an agent or policy object
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
Examples
Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);
Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.
To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation and action spaces from
the environment specification objects.
layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
0.4013
0.0578
Create a policy object from actor.
policy = rlDeterministicActorPolicy(actor)
policy =
rlDeterministicActorPolicy with properties:
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
0.4313
-0.3002
You can now train the policy with a custom training loop and then deploy it to your application.
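For example, a minimal call that generates a deployable policy evaluation function from the policy object (using the default file names chosen by the toolbox) is:
generatePolicyFunction(policy);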
Version History
Introduced in R2022a
See Also
Functions
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlAdditiveNoisePolicy |
rlStochasticActorPolicy | rlTD3Agent | rlDDPGAgent | generatePolicyBlock |
generatePolicyFunction
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”
rlDeterministicActorRepresentation
(Not recommended) Deterministic actor representation for reinforcement learning agents
Description
This object implements a function approximator to be used as a deterministic actor within a
reinforcement learning agent with a continuous action space. A deterministic actor takes
observations as inputs and returns as outputs the action that maximizes the expected cumulative
long-term reward, thereby implementing a deterministic policy. After you create an
rlDeterministicActorRepresentation object, use it to create a suitable agent, such as an
rlDDPGAgent agent. For more information on creating representations, see “Create Policies and
Value Functions”.
Creation
Syntax
actor = rlDeterministicActorRepresentation(net,observationInfo,
actionInfo,'Observation',obsName,'Action',actName)
actor = rlDeterministicActorRepresentation({basisFcn,W0},observationInfo,
actionInfo)
actor = rlDeterministicActorRepresentation( ___ ,options)
Description
actor = rlDeterministicActorRepresentation(net,observationInfo,
actionInfo,'Observation',obsName,'Action',actName) creates a deterministic actor using
the deep neural network net as approximator. This syntax sets the ObservationInfo and ActionInfo
properties of actor to the inputs observationInfo and actionInfo, containing the specifications
for observations and actions, respectively. actionInfo must specify a continuous action space,
discrete action spaces are not supported. obsName must contain the names of the input layers of net
that are associated with the observation specifications. The action names actName must be the
names of the output layers of net that are associated with the action specifications.
actor = rlDeterministicActorRepresentation({basisFcn,W0},observationInfo,
actionInfo) creates a deterministic actor using a custom basis function as underlying
approximator. The first input argument is a two-elements cell in which the first element contains the
handle basisFcn to a custom basis function, and the second element contains the initial weight
matrix W0. This syntax sets the ObservationInfo and ActionInfo properties of actor respectively to
the inputs observationInfo and actionInfo.
This syntax sets the Options property of actor to the options input argument. You can use this
syntax with any of the previous input-argument combinations.
Input Arguments
Deep neural network used as the underlying approximator within the actor, specified as one of the
following:
The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names listed in obsName.
The network output layer must have the same data type and dimension as the signal defined in
ActionInfo. Its name must be the action name specified in actName.
For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the input layers in net.
Example: {'my_obs'}
Action name, specified as a single-element cell array that contains a character vector. It must be the
name of the output layer of net.
Example: {'my_act'}
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The action
to be taken based on the current observation, which is the output of the actor, is the vector a =
W'*B, where W is a weight matrix containing the learnable parameters and B is the column vector
returned by the custom basis function.
When creating a deterministic actor representation, your basis function must have the following
signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here obs1 to obsN are observations in the same order and with the same data type and dimensions
as the signals defined in observationInfo
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]
Initial value of the basis function weights, W, specified as a matrix having as many rows as the length
of the vector returned by the basis function and as many columns as the dimension of the action
space.
Properties
Options — Representation options
rlRepresentationOptions object
Action specifications for a continuous action space, specified as an rlNumericSpec object defining
properties such as dimensions, data type and name of the action signals. The deterministic actor
representation does not support discrete actions.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually.
For custom basis function representations, the action signal must be a scalar or a column vector.
Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment observations
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
actInfo = rlNumericSpec([2 1]);
Create a deep neural network approximator for the actor. The input of the network (here called
myobs) must accept a four-element vector (the observation vector just defined by obsInfo), and its
output must be the action (here called myact) and be a two-element vector, as defined by actInfo.
net = [featureInputLayer(4,'Normalization','none','Name','myobs')
fullyConnectedLayer(2,'Name','myact')];
Create the actor with rlDeterministicActorRepresentation, using the network, the observation
and action specification objects, as well as the names of the network input and output layers.
actor = rlDeterministicActorRepresentation(net,obsInfo,actInfo, ...
'Observation',{'myobs'},'Action',{'myact'})
actor =
rlDeterministicActorRepresentation with properties:
To check your actor, use getAction to return the action from a random observation, using the
current network weights.
act = getAction(actor,{rand(4,1)}); act{1}
-0.5054
1.5390
You can now use the actor to create a suitable agent (such as an rlACAgent, rlPGAgent, or
rlDDPGAgent agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles.
obsInfo = rlNumericSpec([3 1]);
The deterministic actor does not support discrete action spaces. Therefore, create a continuous
action space specification object (or alternatively use getActionInfo to extract the specification
object from an environment). For this example, define the action space as a continuous two-
dimensional space, so that a single action is a column vector containing 2 doubles.
actInfo = rlNumericSpec([2 1]);
Create a custom basis function. Each element is a function of the observations defined by obsInfo.
The output of the actor is the vector W'*myBasisFcn(myobs), which is the action taken as a result
of the given observation. The weight matrix W contains the learnable parameters and must have as
many rows as the length of the basis function output and as many columns as the dimension of the
action space.
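The definition of the basis function itself does not survive here. One possible sketch, consistent with W0 being 4-by-2 (four basis outputs, two action dimensions) and with a three-element observation, is the following hypothetical choice:
% Hypothetical basis function: four features of the three-element observation
myBasisFcn = @(myobs) [myobs(1); myobs(2); myobs(3); myobs(1)*myobs(2)];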
W0 = rand(4,2);
Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight matrix. The second and third arguments are, respectively, the
observation and action specification objects.
actor = rlDeterministicActorRepresentation({myBasisFcn,W0},obsInfo,actInfo)
actor =
rlDeterministicActorRepresentation with properties:
To check your actor, use the getAction function to return the action from a given observation, using
the current parameter matrix.
a = getAction(actor,{[1 2 3]'});
a{1}
ans =
2x1 dlarray
2.0595
2.3788
You can now use the actor (along with a critic) to create a suitable continuous action space agent.
Create observation and action information. You can also obtain these specifications from an
environment.
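The specification code is not shown here; one consistent possibility (the dimensions are assumed purely for illustration) is:
% Assumed observation and action dimensions for this sketch
obsinfo = rlNumericSpec([4 1]);
actinfo = rlNumericSpec([2 1]);
numObs = obsinfo.Dimension(1);
numAct = actinfo.Dimension(1);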
Create a recurrent deep neural network for the actor. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.
net = [sequenceInputLayer(numObs,'Normalization','none','Name','state')
fullyConnectedLayer(10,'Name','fc1')
reluLayer('Name','relu1')
lstmLayer(8,'OutputMode','sequence','Name','ActorLSTM')
fullyConnectedLayer(20,'Name','CriticStateFC2')
fullyConnectedLayer(numAct,'Name','action')
tanhLayer('Name','tanh1')];
actorOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(net,obsinfo,actinfo,...
'Observation',{'state'},'Action',{'tanh1'},actorOptions);
Version History
Introduced in R2020a
rlDeterministicActorRepresentation is not recommended. Use rlContinuousDeterministicActor
instead.
The following table shows some typical uses of rlDeterministicActorRepresentation, and how
to update your code to use rlContinuousDeterministicActor instead. The first table entry uses a
neural network, the second one uses a basis function.
rlDeterministicActorRepresentation (not recommended):
myActor = rlDeterministicActorRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames), with actInfo defining a continuous action space and net having observations as inputs and a single output layer with as many elements as the number of dimensions of the continuous action space.
rlContinuousDeterministicActor (recommended):
myActor = rlContinuousDeterministicActor(net,obsInfo,actInfo,'ObservationInputNames',obsNames). Use this syntax to create a deterministic actor object with a continuous action space.

rlDeterministicActorRepresentation (not recommended):
rep = rlDeterministicActorRepresentation({basisFcn,W0},obsInfo,actInfo), where the basis function has observations as inputs and actions as outputs, W0 is a matrix with as many columns as the number of dimensions of the action space, and actInfo defines a continuous action space.
rlContinuousDeterministicActor (recommended):
rep = rlContinuousDeterministicActor({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a deterministic actor object with a continuous action space.
See Also
Functions
rlContinuousDeterministicActor | rlRepresentationOptions | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlDiscreteCategoricalActor
Stochastic categorical actor with a discrete action space for reinforcement learning agents
Description
This object implements a function approximator to be used as a stochastic actor within a
reinforcement learning agent with a discrete action space. A discrete categorical actor takes an
environment state as input and returns as output a random action sampled from a categorical (also
known as Multinoulli) probability distribution of the expected cumulative long term reward, thereby
implementing a stochastic policy. After you create an rlDiscreteCategoricalActor object, use it
to create a suitable agent, such as rlACAgent or rlPGAgent. For more information on creating
representations, see “Create Policies and Value Functions”.
Creation
Syntax
actor = rlDiscreteCategoricalActor(net,observationInfo,actionInfo)
actor = rlDiscreteCategoricalActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames)
actor = rlDiscreteCategoricalActor({basisFcn,W0},observationInfo,actionInfo)
Description
actor = rlDiscreteCategoricalActor(net,observationInfo,actionInfo) creates a
stochastic actor with a discrete action space, using the deep neural network net as underlying
approximator. This syntax sets the ObservationInfo and ActionInfo properties of actor to the
inputs observationInfo and actionInfo, respectively.
Note actor does not enforce constraints set by the action specification; therefore, when using this
actor, you must enforce action space constraints within the environment.
actor = rlDiscreteCategoricalActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames) specifies the names of the network input
layers to be associated with the environment observation channels. The function assigns, in
sequential order, each environment observation channel specified in observationInfo to the layer
specified by the corresponding name in the string array netObsNames. Therefore, the network input
layers, ordered as the names in netObsNames, must have the same data type and dimensions as the
observation specifications, as ordered in observationInfo.
actor = rlDiscreteCategoricalActor({basisFcn,W0},observationInfo,actionInfo)
creates a discrete space stochastic actor using a custom basis function as underlying approximator.
The first input argument is a two-element cell array whose first element is the handle basisFcn to a
custom basis function and whose second element is the initial weight matrix W0. This function sets
the ObservationInfo and ActionInfo properties of actor to the inputs observationInfo and
actionInfo, respectively.
Input Arguments
Deep neural network used as the underlying approximator within the actor, specified as one of the
following:
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.
The network must have the environment observation channels as inputs and a single output layer
with as many elements as the number of possible discrete actions. Since the output of the network
must represent the probability of executing each possible action, the software automatically adds a
softmaxLayer as a final output layer if you do not specify it explicitly. When computing the action,
the actor then randomly samples the distribution to return an action.
The learnable parameters of the actor are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Network input layers names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. When you use this argument after
'ObservationInputNames', the function assigns, in sequential order, each environment
observation channel specified in observationInfo to each network input layer specified by the
corresponding name in the string array netObsNames. Therefore, the network input layers, ordered
as the names in netObsNames, must have the same data type and dimensions as the observation
specifications, as ordered in observationInfo.
Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: {"NetInput1_airspeed","NetInput2_altitude"}
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The number
of the action to be taken based on the current observation, which is the output of the actor, is
randomly sampled from a categorical distribution with probabilities p = softmax(W'*B), where W
is a weight matrix containing the learnable parameters and B is the column vector returned by the
custom basis function. Each element of p represents the probability of executing the corresponding
action from the observed state.
When creating the actor, your basis function must have the following signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo.
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]
Initial value of the basis function weights W, specified as a matrix having as many rows as the length
of the vector returned by the basis function and as many columns as the dimension of the action
space.
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the function
does not use the name of the action channel specified in actionInfo.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment
observations
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as consisting of
three values, -10, 0, and 10.
actInfo = rlFiniteSetSpec([-10 0 10]);
To approximate the policy within the actor, use a deep neural network.
The input of the network must accept a four-element vector (the observation vector just defined by
obsInfo), and its output must be a three-element vector. Each element of the output vector must be
between 0 and 1 since it represents the probability of executing each of the three possible actions (as
defined by actInfo). Using softmax as the output layer enforces this requirement (the software
automatically adds a softmaxLayer as a final output layer if you do not specify it explicitly). When
computing the action, the actor then randomly samples the distribution to return an action.
Create the network as an array of layer objects.
net = [ featureInputLayer(4)
fullyConnectedLayer(3)
softmaxLayer ];
Convert the network to a dlnetwork object and display the number of learnable parameters.
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 15
Inputs:
1 'input' 4 features
Create the actor with rlDiscreteCategoricalActor, using the network, the observations and
action specification objects. When the network has multiple input layers, they are automatically
associated with the environment observation channels according to the dimension specifications in
obsInfo.
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);
To check your actor, use getAction to return an action from a random observation vector, given the
current network weights.
act = getAction(actor,{rand(obsInfo.Dimension)});
act
You can now use the actor to create a suitable agent, such as rlACAgent, or rlPGAgent.
Create Discrete Categorical Actor from Deep Neural Network Specifying Input Layer Name
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as consisting of
three values, -10, 0, and 10.
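As in the previous example, the specification objects can be created as follows:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-10 0 10]);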
The input of the network (here called netOin) must accept a four-element vector (the observation
vector just defined by obsInfo), and its output (here called actionProb) must be a three-element
vector. Each element of the output vector must be between 0 and 1 since it represents the probability
of executing each of the three possible actions (as defined by actInfo). Using softmax as the output
layer enforces this requirement (however, the software automatically adds a softmaxLayer as a final
output layer if you do not specify it explicitly). When computing the action, the actor then randomly
samples the distribution to return an action.
Create a network as an array of layer objects. Specify a name for the input layer, so you can later
explicitly associate it with the observation channel.
net = [ featureInputLayer(4,Name="netOin")
fullyConnectedLayer(3)
softmaxLayer(Name="actionProb") ];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 15
Inputs:
1 'netOin' 4 features
Create the actor with rlDiscreteCategoricalActor, using the network, the observation and action specification objects, and the name of the network input layer.
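The creation call itself is not shown above. Following the pattern used for the recurrent actor later on this page, it presumably looks like this:

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo,Observation="netOin");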
To validate your actor, use getAction to return an action from a random observation, given the
current network weights.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
ans = 0
To return the probability distribution of the possible actions as a function of a random observation,
and given the current network weights, use evaluate.
prb = evaluate(actor,{rand(obsInfo.Dimension)})
prb{1}
0.5606
0.2619
0.1776
You can now use the actor to create a suitable agent, such as rlACAgent, or rlPGAgent.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels, the first being a two-dimensional vector in a continuous space, the second
being a two-dimensional vector that can assume only three values, -[1 1], [0 0], and [1 1]. Therefore a
single observation consists of two two-dimensional vectors, one continuous, the other discrete.
Create a discrete action space specification object (or alternatively use getActionInfo to extract
the specification object from an environment with a discrete action space). For this example, define
the action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).
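The specification objects do not appear above. A sketch consistent with the description (the channel order and element values are taken from the text) is:

obsInfo = [rlNumericSpec([2 1]) ...
           rlFiniteSetSpec({-[1 1],[0 0],[1 1]})];
actInfo = rlFiniteSetSpec([7 5 3]);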
Create a custom basis function. Each element is a function of the observation defined by obsInfo.
The output of the actor is the action, among the ones defined in actInfo, corresponding to the
element of softmax(W'*myBasisFcn(obsC,obsD)) which has the highest value. W is a weight
matrix, containing the learnable parameters, which must have as many rows as the length of the
basis function output, and as many columns as the number of possible actions.
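The basis function itself is omitted above. Any function handle that maps the two observation channels to a four-element column vector works with the weight matrix defined below; the one here is only a hypothetical placeholder:

myBasisFcn = @(obsC,obsD) [obsC(1); obsC(2); obsD(1); obsD(2)];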
W0 = rand(4,3);
Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.
actor = rlDiscreteCategoricalActor({myBasisFcn,W0},obsInfo,actInfo);
To check your actor, use the getAction function to return one of the three possible actions, depending on a given random observation and on the current parameter matrix.
getAction(actor,{rand(2,1),[1 1]})
getAction(actor,{rand(2,1),[0.5 -0.7]})
You can now use the actor (along with a critic) to create a suitable discrete action space agent (such as rlACAgent, rlPGAgent, or rlPPOAgent).
This example shows you how to create a stochastic actor with a discrete action space using a
recurrent neural network. You can also use a recurrent neural network for a continuous stochastic
actor.
For this example, use the same environment used in “Train PG Agent to Balance Cart-Pole System”.
Load the environment and obtain the observation and action specifications.
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the policy within the actor, use a recurrent deep neural network.
Create a neural network as an array of layer objects. To create a recurrent network, use a
sequenceInputLayer as the input layer (with size equal to the number of dimensions of the
observation channel) and include at least one lstmLayer.
Specify a name for the input layer, so you can later explicitly associate it with the observation
channel.
net = [
sequenceInputLayer( ...
prod(obsInfo.Dimension), ...
Name="netOin")
fullyConnectedLayer(8)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer( ...
numel(actInfo.Elements)) ];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'netOin' Sequence input with 4 dimensions
Create a discrete categorical actor using the network, the environment specifications, and the name
of the network input layer to be associated with the observation channel.
actor = rlDiscreteCategoricalActor(net, ...
obsInfo,actInfo,...
Observation="netOin");
To check your actor, use getAction to return one of the two possible actions, depending on a given random observation and on the current network weights.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
ans = -10
To return the probability of each of the two possible actions, use evaluate. Note that the type of the returned numbers is single, not double.
prob = evaluate(actor,{rand(obsInfo.Dimension)});
prob{1}
0.4704
0.5296
You can use getState and setState to extract and set the current state of the recurrent neural
network in the actor.
getState(actor)
To evaluate the actor using sequential observations, use the sequence length (time) dimension. For example, obtain actions for 5 independent sequences, each consisting of 9 sequential observations.
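The batched call does not appear above. A sketch of how it might look, using the third array dimension for the batch and the fourth for time:

[action,state] = getAction(actor, ...
    {rand([obsInfo.Dimension 5 9])});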
Display the action corresponding to the seventh element of the observation sequence in the fourth
sequence.
action = action{1};
action(1,1,4,7)
ans = 10
state
For more information on input and output format for recurrent neural networks, see the Algorithms
section of lstmLayer.
You can now use the actor (along with a critic) to create a suitable discrete action space agent (such as rlACAgent, rlPGAgent, or rlPPOAgent).
Version History
Introduced in R2022a
See Also
Functions
rlContinuousDeterministicActor | rlContinuousGaussianActor | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlDQNAgent
Deep Q-network (DQN) reinforcement learning agent
Description
The deep Q-network (DQN) algorithm is a model-free, online, off-policy reinforcement learning
method. A DQN agent is a value-based reinforcement learning agent that trains a critic to estimate
the return or future rewards. DQN is a variant of Q-learning, and it operates only within discrete
action spaces.
For more information, see “Deep Q-Network (DQN) Agents”. For more information on the different types of reinforcement learning agents, see “Reinforcement Learning Agents”.
Creation
Syntax
agent = rlDQNAgent(observationInfo,actionInfo)
agent = rlDQNAgent(observationInfo,actionInfo,initOpts)
agent = rlDQNAgent(critic)
agent = rlDQNAgent(critic,agentOptions)
Description
Create Agent from Observation and Action Specifications
agent = rlDQNAgent(observationInfo,actionInfo) creates a DQN agent for an environment with the given observation and action specifications, using default initialization options. agent = rlDQNAgent(observationInfo,actionInfo,initOpts) additionally uses the specified agent initialization options, initOpts.
Create Agent from Existing Critic
agent = rlDQNAgent(critic) creates a DQN agent with the specified critic network using a default option set for a DQN agent.
Specify Agent Options
agent = rlDQNAgent(critic,agentOptions) creates a DQN agent with the specified critic network and sets the AgentOptions property to the agentOptions input argument.
Input Arguments
critic — Critic
rlQValueFunction object | rlVectorQValueFunction object
Your critic can use a recurrent neural network as its function approximator. However, only
rlVectorQValueFunction supports recurrent neural networks. For an example, see “Create DQN
Agent with Recurrent Neural Network” on page 3-136.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying a critic object, the value of ObservationInfo matches the
value specified in critic.
Since a DQN agent operates in a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.
If you create the agent by specifying a critic object, the value of ActionInfo matches the value
specified in critic.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec.
If you create a DQN agent with a default critic that uses a recurrent neural network, the default value
of AgentOptions.SequenceLength is 32.
Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer. Here:
Option to use exploration policy when selecting actions, specified as one of the following logical values.
• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).
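The environment loading code is not shown above. Assuming the predefined pendulum-with-image environment used in that example, it might look like this:

env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);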
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.
rng(0)
Create a deep Q-network agent from the environment observation and action specifications.
agent = rlDQNAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a deep Q-network agent from the environment observation and action specifications, using the initialization options.
agent = rlDQNAgent(obsInfo,actInfo,initOpts);
criticNet = getModel(getCritic(agent));
The default DQN agent uses a multi-output Q-value critic approximator. A multi-output approximator
has observations as inputs and state-action values as outputs. Each output element represents the
expected cumulative long-term reward for taking the corresponding discrete action from the state
indicated by the observation inputs.
Display the layers of the critic network, and verify that each hidden fully connected layer has 128 neurons.
criticNet.Layers
ans =
11x1 Layer array with layers:
plot(layerGraph(criticNet))
To check your agent, use getAction to return the action from random observations.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment interface and obtain its observation and action specifications. For this
example load the predefined environment used for the “Train DQN Agent to Balance Cart-Pole
System” example. This environment has a continuous four-dimensional observation space (the
positions and velocities of both cart and pole) and a discrete one-dimensional action space consisting
of the application of two possible forces, -10N or 10N.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the Q-value function within the critic, use a deep neural network. For DQN agents
with a discrete action space, you have the option to create a multi-output Q-value function critic,
which is generally more efficient than a comparable single-output critic.
A network for this critic must take only the observation as input and return a vector of values for
each action. Therefore, it must have an input layer with as many elements as the dimension of the
observation space and an output layer having as many elements as the number of possible discrete
actions. Each output element represents the expected cumulative long-term reward following from
the observation given as input, when the corresponding action is taken.
Define the network as an array of layer objects, and get the dimensions of the observation space (that
is, prod(obsInfo.Dimension)) and the number of possible actions (that is,
numel(actInfo.Elements)) directly from the environment specification objects.
dnn = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
dnn = dlnetwork(dnn);
summary(dnn)
Initialized: true
Inputs:
1 'input' 4 features
Create the critic using rlVectorQValueFunction, the network dnn, and the observation and
action specifications.
critic = rlVectorQValueFunction(dnn,obsInfo,actInfo);
getValue(critic,{rand(obsInfo.Dimension)})
-0.0361
0.0913
agent = rlDQNAgent(critic)
agent =
rlDQNAgent with properties:
agent.AgentOptions.UseDoubleDQN=false;
agent.AgentOptions.TargetUpdateMethod="periodic";
agent.AgentOptions.TargetUpdateFrequency=4;
agent.AgentOptions.ExperienceBufferLength=100000;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.MiniBatchSize=256;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=1e-2;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Create an environment interface and obtain its observation and action specifications. For this
example load the predefined environment used for the “Train DQN Agent to Balance Cart-Pole
System” example. This environment has a continuous four-dimensional observation space (the
positions and velocities of both cart and pole) and a discrete one-dimensional action space consisting
of the application of two possible forces, -10N or 10N.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a deep neural network to be used as an approximation model within the critic. For DQN agents,
you have the option to create a multi-output Q-value function critic, which is generally more efficient
than a comparable single-output critic. However, for this example, create a single-output Q-value
function critic instead.
The network for this critic must have two input layers, one for the observation and the other for the
action, and return a scalar value representing the expected cumulative long-term reward following
from the given observation and action.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel.
% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netOin")
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24,Name="fcObsPath")];
% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="netAin")
fullyConnectedLayer(24,Name="fcActPath")];
% Connect layers
net = connectLayers(net,'fcObsPath','cat/in1');
net = connectLayers(net,'fcActPath','cat/in2');
% Plot network
plot(net)
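Presumably the graph is then converted to a dlnetwork object and summarized, producing the output below:

net = dlnetwork(net);
summary(net)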
Initialized: true
Inputs:
1 'netOin' 4 features
2 'netAin' 1 features
Create the critic using rlQValueFunction. Specify the names of the layers to be associated with the
observation and action channels.
critic = rlQValueFunction(net, ...
obsInfo, ...
actInfo, ...
ObservationInputNames="netOin", ...
ActionInputNames="netAin");
ans = single
-0.0232
agent = rlDQNAgent(critic)
agent =
rlDQNAgent with properties:
agent.AgentOptions.UseDoubleDQN=false;
agent.AgentOptions.TargetUpdateMethod="periodic";
agent.AgentOptions.TargetUpdateFrequency=4;
agent.AgentOptions.ExperienceBufferLength=100000;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.MiniBatchSize=256;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=1e-2;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
For this example load the predefined environment used for the “Train DQN Agent to Balance Cart-
Pole System” example. This environment has a continuous four-dimensional observation space (the
positions and velocities of both cart and pole) and a discrete one-dimensional action space consisting
of the application of two possible forces, -10N or 10N.
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the Q-value function within the critic, use a recurrent deep neural network. For DQN
agents, only the vector function approximator, rlVectorQValueFunction, supports recurrent
neural networks models. For vector Q-value function critics, the number of elements of the output
layer has to be equal to the number of possible actions: numel(actInfo.Elements).
Define the network as an array of layer objects. Get the dimensions of the observation space from the
environment specification object (prod(obsInfo.Dimension)). To create a recurrent neural
network, use a sequenceInputLayer as the input layer and include an lstmLayer as one of the
other network layers.
net = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
lstmLayer(20,OutputMode="sequence");
fullyConnectedLayer(20)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
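The conversion step does not appear above; presumably:

net = dlnetwork(net);
summary(net)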
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
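The critic creation and the check that produces the two values below are also not shown. A sketch:

critic = rlVectorQValueFunction(net,obsInfo,actInfo);
getValue(critic,{rand(obsInfo.Dimension)})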
0.0136
0.0067
Specify options for creating the DQN agent. To use a recurrent neural network, you must specify a
SequenceLength greater than 1.
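The options below reference a criticOptions variable that is not defined above. A plausible definition (the specific learning rate and gradient threshold are assumptions) is:

criticOptions = rlOptimizerOptions( ...
    LearnRate=1e-3, ...
    GradientThreshold=1);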
agentOptions = rlDQNAgentOptions(...
UseDoubleDQN=false, ...
TargetSmoothFactor=5e-3, ...
ExperienceBufferLength=1e6, ...
SequenceLength=32, ...
CriticOptimizerOptions=criticOptions);
agentOptions.EpsilonGreedyExploration.EpsilonDecay = 1e-4;
Create the agent. The actor and critic networks are initialized randomly.
agent = rlDQNAgent(critic,agentOptions);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Version History
Introduced in R2019a
See Also
rlAgentInitializationOptions | rlDQNAgentOptions | rlVectorQValueFunction |
rlQValueFunction | Deep Network Designer
Topics
“Deep Q-Network (DQN) Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlDQNAgentOptions
Options for DQN agent
Description
Use an rlDQNAgentOptions object to specify options for deep Q-network (DQN) agents. To create a
DQN agent, use rlDQNAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlDQNAgentOptions
opt = rlDQNAgentOptions(Name,Value)
Description
opt = rlDQNAgentOptions creates an options object for use as an argument when creating a DQN
agent using all default settings. You can modify the object properties using dot notation.
Properties
UseDoubleDQN — Flag for using double DQN
true (default) | false
Flag for using double DQN for value function target updates, specified as a logical value. For most applications, set UseDoubleDQN to true. For more information, see “Deep Q-Network (DQN) Agents”.
At the end of each training time step, if Epsilon is greater than EpsilonMin, then it is updated
using the following formula.
Epsilon = Epsilon*(1-EpsilonDecay)
If your agent converges on local optima too quickly, you can promote agent exploration by increasing
Epsilon.
To specify exploration options, use dot notation after creating the rlDQNAgentOptions object opt.
For example, set the epsilon value to 0.9.
opt.EpsilonGreedyExploration.Epsilon = 0.9;
Smoothing factor for target critic updates, specified as a positive scalar less than or equal to 1. For
more information, see “Target Update Methods”.
Number of steps between target critic updates, specified as a positive integer. For more information,
see “Target Update Methods”.
Option for clearing the experience buffer before training, specified as a logical value.
Maximum batch-training trajectory length when using a recurrent neural network for the critic,
specified as a positive integer. This value must be greater than 1 when using a recurrent neural
network for the critic and 1 otherwise.
Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the critic properties. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.
When using a recurrent neural network for the critic, MiniBatchSize is the number of experience
trajectories in a batch, where each trajectory has length equal to SequenceLength.
NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer
Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see chapter 7 of [1].
N-step Q learning is not supported when using a recurrent neural network for the critic. In this case,
NumStepsToLookAhead must be 1.
Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
Examples
opt = rlDQNAgentOptions('MiniBatchSize',48)
opt =
rlDQNAgentOptions with properties:
UseDoubleDQN: 1
EpsilonGreedyExploration: [1x1 rl.option.EpsilonGreedyExploration]
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
TargetSmoothFactor: 1.0000e-03
TargetUpdateFrequency: 1
ResetExperienceBufferBeforeTraining: 1
SequenceLength: 1
MiniBatchSize: 48
NumStepsToLookAhead: 1
ExperienceBufferLength: 10000
SampleTime: 1
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Version History
Introduced in R2019a
Target update method settings for DQN agents have changed. The following changes require updates
to your code:
• The TargetUpdateMethod option has been removed. Now, DQN agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.
To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.
The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
Update Code
This table shows some typical uses of rlDQNAgentOptions and how to update your code to use the
new option configuration.
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.
See Also
Topics
“Deep Q-Network (DQN) Agents”
rlEpsilonGreedyPolicy
Policy object to generate discrete epsilon-greedy actions for custom training loops
Description
This object implements an epsilon-greedy policy, which returns either the action that maximizes a
discrete action-space Q-value function, with probability 1-Epsilon, or a random action otherwise,
given an input observation. You can create an rlEpsilonGreedyPolicy object from an
rlQValueFunction or rlVectorQValueFunction object, or extract it from an rlQAgent,
rlDQNAgent or rlSARSAAgent. You can then train the policy object using a custom training loop or
deploy it for your application. If UseEpsilonGreedyAction is set to false, the policy is deterministic and therefore does not explore. This object is not compatible with generatePolicyBlock
and generatePolicyFunction. For more information on policies and value functions, see “Create
Policies and Value Functions”.
Creation
Syntax
policy = rlEpsilonGreedyPolicy(qValueFunction)
Description
policy = rlEpsilonGreedyPolicy(qValueFunction) creates the epsilon-greedy policy object policy from the discrete action-space Q-value function qValueFunction. It also sets the QValueFunction property of policy to the input argument qValueFunction.
Properties
QValueFunction — Discrete action-space Q-value function
rlQValueFunction object | rlVectorQValueFunction object
Option to enable epsilon-greedy actions, specified as a logical value: either true (default, enabling
epsilon-greedy actions, which helps exploration) or false (epsilon-greedy actions not enabled). When epsilon-greedy actions are disabled, the policy is deterministic and therefore does not explore.
Example: false
Option to enable epsilon decay, specified as a logical value: either true (default, enabling epsilon decay) or false (disabling epsilon decay).
Example: false
Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the name of
the action channel specified in actionInfo (if any) is not used.
Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.
Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2
Object Functions
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
Examples
Create observation and action specification objects. For this example, define the observation space as
a continuous four-dimensional space, so that a single observation is a column vector containing four
doubles, and the action space as a finite set consisting of two possible row vectors, [1 0] and [0 1].
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec({[1 0],[0 1]});
Create a vector Q-value function approximator to use as critic. A vector Q-value function must accept
an observation as input and return a single vector with as many elements as the number of possible
discrete actions.
To approximate the vector Q-value function within the critic, use a neural network. Define a single
path from the network input to its output as an array of layer objects.
layers = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(10)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Number of learnables: 72
Inputs:
1 'input' 4 features
Create a vector Q-value function using model, and the observation and action specifications.
qValueFcn = rlVectorQValueFunction(model,obsInfo,actInfo)
qValueFcn =
rlVectorQValueFunction with properties:
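The two values below presumably come from checking the critic with a random observation, for example:

getValue(qValueFcn,{rand(obsInfo.Dimension)})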
0.6486
-0.3103
policy = rlEpsilonGreedyPolicy(qValueFcn)
policy =
rlEpsilonGreedyPolicy with properties:
getAction(policy,{rand(obsInfo.Dimension)})
You can now train the policy with a custom training loop and then deploy it to your application.
Version History
Introduced in R2022a
See Also
Functions
rlMaxQPolicy | rlDeterministicActorPolicy | rlAdditiveNoisePolicy |
rlStochasticActorPolicy | rlQValueFunction | rlVectorQValueFunction | rlSARSAAgent
| rlQAgent | rlDQNAgent
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”
rlFiniteSetSpec
Create discrete action or observation data specifications for reinforcement learning environments
Description
An rlFiniteSetSpec object specifies discrete action or observation data specifications for
reinforcement learning environments.
Creation
Syntax
spec = rlFiniteSetSpec(elements)
Description
spec = rlFiniteSetSpec(elements) creates a data specification with a discrete set of actions or observations, setting the Elements property.
Properties
Elements — Set of valid actions or observations
vector | cell array
Set of valid actions or observations for the environment, specified as one of the following:
• Vector — Specify valid numeric values for a single action or single observation.
• Cell array — Specify valid numeric value combinations when you have more than one action or
observation. Each entry of the cell array must have the same dimensions.
Name of the rlFiniteSetSpec object, specified as a string. Use this property to set a meaningful
name for your finite set.
Description of the rlFiniteSetSpec object, specified as a string. Use this property to specify a
meaningful description of the finite set values.
If you specify Elements as a vector, then Dimension is [1 1]. Otherwise, if you specify a cell array,
then Dimension indicates the size of the entries in Elements.
Information about the type of data, specified as a string, such as "double" or "single".
Object Functions
rlSimulinkEnv Create reinforcement learning environment using dynamic model
implemented in Simulink
rlFunctionEnv Specify custom reinforcement learning environment dynamics
using functions
rlValueFunction Value function approximator object for reinforcement learning
agents
rlQValueFunction Q-Value function approximator object for reinforcement learning
agents
rlVectorQValueFunction Vector Q-value function approximator for reinforcement learning
agents
rlContinuousDeterministicActor Deterministic actor with a continuous action space for
reinforcement learning agents
rlDiscreteCategoricalActor Stochastic categorical actor with a discrete action space for
reinforcement learning agents
rlContinuousGaussianActor Stochastic Gaussian actor with a continuous action space for
reinforcement learning agents
Examples
For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.
mdl = 'rlSimplePendulumModel';
open_system(mdl)
Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.
The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.
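The creation code is not shown above; a sketch consistent with the display that follows (the limits shown are the defaults) is:

obsInfo = rlNumericSpec([3 1])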
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
The action is a scalar expressing the torque and can be one of three possible values, -2 Nm, 0 Nm and
2 Nm.
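A corresponding sketch for the action specification:

actInfo = rlFiniteSetSpec([-2 0 2])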
actInfo =
rlFiniteSetSpec with properties:
You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';
Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
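The creation code does not appear above. Based on the properties displayed below, it presumably looks like this:

agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)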
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
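The assignment itself is not shown; based on the ResetFcn property displayed below, it presumably is:

env.ResetFcn = @(in)setVariable(in,'theta0',randn,'Workspace',mdl)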
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on
If the actor for your reinforcement learning agent has multiple outputs, each with a discrete action
space, you can specify the possible discrete actions combinations using an rlFiniteSetSpec object.
Suppose that the valid values for a two-output system are [1 2] for the first output and [10 20 30]
for the second output. Create a discrete action space specification for all possible input combinations.
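The creation code is not shown above. A sketch enumerating all six combinations as a cell array of row vectors:

actionSpec = rlFiniteSetSpec({[1 10],[1 20],[1 30], ...
                              [2 10],[2 20],[2 30]})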
actionSpec =
rlFiniteSetSpec with properties:
Version History
Introduced in R2019a
See Also
rlNumericSpec | rlSimulinkEnv | getActionInfo | getObservationInfo |
rlValueRepresentation | rlQValueRepresentation |
rlDeterministicActorRepresentation | rlStochasticActorRepresentation |
rlFunctionEnv
rlFunctionEnv
Specify custom reinforcement learning environment dynamics using functions
Description
Use rlFunctionEnv to define a custom reinforcement learning environment. You provide MATLAB
functions that define the step and reset behavior for the environment. This object is useful when you
want to customize your environment beyond the predefined environments available with
rlPredefinedEnv.
Creation
Syntax
env = rlFunctionEnv(obsInfo,actInfo,stepfcn,resetfcn)
Description
env = rlFunctionEnv(obsInfo,actInfo,stepfcn,resetfcn) creates a reinforcement learning environment using the provided observation and action specifications, obsInfo and actInfo, and the custom step and reset functions, stepfcn and resetfcn.
Input Arguments
Properties
StepFcn — Step behavior for the environment
function name | function handle | anonymous function handle
Step behavior for the environment, specified as a function name, function handle, or anonymous
function.
StepFcn is a function that you provide which describes how the environment advances to the next
state from a given action. When using a function name or function handle, this function must have
two inputs and four outputs, as illustrated by the following signature.
[Observation,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals)
To use additional input arguments beyond the required set, specify StepFcn using an anonymous
function handle.
The step function computes the values of the observation and reward for the given action in the
environment. The required input and output arguments are as follows.
• Action — Current action, which must match the dimensions and data type specified in actInfo.
• Observation — Returned observation, which must match the dimensions and data types
specified in obsInfo.
• Reward — Reward for the current step, returned as a scalar value.
• IsDone — Logical value indicating whether to end the simulation episode. The step function that
you define can include logic to decide whether to end the simulation based on the observation,
reward, or any other values.
• LoggedSignals — Any data that you want to pass from one step to the next, specified as a
structure.
For an example showing multiple ways to define a step function, see “Create MATLAB Environment
Using Custom Functions”.
Reset behavior for the environment, specified as a function name, function handle, or anonymous function
handle.
The reset function that you provide must have no inputs and two outputs, as illustrated by the
following signature.
[InitialObservation,LoggedSignals] = myResetFunction
To use input arguments with your reset function, specify ResetFcn using an anonymous function
handle.
The reset function sets the environment to an initial state and computes the initial values of the
observation signals. For example, you can create a reset function that randomizes certain state
values, such that each training episode begins from different initial conditions.
The sim function calls the reset function to reset the environment at the start of each simulation, and
the train function calls it at the start of each training episode.
The InitialObservation output must match the dimensions and data type of obsInfo.
To pass information from the reset condition into the first step, specify that information in the reset
function as the output structure LoggedSignals.
For an example showing multiple ways to define a reset function, see “Create MATLAB Environment
Using Custom Functions”.
Information to pass to the next step, specified as a structure. When you create the environment,
whatever you define as the LoggedSignals output of ResetFcn initializes this property. When a
step occurs, the software populates this property with data to pass to the next step, as defined in
StepFcn.
Object Functions
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified environment
validateEnvironment Validate custom reinforcement learning environment
Examples
For this example, create an environment that represents a system for balancing a cart on a pole. The
observations from the environment are the cart position, cart velocity, pendulum angle, and
pendulum angle derivative. (For additional details about this environment, see “Create MATLAB
Environment Using Custom Functions”.) Create an observation specification for those signals.
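The specification code is omitted here. A sketch (the variable name oinfo matches the rlFunctionEnv call below):

oinfo = rlNumericSpec([4 1]);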
The environment has a discrete action space where the agent can apply one of two possible force
values to the cart, –10 N or 10 N. Create the action specification for those actions.
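Similarly, a sketch for the action specification (the variable name ActionInfo matches the rlFunctionEnv call below):

ActionInfo = rlFiniteSetSpec([-10 10]);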
Next, specify the custom step and reset functions. For this example, use the supplied functions
myResetFunction.m and myStepFunction.m. For details about these functions and how they are
constructed, see “Create MATLAB Environment Using Custom Functions”.
Construct the custom environment using the defined observation specification, action specification,
and function names.
env = rlFunctionEnv(oinfo,ActionInfo,'myStepFunction','myResetFunction');
You can create agents for env and train them within the environment as you would for any other
reinforcement learning environment.
As an alternative to using function names, you can specify the functions as function handles. For
more details and an example, see “Create MATLAB Environment Using Custom Functions”.
Version History
Introduced in R2019a
See Also
rlPredefinedEnv | rlCreateEnvTemplate
Topics
“Create MATLAB Environment Using Custom Functions”
rlIsDoneFunction
Is-done function approximator object for neural network-based environment
Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify the is-done function approximator using an rlIsDoneFunction object. Do so when you do
not know a ground-truth termination signal for your environment.
The is-done function approximator object uses a deep neural network as internal approximation
model to predict the termination signal for the environment given one of the following input
combinations.
Creation
Syntax
isdFcnAppx = rlIsDoneFunction(net,observationInfo,actionInfo,Name=Value)
Description
isdFcnAppx = rlIsDoneFunction(net,observationInfo,actionInfo,Name=Value)
creates the is-done function approximator object isdFcnAppx using the deep neural network net
and sets the ObservationInfo and ActionInfo properties.
When creating an is-done function approximator you must specify the names of the deep neural
network inputs using one of the following combinations of name-value pair arguments.
You can also specify the UseDeterministicPredict and UseDevice properties using optional
name-value pair arguments. For example, to use a GPU for prediction, specify UseDevice="gpu".
Input Arguments
Deep neural network with a scalar output value, specified as a dlnetwork object.
The input layer names for this network must match the input names specified using the
ObservationInputNames, ActionInputNames, and NextObservationInputNames. The
dimensions of the input layers must match the dimensions of the corresponding observation and
action specifications in ObservationInfo and ActionInfo, respectively.
The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.
Action input layer names, specified as a string or string array. Specify ActionInputNames when you
expect the termination signal to depend on the current action value.
The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.
Next observation input layer names, specified as a string or string array. Specify
NextObservationInputNames when you expect the termination signal to depend on the next
environment observation.
The number of next observation input names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Option to predict the terminal signal deterministically, specified as one of the following values.
Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the is-done function, use a deep neural network. The network has one input channel
for the next observations. The single output channel is for the predicted termination signal.
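The network definition does not appear above. A sketch consistent with the layerGraph call, the input name in the summary, and the two-class softmax output used with evaluate later (the hidden-layer sizes are assumptions) is:

commonPath = [
    featureInputLayer(obsInfo.Dimension(1),Name="nextState")
    fullyConnectedLayer(32)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer(Name="isdone")];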
net = layerGraph(commonPath);
plot(net)
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net);
Initialized: true
Inputs:
1 'nextState' 4 features
isDoneFcnAppx = rlIsDoneFunction(...
net,obsInfo,actInfo,...
NextObservationInputNames="nextState");
Using this is-done function approximator object, you can predict the termination signal based on the
next observation. For example, predict the termination signal for a random next observation. Since
for this example the termination signal only depends on the next observation, use empty cell arrays
for the current action and observation inputs.
nxtobs = rand(obsInfo.Dimension);
predIsDone = predict(isDoneFcnAppx,{},{},{nxtobs})
predIsDone = 0
predIsDoneProb = evaluate(isDoneFcnAppx,{nxtobs})
predIsDoneProb{1}
0.5405
0.4595
The first number is the probability of obtaining a 0 (no termination predicted), the second one is the
probability of obtaining a 1 (termination predicted).
Version History
Introduced in R2022a
See Also
Objects
rlNeuralNetworkEnvironment | rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | evaluate | accelerate | gradient
Topics
“Model-Based Policy Optimization Agents”
rlMaxQPolicy
Policy object to generate discrete max-Q actions for custom training loops and application
deployment
Description
This object implements a max-Q policy, which returns the action that maximizes a discrete action-
space Q-value function, given an input observation. You can create an rlMaxQPolicy object from an
rlQValueFunction or rlVectorQValueFunction object, or extract it from an rlQAgent,
rlDQNAgent or rlSARSAAgent. You can then train the policy object using a custom training loop or
deploy it for your application using generatePolicyBlock or generatePolicyFunction. This
policy is always deterministic and does not perform any exploration. For more information on policies
and value functions, see “Create Policies and Value Functions”.
Creation
Syntax
policy = rlMaxQPolicy(qValueFunction)
Description
policy = rlMaxQPolicy(qValueFunction) creates the max-Q policy object policy from the
discrete action-space Q-value function qValueFunction. It also sets the QValueFunction property
of policy to the input argument qValueFunction.
Properties
QValueFunction — Discrete action-space Q-value function
rlQValueFunction object
Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the name of
the action channel specified in actionInfo (if any) is not used.
Sample time of policy, specified as a positive scalar or as -1 (default). Setting this parameter to -1
allows for event-based simulations.
Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2
Object Functions
generatePolicyBlock Generate Simulink block that evaluates policy of an agent or policy object
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
Examples
Create observation and action specification objects. For this example, define the observation space as
a continuous four-dimensional space, so that a single observation is a column vector containing four
doubles, and the action space as a finite set consisting of two possible values, -1 and 1.
Alternatively, you can use getObservationInfo and getActionInfo to extract the specification
objects from an environment.
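A sketch of the corresponding specification objects:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);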
Create a vector Q-value function approximator to use as critic. A vector Q-value function must accept
an observation as input and return a single vector with as many elements as the number of possible
discrete actions.
To approximate the vector Q-value function within the critic, use a neural network. Define a single
path from the network input to its output as an array of layer objects.
layers = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(10)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Number of learnables: 72
Inputs:
1 'input' 4 features
Create a vector Q-value function using model, and the observation and action specifications.
qValueFcn = rlVectorQValueFunction(model,obsInfo,actInfo)
qValueFcn =
rlVectorQValueFunction with properties:
getValue(qValueFcn,{rand(obsInfo.Dimension)})
0.6486
-0.3103
policy = rlMaxQPolicy(qValueFcn)
policy =
rlMaxQPolicy with properties:
getAction(policy,{rand(obsInfo.Dimension)})
You can now train the policy with a custom training loop and then deploy it to your application.
Version History
Introduced in R2022a
See Also
Functions
rlEpsilonGreedyPolicy | rlDeterministicActorPolicy | rlAdditiveNoisePolicy |
rlStochasticActorPolicy | rlQValueFunction | rlVectorQValueFunction | rlSARSAAgent
| rlQAgent | rlDQNAgent | generatePolicyBlock | generatePolicyFunction
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”
rlMBPOAgent
Model-based policy optimization reinforcement learning agent
Description
The model-based policy optimization (MBPO) algorithm is a model-based, online, off-policy reinforcement learning method. An MBPO agent contains an internal model of the environment, which it uses to
generate additional experiences without interacting with the environment.
During training, the MBPO agent generates real experiences by interacting with the environment.
These experiences are used to train the internal environment model, which is used to generate
additional experiences. The training algorithm then uses both the real and generated experiences to
update the agent policy.
Creation
Syntax
agent = rlMBPOAgent(baseAgent,envModel)
agent = rlMBPOAgent( ___ ,agentOptions)
Description
agent = rlMBPOAgent(baseAgent,envModel) creates a model-based policy optimization agent with default options, using the specified base off-policy agent and neural network environment model. agent = rlMBPOAgent( ___ ,agentOptions) creates the agent and sets the AgentOptions property to the agentOptions input argument.
Properties
BaseAgent — Base reinforcement learning agent
rlDQNAgent | rlDDPGAgent | rlTD3Agent | rlSACAgent
For environments with a discrete action space, specify a DQN agent using an rlDQNAgent object.
For environments with a continuous action space, use one of the following agent objects.
Current roll-out horizon value, specified as a positive integer. For more information on setting the
initial horizon value and the horizon update method, see rlMBPOAgentOptions.
Model experience buffer, specified as an rlReplayMemory object. During training the agent stores
each of its generated experiences (S,A,R,S',D) in a buffer. Here:
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
The initial value of UseExplorationPolicy matches the value specified in BaseAgent. If you
change the value of UseExplorationPolicy in either the base agent or the MBPO agent, the same
value is used for the other agent.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
The initial value of SampleTime matches the value specified in BaseAgent. If you change the value
of SampleTime in either the base agent or the MBPO agent, the same value is used for the other
agent.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified environment
Examples
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a base off-policy agent. For this example, use a SAC agent.
agentOpts = rlSACAgentOptions;
agentOpts.MiniBatchSize = 256;
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);
baseagent = rlSACAgent(obsInfo,actInfo,initOpts,agentOpts);
getAction(baseagent,{rand(obsInfo.Dimension)})
The neural network environment uses a function approximator object to approximate the
environment transition function. The function approximator object uses one or more neural networks
as its approximation model. To account for modeling uncertainty, you can specify multiple transition
models. For this example, create a single transition model.
Create a neural network to use as approximation model within the transition function object. Define
each network path as an array of layer objects. Specify a name for the input and output layers, so you
can later explicitly associate them with the appropriate channel.
% Observation and action paths
obsPath = featureInputLayer(obsInfo.Dimension(1),Name="obsIn");
actionPath = featureInputLayer(actInfo.Dimension(1),Name="actIn");
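The common path and the assembly of the transition network are not shown in this excerpt. A minimal sketch follows; the output layer name "nextObsOut" and the 64-unit hidden layer are assumptions.
% Common path (sketch)
commonPath = [
    concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObsOut") ];
% Assemble the network
transNet = layerGraph(obsPath);
transNet = addLayers(transNet,actionPath);
transNet = addLayers(transNet,commonPath);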
% Connect layers
transNet = connectLayers(transNet,"obsIn","concat/in1");
transNet = connectLayers(transNet,"actIn","concat/in2");
transNet = dlnetwork(transNet);
summary(transNet)
Initialized: true
Inputs:
1 'obsIn' 4 features
2 'actIn' 1 features
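The creation of the transition function approximator object is also not shown. A minimal sketch, assuming the layer names used above (including the assumed output name "nextObsOut"):
transitionFcn = rlContinuousDeterministicTransitionFunction( ...
    transNet,obsInfo,actInfo, ...
    ObservationInputNames="obsIn", ...
    ActionInputNames="actIn", ...
    NextObservationOutputNames="nextObsOut");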
Create a neural network to use as a reward model for the reward function approximator object.
% Observation and action paths
actionPath = featureInputLayer(actInfo.Dimension(1),Name="actIn");
nextObsPath = featureInputLayer(obsInfo.Dimension(1),Name="nextObsIn");
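The common path and the assembly of the reward network are not shown in this excerpt. A minimal sketch, assuming a scalar output layer named "rewardOut":
% Common path (sketch)
commonPath = [
    concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1,Name="rewardOut") ];
% Assemble the network
rewardNet = layerGraph(nextObsPath);
rewardNet = addLayers(rewardNet,actionPath);
rewardNet = addLayers(rewardNet,commonPath);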
% Connect layers
rewardNet = connectLayers(rewardNet,"nextObsIn","concat/in1");
rewardNet = connectLayers(rewardNet,"actIn","concat/in2");
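The reward function approximator object is not created in this excerpt. A minimal sketch, assuming the input layer names defined above:
rewardNet = dlnetwork(rewardNet);
rewardFcn = rlContinuousDeterministicRewardFunction( ...
    rewardNet,obsInfo,actInfo, ...
    ActionInputNames="actIn", ...
    NextObservationInputNames="nextObsIn");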
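The is-done network and its function approximator object are also missing from this excerpt. A minimal sketch; the network architecture and the variable names isDoneNet and isDoneFcn are assumptions:
isDoneNet = [
    featureInputLayer(obsInfo.Dimension(1),Name="nextObsIn")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer(Name="isdone") ];
isDoneNet = dlnetwork(isDoneNet);
isDoneFcn = rlIsDoneFunction(isDoneNet,obsInfo,actInfo, ...
    NextObservationInputNames="nextObsIn");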
Create the neural network environment using the observation and action specifications and the three
function approximator objects.
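The environment creation call is not shown in this excerpt. A minimal sketch, using the approximator objects sketched above (the variable name generativeEnv matches the rlMBPOAgent call below):
generativeEnv = rlNeuralNetworkEnvironment( ...
    obsInfo,actInfo, ...
    transitionFcn,rewardFcn,isDoneFcn);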
Specify options for creating an MBPO agent. Specify the optimizer options for the transition network
and use default values for all other options.
MBPOAgentOpts = rlMBPOAgentOptions;
MBPOAgentOpts.TransitionOptimizerOptions = rlOptimizerOptions(...
LearnRate=1e-4,...
GradientThreshold=1.0);
agent = rlMBPOAgent(baseagent,generativeEnv,MBPOAgentOpts);
getAction(agent,{rand(obsInfo.Dimension)})
Version History
Introduced in R2022a
See Also
Objects
rlMBPOAgentOptions | rlNeuralNetworkEnvironment
Topics
“Model-Based Policy Optimization Agents”
“Train MBPO Agent to Balance Cart-Pole System”
rlMBPOAgentOptions
Options for MBPO agent
Description
Use an rlMBPOAgentOptions object to specify options for model-based policy optimization (MBPO)
agents. To create an MBPO agent, use rlMBPOAgent.
Creation
Syntax
opt = rlMBPOAgentOptions
opt = rlMBPOAgentOptions(Name=Value)
Description
opt = rlMBPOAgentOptions creates an option object for use as an argument when creating an
MBPO agent using all default options. You can modify the object properties using dot notation.
Properties
NumEpochForTrainingModel — Number of epochs
5 (default) | positive integer
Number of epochs for training the environment model, specified as a positive integer.
Number of mini-batches used in each environment model training epoch, specified as a positive
scalar or "all". When you set NumMiniBatches to "all", the agent selects the number of mini-
batches such that all samples in the base agent's experience buffer are used to train the model.
Size of random experience mini-batch for training environment model, specified as a positive integer.
During each model training episode, the agent randomly samples experiences from the experience
buffer when computing gradients for updating the environment model properties. Large mini-batches
reduce the variance when computing gradients but increase the computational effort.
Generated experience buffer size, specified as a positive integer. When the agent generates
experiences, they are added to the model experience buffer.
Ratio of real experiences in a mini-batch for agent training, specified as a nonnegative scalar less
than or equal to 1.
• rlOptimizerOptions object — When your neural network environment has a single transition
function or if you want to use the same options for multiple transition functions, specify a single
options object.
• Array of rlOptimizerOptions objects — When your neural network environment agent has
multiple transition functions and you want to use different optimizer options for the transition
functions, specify an array of options objects with length equal to the number of transition
functions.
Using these objects, you can specify training parameters for the transition deep neural network
approximators as well as the optimizer algorithms and parameters.
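For example, a minimal sketch of specifying different optimizer options for two transition models (the learning-rate values are arbitrary):
opt = rlMBPOAgentOptions;
opt.TransitionOptimizerOptions = [
    rlOptimizerOptions(LearnRate=1e-3)
    rlOptimizerOptions(LearnRate=5e-4)];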
If you have previously trained transition models and do not want the MBPO agent to modify these
models during training, set TransitionOptimizerOptions.LearnRate to 0.
Reward function optimizer options, specified as an rlOptimizerOptions object. Using this object,
you can specify training parameters for the reward deep neural network approximator as well as the
optimizer algorithm and its parameters.
If you specify a ground-truth reward function using a custom function, the MBPO agent ignores these
options.
If you have a previously trained reward model and do not want the MBPO agent to modify the model
during training, set RewardOptimizerOptions.LearnRate to 0.
Is-done function optimizer options, specified as an rlOptimizerOptions object. Using this object,
you can specify training parameters for the is-done deep neural network approximator as well as the
optimizer algorithm and its parameters.
If you specify a ground-truth is-done function using a custom function, the MBPO agent ignores these
options.
If you have a previously trained is-done model and do not want the MBPO agent to modify the model
during training, set IsDoneOptimizerOptions.LearnRate to 0.
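Taken together, a minimal sketch of freezing previously trained transition, reward, and is-done models:
opt = rlMBPOAgentOptions;
opt.TransitionOptimizerOptions.LearnRate = 0; % keep the pre-trained transition model fixed
opt.RewardOptimizerOptions.LearnRate = 0;     % keep the pre-trained reward model fixed
opt.IsDoneOptimizerOptions.LearnRate = 0;     % keep the pre-trained is-done model fixed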
Model roll-out options for controlling the number and length of generated experience trajectories,
specified as an rlModelRolloutOptions object with the following fields. At the start of each epoch,
the agent generates the roll-out trajectories and adds them to the model experience buffer. To modify
the roll-out options, use dot notation.
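For example, a minimal sketch that sets the initial roll-out horizon (the value is arbitrary) and keeps it fixed:
opt = rlMBPOAgentOptions;
opt.ModelRolloutOptions.RolloutHorizon = 2;
opt.ModelRolloutOptions.RolloutHorizonSchedule = "none";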
Option for increasing the horizon length, specified as one of the following values.
Number of epochs after which the horizon increases, specified as a positive integer. When
RolloutHorizonSchedule is "none" this option is ignored.
Maximum horizon length, specified as a positive integer greater than or equal to RolloutHorizon.
When RolloutHorizonSchedule is "none" this option is ignored.
Exploration model options for generating experiences using the internal environment model,
specified as one of the following:
• [] — Use the exploration policy of the base agent. You must use this option when training a SAC
base agent.
• EpsilonGreedyExploration object — You can use this option when training a DQN base agent.
• GaussianActionNoise object — You can use this option when training a DDPG or TD3 base
agent.
The exploration model uses only the initial noise option values and does not update the values during
training.
To specify NoiseOptions, create a default model object. Then, specify any nondefault model
properties using dot notation.
For more information on noise models, see “Noise Models” on page 3-176.
Object Functions
rlMBPOAgent Model-based policy optimization reinforcement learning agent
Examples
Create an MBPO agent options object, specifying the ratio of real experiences to use for training the
agent as 30%.
opt = rlMBPOAgentOptions(RealSampleRatio=0.3)
opt =
rlMBPOAgentOptions with properties:
NumEpochForTrainingModel: 1
NumMiniBatches: 10
MiniBatchSize: 128
TransitionOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
RewardOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
IsDoneOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
ModelExperienceBufferLength: 100000
ModelRolloutOptions: [1x1 rl.option.rlModelRolloutOptions]
RealSampleRatio: 0.3000
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the mini-batch size to 64.
opt.MiniBatchSize = 64;
Algorithms
Noise Models
Gaussian Action Noise
A GaussianActionNoise object has the following numeric value properties. When generating
experiences, MBPO agents do not update their exploration model parameters.
At each time step k, the Gaussian noise v is sampled as shown in the following code.
w = Mean + randn(ActionSize).*StandardDeviation(k);
v(k+1) = min(max(w,LowerLimit),UpperLimit);
Version History
Introduced in R2022a
See Also
Objects
rlMBPOAgent | rlNeuralNetworkEnvironment
Topics
“Model-Based Policy Optimization Agents”
rlMDPEnv
Create Markov decision process environment for reinforcement learning
Description
A Markov decision process (MDP) is a discrete time stochastic control process. It provides a
mathematical framework for modeling decision making in situations where outcomes are partly
random and partly under the control of the decision maker. MDPs are useful for studying optimization
problems solved using reinforcement learning. Use rlMDPEnv to create a Markov decision process
environment for reinforcement learning in MATLAB.
Creation
Syntax
env = rlMDPEnv(MDP)
Description
env = rlMDPEnv(MDP) creates a reinforcement learning environment env with the specified MDP
model.
Input Arguments
Properties
Model — Markov decision process model
GridWorld object | GenericMDP object
Object Functions
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
sim Simulate trained reinforcement learning agents within specified environment
train Train reinforcement learning agents within a specified environment
validateEnvironment Validate custom reinforcement learning environment
Examples
For this example, consider a 5-by-5 grid world with the following rules:
1 The grid world is 5-by-5 and bounded by borders, with four possible actions (North = 1, South = 2, East = 3,
West = 4).
2 The agent begins from cell [2,1] (second row, first column).
3 The agent receives a reward of +10 if it reaches the terminal state at cell [5,5] (blue).
4 The environment contains a special jump from cell [2,4] to cell [4,4] with a reward of +5.
5 The agent is blocked by obstacles in cells [3,3], [3,4], [3,5], and [4,3] (black cells).
6 All other actions result in a reward of -1.
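The call that creates the grid world object is not shown in this excerpt; presumably it uses createGridWorld, for example:
GW = createGridWorld(5,5)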
GW =
GridWorld with properties:
GridSize: [5 5]
CurrentState: "[1,1]"
States: [25x1 string]
Actions: [4x1 string]
T: [25x25x4 double]
R: [25x25x4 double]
ObstacleStates: [0x1 string]
TerminalStates: [0x1 string]
ProbabilityTolerance: 8.8818e-16
GW.CurrentState = '[2,1]';
GW.TerminalStates = '[5,5]';
GW.ObstacleStates = ["[3,3]";"[3,4]";"[3,5]";"[4,3]"];
Update the state transition matrix for the obstacle states and set the jump rule over the obstacle
states.
updateStateTranstionForObstacles(GW)
GW.T(state2idx(GW,"[2,4]"),:,:) = 0;
GW.T(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 1;
nS = numel(GW.States);
nA = numel(GW.Actions);
GW.R = -1*ones(nS,nS,nA);
GW.R(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 5;
GW.R(:,state2idx(GW,GW.TerminalStates),:) = 10;
Now, use rlMDPEnv to create a grid world environment using the GridWorld object GW.
env = rlMDPEnv(GW)
env =
rlMDPEnv with properties:
You can visualize the grid world environment using the plot function.
plot(env)
Version History
Introduced in R2019a
See Also
createGridWorld | rlPredefinedEnv
Topics
“Train Reinforcement Learning Agent in Basic Grid World”
“Create Custom Grid World Environments”
“Train Reinforcement Learning Agent in MDP Environment”
rlMultiAgentTrainingOptions
Options for training multiple reinforcement learning agents
Description
Use an rlMultiAgentTrainingOptions object to specify training options for multiple agents. To
train the agents, use train.
For more information on training agents, see “Train Reinforcement Learning Agents”.
Creation
Syntax
trainOpts = rlMultiAgentTrainingOptions
trainOpts = rlMultiAgentTrainingOptions(Name,Value)
Description
Properties
AgentGroups — Agent grouping indices
"auto" (default) | cell array of positive integers | cell array of integer arrays
Agent grouping indices, specified as a cell array of positive integers or a cell array of integer arrays.
For instance, consider a training scenario with 4 agents. You can group the agents in the following
ways:
trainOpts = rlMultiAgentTrainingOptions("AgentGroups","auto")
• Specify four agent groups with one agent in each group:
trainOpts = rlMultiAgentTrainingOptions("AgentGroups",{1,2,3,4})
• Specify two agent groups with two agents each:
trainOpts = rlMultiAgentTrainingOptions("AgentGroups",{[1,2],[3,4]})
• Specify three agent groups:
trainOpts = rlMultiAgentTrainingOptions("AgentGroups",{[1,4],2,3})
AgentGroups and LearningStrategy must be used together to specify whether agent groups
learn in a centralized manner or decentralized manner.
Example: AgentGroups={1,2,[3,4]}
Learning strategy for each agent group, specified as either "decentralized" or "centralized".
In decentralized training, agents collect their own set of experiences during the episodes and learn
independently from those experiences. In centralized training, the agents share the collected
experiences and learn from them together.
AgentGroups and LearningStrategy must be used together to specify whether agent groups
learn in a centralized manner or decentralized manner. For example, you can use the following
command to configure training for three agent groups with different learning strategies. The agents
with indices [1,2] and [3,5] learn in a centralized manner, while agent 4 learns in a decentralized
manner.
trainOpts = rlMultiAgentTrainingOptions(...
AgentGroups={[1,2],4,[3,5]},...
LearningStrategy=["centralized","decentralized","centralized"] )
Example: LearningStrategy="centralized"
Maximum number of episodes to train the agents, specified as a positive integer. Regardless of other
criteria for termination, training terminates after MaxEpisodes.
Example: MaxEpisodes=1000
Maximum number of steps to run per episode, specified as a positive integer. In general, you define
episode termination conditions in the environment. This value is the maximum number of steps to run
in the episode if other termination conditions are not met.
Example: MaxStepsPerEpisode=1000
Window length for averaging the scores, rewards, and number of steps for each agent, specified as a
scalar or vector.
Specify a scalar to apply the same window length to all agents. To use a different window length for
each agent, specify ScoreAveragingWindowLength as a vector. In this case, the order of the
elements in the vector corresponds to the order of the agents used during environment creation.
Example: ScoreAveragingWindowLength=10
• "AverageSteps" — Stop training when the running average number of steps per episode equals
or exceeds the critical value specified by the option StopTrainingValue. The average is
computed using the window specified by ScoreAveragingWindowLength.
• "AverageReward" — Stop training when the running average reward equals or exceeds the
critical value.
• "EpisodeReward" — Stop training when the reward in the current episode equals or exceeds the
critical value.
• "GlobalStepCount" — Stop training when the total number of steps in all episodes (the total
number of times the agent is invoked) equals or exceeds the critical value.
• "EpisodeCount" — Stop training when the number of training episodes equals or exceeds the
critical value.
Example: StopTrainingCriteria="AverageReward"
Specify a scalar to apply the same termination criterion to all agents. To use a different termination
criterion for each agent, specify StopTrainingValue as a vector. In this case, the order of the
elements in the vector corresponds to the order of the agents used during environment creation.
For a given agent, training ends when the termination condition specified by the
StopTrainingCriteria option equals or exceeds this value. For the other agents, the training
continues until:
Condition for saving agents during training, specified as one of the following strings:
Set this option to store candidate agents that perform well according to the criteria you specify. When
you set this option to a value other than "none", the software sets the SaveAgentValue option to
500. You can change that value to specify the condition for saving the agent.
For instance, suppose you want to store for further testing any agent that yields an episode reward
that equals or exceeds 100. To do so, set SaveAgentCriteria to "EpisodeReward" and set the
SaveAgentValue option to 100. When an episode reward equals or exceeds 100, train saves the
corresponding agent in a MAT file in the folder specified by the SaveAgentDirectory option. The
MAT file is called AgentK.mat, where K is the number of the corresponding episode. The agent is
stored within that MAT file as saved_agent.
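For example, a hypothetical snippet that loads such a saved agent after training (the episode number 250 is made up; the default folder name savedAgents is used):
data = load(fullfile("savedAgents","Agent250.mat"));
candidateAgent = data.saved_agent;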
Example: SaveAgentCriteria="EpisodeReward"
Critical value of the condition for saving agents, specified as a scalar or a vector.
Specify a scalar to apply the same saving criterion to each agent. To save the agents when one meets
a particular criterion, specify SaveAgentValue as a vector. In this case, the order of the elements in
the vector corresponds to the order of the agents used when creating the environment. When a
criterion for saving an agent is met, all agents are saved in the same MAT file.
When you specify a condition for saving candidate agents using SaveAgentCriteria, the software
sets this value to 500. Change the value to specify the condition for saving the agent. See the
SaveAgentCriteria option for more details.
Example: SaveAgentValue=100
Folder for saved agents, specified as a string or character vector. The folder name can contain a full
or relative path. When an episode occurs that satisfies the condition specified by the
SaveAgentCriteria and SaveAgentValue options, the software saves the agents in a MAT file in
this folder. If the folder does not exist, train creates it. When SaveAgentCriteria is "none", this
option is ignored and train does not create a folder.
Example: SaveAgentDirectory = pwd + "\run1\Agents"
Option to stop training when an error occurs during an episode, specified as "on" or "off". When
this option is "off", errors are captured and returned in the SimulationInfo output of train, and
training continues to the next episode.
Example: StopOnError = "off"
Option to display training progress on the command line, specified as the logical value false (0) or
true (1). Set to true to write information from each training episode to the MATLAB command line
during training.
Example: Verbose = true
Object Functions
train Train reinforcement learning agents within a specified environment
Examples
Create an options set for training 5 reinforcement learning agents. Set the maximum number of
episodes and the maximum number of steps per episode to 1000. Configure the options to stop
training when the average reward equals or exceeds 480, and turn on both the command-line display
and Reinforcement Learning Episode Manager for displaying training results. You can set the options
using name-value pair arguments when you create the options set. Any options that you do not
explicitly set have their default values.
trainOpts = rlMultiAgentTrainingOptions(...
AgentGroups={[1,2],3,[4,5]},...
LearningStrategy=["centralized","decentralized","centralized"],...
MaxEpisodes=1000,...
MaxStepsPerEpisode=1000,...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=480,...
Verbose=true,...
Plots="training-progress")
trainOpts =
rlMultiAgentTrainingOptions with properties:
Alternatively, create a default options set and use dot notation to change some of the values.
trainOpts = rlMultiAgentTrainingOptions;
trainOpts.AgentGroups = {[1,2],3,[4,5]};
trainOpts.LearningStrategy = ["centralized","decentralized","centralized"];
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 1000;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 480;
trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";
trainOpts
trainOpts =
rlMultiAgentTrainingOptions with properties:
You can now use trainOpts as an input argument to the train command.
Create an options object for concurrently training three agents in the same environment.
Set the maximum number of episodes and the maximum steps per episode to 1000. Configure the
options to stop training the first agent when its average reward over 5 episodes equals or exceeds
400, the second agent when its average reward over 10 episodes equals or exceeds 500, and the
third when its average reward over 15 episodes equals or exceeds 600. The order of agents is the one
used during environment creation.
Save the agents when the episode reward for the first agent equals or exceeds 100, the reward for the
second agent equals or exceeds 120, or the reward for the third agent equals or exceeds 140.
Turn on both the command-line display and Reinforcement Learning Episode Manager for displaying
training results. You can set the options using name-value pair arguments when you create the
options set. Any options that you do not explicitly set have their default values.
trainOpts = rlMultiAgentTrainingOptions(...
MaxEpisodes=1000,...
MaxStepsPerEpisode=1000,...
ScoreAveragingWindowLength=[5 10 15],...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=[400 500 600],...
SaveAgentCriteria="EpisodeReward",...
SaveAgentValue=[100 120 140],...
Verbose=true,...
Plots="training-progress")
trainOpts =
rlMultiAgentTrainingOptions with properties:
AgentGroups: "auto"
LearningStrategy: "decentralized"
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: [5 10 15]
StopTrainingCriteria: "AverageReward"
StopTrainingValue: [400 500 600]
SaveAgentCriteria: "EpisodeReward"
SaveAgentValue: [100 120 140]
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
Alternatively, create a default options set and use dot notation to change some of the values.
trainOpts = rlMultiAgentTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 1000;
trainOpts.ScoreAveragingWindowLength = [5 10 15];
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = [400 500 600];
trainOpts.SaveAgentCriteria = "EpisodeReward";
trainOpts.SaveAgentValue = [100 120 140];
trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";
trainOpts
trainOpts =
rlMultiAgentTrainingOptions with properties:
AgentGroups: "auto"
LearningStrategy: "decentralized"
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: [5 10 15]
StopTrainingCriteria: "AverageReward"
StopTrainingValue: [400 500 600]
SaveAgentCriteria: "EpisodeReward"
SaveAgentValue: [100 120 140]
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
You can specify a scalar to apply the same criterion to all agents. For example, use a window length of
10 for all three agents.
trainOpts.ScoreAveragingWindowLength = 10
trainOpts =
rlMultiAgentTrainingOptions with properties:
AgentGroups: "auto"
LearningStrategy: "decentralized"
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 10
StopTrainingCriteria: "AverageReward"
StopTrainingValue: [400 500 600]
SaveAgentCriteria: "EpisodeReward"
SaveAgentValue: [100 120 140]
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
You can now use trainOpts as an input argument to the train command.
Version History
Introduced in R2022a
See Also
train | rlTrainingOptions
Topics
“Reinforcement Learning Agents”
rlNeuralNetworkEnvironment
Environment model with deep neural network transition models
Description
Use an rlNeuralNetworkEnvironment object to create a reinforcement learning environment that
computes state transitions using deep neural networks. Using this environment object, you can:
• Create an internal environment model for a model-based policy optimization (MBPO) agent.
• Create an environment for training other types of reinforcement learning agents. You can identify
the state-transition network using experimental or simulated data.
Such environments can compute environment rewards and termination conditions using deep neural
networks or custom functions.
Creation
Syntax
env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,rewardFcn,
isDoneFcn)
Description
env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,rewardFcn,
isDoneFcn) creates a model for an environment with the observation and action specifications
specified in obsInfo and actInfo, respectively. This syntax sets the TransitionFcn, RewardFcn,
and IsDoneFcn properties.
Input Arguments
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Properties
TransitionFcn — Environment transition function
rlContinuousDeterministicTransitionFunction object |
rlContinuousGaussianTransitionFunction object | array of transition objects
• rlIsDoneFunction object — Use this option when you do not know a ground-truth termination
signal for your environment.
• Function handle — Use this option when you know a ground-truth termination signal for your
environment. When you use an rlNeuralNetworkEnvironment object to create an
rlMBPOAgent object, the custom is-done function must return a batch of termination signals
given a batch of inputs.
Observation values, specified as a cell array with length equal to the number of specification objects
in obsInfo. The order of the observations in Observation must match the order in obsInfo. Also,
the dimensions of each element of the cell array must match the dimensions of the corresponding
observation specification in obsInfo.
To evaluate whether the transition models are well-trained, you can manually evaluate the
environment for a given observation value using the step function. Specify the observation values
before calling step.
When you use this neural network environment object within an MBPO agent, this property is
ignored.
To evaluate whether the transition models are well-trained, you can manually evaluate the
environment for a given observation value using the step function. To select which transition model
in TransitionFcn to evaluate, specify the transition model index before calling step.
When you use this neural network environment object within an MBPO agent, this property is
ignored.
Object Functions
rlMBPOAgent Model-based policy optimization reinforcement learning agent
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numActions = actInfo.Dimension(1);
Create a deterministic transition function based on a deep neural network with two input channels
(current observations and actions) and one output channel (predicted next observation).
% Create network layers.
statePath = featureInputLayer(numObservations, ...
Normalization="none",Name="state");
actionPath = featureInputLayer(numActions, ...
Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64, Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(numObservations,Name="nextObservation")];
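The assembly of this network and the creation of the transition function approximator object are not shown in this excerpt. A minimal sketch, using the layer names defined above:
% Assemble the network and convert it to a dlnetwork object (sketch).
transNetwork = layerGraph(statePath);
transNetwork = addLayers(transNetwork,actionPath);
transNetwork = addLayers(transNetwork,commonPath);
transNetwork = connectLayers(transNetwork,"state","concat/in1");
transNetwork = connectLayers(transNetwork,"action","concat/in2");
transNetwork = dlnetwork(transNetwork);
% Create the transition function approximator object (sketch).
transitionFcn = rlContinuousDeterministicTransitionFunction( ...
    transNetwork,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");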
Create a deterministic reward function with two input channels (current action and next
observations) and one output channel (predicted reward value).
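The reward network itself is not shown in this excerpt. A minimal sketch follows; the layer architecture and the variable names are assumptions, but the resulting rewardFcn object is required by the environment creation call below.
% Reward network paths (sketch)
actionPathRwd = featureInputLayer(numActions,Normalization="none",Name="action");
nextStatePathRwd = featureInputLayer(numObservations,Normalization="none",Name="nextState");
commonPathRwd = [concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64,Name="rwdFC1")
    reluLayer
    fullyConnectedLayer(1,Name="reward")];
% Assemble the network and convert it to a dlnetwork object.
rewardNetwork = layerGraph(nextStatePathRwd);
rewardNetwork = addLayers(rewardNetwork,actionPathRwd);
rewardNetwork = addLayers(rewardNetwork,commonPathRwd);
rewardNetwork = connectLayers(rewardNetwork,"nextState","concat/in1");
rewardNetwork = connectLayers(rewardNetwork,"action","concat/in2");
rewardNetwork = dlnetwork(rewardNetwork);
% Create the reward function approximator object (sketch).
rewardFcn = rlContinuousDeterministicRewardFunction( ...
    rewardNetwork,obsInfo,actInfo, ...
    ActionInputNames="action", ...
    NextObservationInputNames="nextState");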
Create an is-done function with one input channel (next observations) and one output channel
(predicted termination signal).
% Create network layers.
commonPath = [featureInputLayer(numObservations, ...
Normalization="none",Name="nextState");
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(2,Name="isdone0")
softmaxLayer(Name="isdone")];
isDoneNetwork = layerGraph(commonPath);
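To create the is-done function approximator object from this network, a minimal sketch (the rlIsDoneFunction call assumes the "nextState" input layer name defined above):
isDoneNetwork = dlnetwork(isDoneNetwork);
isDoneFcn = rlIsDoneFunction(isDoneNetwork,obsInfo,actInfo, ...
    NextObservationInputNames="nextState");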
Create a neural network environment using the transition, reward, and is-done functions.
env = rlNeuralNetworkEnvironment( ...
obsInfo,actInfo, ...
transitionFcn,rewardFcn,isDoneFcn);
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numActions = actInfo.Dimension(1);
Create a deterministic transition function based on a deep neural network with two input channels
(current observations and actions) and one output channel (predicted next observation).
% Create network layers.
statePath = featureInputLayer(numObservations,...
Normalization="none",Name="state");
actionPath = featureInputLayer(numActions,...
Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64, Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(numObservations,Name="nextObservation")];
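As in the previous example, the network assembly and the creation of the transition function approximator object are not shown. A minimal sketch, using the layer names defined above:
transNetwork = layerGraph(statePath);
transNetwork = addLayers(transNetwork,actionPath);
transNetwork = addLayers(transNetwork,commonPath);
transNetwork = connectLayers(transNetwork,"state","concat/in1");
transNetwork = connectLayers(transNetwork,"action","concat/in2");
transNetwork = dlnetwork(transNetwork);
transitionFcn = rlContinuousDeterministicTransitionFunction( ...
    transNetwork,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");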
You can define a known reward function for your environment using a custom function. Your custom
reward function must take the observations, actions, and next observations as cell-array inputs and
return a scalar reward value. For this example, use the following custom reward function, which
computes the reward based on the next observation.
type cartPoleRewardFunction.m
if iscell(nextObs)
nextObs = nextObs{1};
end
x = nextObs(1,:);
distReward = 1 - abs(x)/xThreshold;
isDone = cartPoleIsDoneFunction(obs,action,nextObs);
reward = zeros(size(isDone));
reward(logical(isDone)) = penaltyForFalling;
reward(~logical(isDone)) = ...
You can define a known is-done function for your environment using a custom function. Your custom
is-done function must take the observations, actions, and next observations as cell-array inputs and
return a logical termination signal. For this example, use the following custom is-done function, which
computes the termination signal based on the next observation.
type cartPoleIsDoneFunction.m
if iscell(nextObs)
nextObs = nextObs{1};
end
x = nextObs(1,:);
theta = nextObs(3,:);
Create a neural network environment using the transition function object and the custom reward and
is-done functions.
env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,...
@cartPoleRewardFunction,@cartPoleIsDoneFunction);
Version History
Introduced in R2022a
See Also
Objects
rlMBPOAgent | rlMBPOAgentOptions | rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction
Topics
“Model-Based Policy Optimization Agents”
rlNumericSpec
Create continuous action or observation data specifications for reinforcement learning environments
Description
An rlNumericSpec object specifies continuous action or observation data specifications for
reinforcement learning environments.
Creation
Syntax
spec = rlNumericSpec(dimension)
spec = rlNumericSpec(dimension,Name,Value)
Description
Properties
LowerLimit — Lower limit of the data space
-Inf (default) | scalar | matrix
Lower limit of the data space, specified as a scalar or matrix of the same size as the data space. When
LowerLimit is specified as a scalar, rlNumericSpec applies it to all entries in the data space.
Upper limit of the data space, specified as a scalar or matrix of the same size as the data space. When
UpperLimit is specified as a scalar, rlNumericSpec applies it to all entries in the data space.
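For example, a minimal sketch of a three-element continuous channel with elementwise limits (the values are arbitrary):
spec = rlNumericSpec([3 1], ...
    LowerLimit=[-1;-Inf;0], ...
    UpperLimit=[1;Inf;10]);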
Information about the type of data, specified as a string, such as "double" or "single".
Object Functions
rlSimulinkEnv Create reinforcement learning environment using dynamic model
implemented in Simulink
rlFunctionEnv Specify custom reinforcement learning environment dynamics
using functions
rlValueFunction Value function approximator object for reinforcement learning
agents
rlQValueFunction Q-Value function approximator object for reinforcement learning
agents
rlVectorQValueFunction Vector Q-value function approximator for reinforcement learning
agents
rlContinuousDeterministicActor Deterministic actor with a continuous action space for
reinforcement learning agents
rlDiscreteCategoricalActor Stochastic categorical actor with a discrete action space for
reinforcement learning agents
rlContinuousGaussianActor Stochastic Gaussian actor with a continuous action space for
reinforcement learning agents
Examples
For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.
Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.
The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.
obsInfo = rlNumericSpec([3 1])
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: [0×0 string]
Description: [0×0 string]
Dimension: [3 1]
DataType: "double"
The action is a scalar expressing the torque, and it can take one of three possible values: -2, 0, or
2 Nm.
actInfo = rlFiniteSetSpec([-2 0 2])
actInfo =
rlFiniteSetSpec with properties:
You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';
Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
mdl = 'rlSimplePendulumModel';
agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
env.ResetFcn = @(in) setVariable(in,'theta0',randn,'Workspace',mdl)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on
Version History
Introduced in R2019a
See Also
rlFiniteSetSpec | rlSimulinkEnv | getActionInfo | getObservationInfo |
rlValueRepresentation | rlQValueRepresentation |
rlDeterministicActorRepresentation | rlStochasticActorRepresentation |
rlFunctionEnv
Topics
“Train DDPG Agent for Adaptive Cruise Control”
rlOptimizerOptions
Optimization options for actors and critics
Description
Use an rlOptimizerOptions object to specify an optimization options set for actors and critics.
Creation
Syntax
optOpts = rlOptimizerOptions
optOpts = rlOptimizerOptions(Name=Value)
Description
Properties
LearnRate — Learning rate used in training the actor or critic function approximator
0.01 (default) | positive scalar
Learning rate used in training the actor or critic function approximator, specified as a positive scalar.
If the learning rate is too low, then training takes a long time. If the learning rate is too high, then
training might reach a suboptimal result or diverge.
Example: LearnRate=0.025
GradientThreshold — Gradient threshold value for the training of the actor or critic
function approximator
Inf (default) | positive scalar
Gradient threshold value used in training the actor or critic function approximator, specified as Inf
or a positive scalar. If the gradient exceeds this value, the gradient is clipped as specified by the
GradientThresholdMethod option. Clipping the gradient limits how much the network parameters
can change in a training iteration.
Example: GradientThreshold=1
Gradient threshold method used in training the actor or critic function approximator. This is the
specific method used to clip gradient values that exceed the gradient threshold, and it is specified as
one of the following values.
For more information, see “Gradient Clipping” in the Algorithms section of trainingOptions in
Deep Learning Toolbox.
Example: GradientThresholdMethod="absolute-value"
Factor for L2 regularization (weight decay) used in training the actor or critic function approximator,
specified as a nonnegative scalar. For more information, see “L2 Regularization” in the Algorithms
section of trainingOptions in Deep Learning Toolbox.
To avoid overfitting when using a representation with many parameters, consider increasing the
L2RegularizationFactor option.
Example: L2RegularizationFactor=0.0005
Algorithm used for training the actor or critic function approximator, specified as one of the following
values.
• "adam" — Use the Adam (adaptive moment estimation) algorithm. You can specify the decay
rates of the gradient and squared gradient moving averages using the GradientDecayFactor
and SquaredGradientDecayFactor fields of the OptimizerParameters option.
• "sgdm" — Use the stochastic gradient descent with momentum (SGDM) algorithm. You can
specify the momentum value using the Momentum field of the OptimizerParameters option.
• "rmsprop" — Use the RMSProp algorithm. You can specify the decay rate of the squared gradient
moving average using the SquaredGradientDecayFactor field of the OptimizerParameters
option.
For more information about these algorithms, see “Stochastic Gradient Descent” in the Algorithms
section of trainingOptions in Deep Learning Toolbox.
Example: Optimizer="sgdm"
OptimizerParameters — Parameters for the training algorithm used for training the actor
or critic function approximator
OptimizerParameters object
Parameters for the training algorithm used for training the actor or critic function approximator,
specified as an OptimizerParameters object with the following parameters.
Momentum — Contribution of the previous step, specified as a scalar from 0 to 1. A value of 0 means
no contribution from the previous step. A value of 1 means maximal contribution.
To change property values, create an rlOptimizerOptions object and use dot notation to access
and change the properties of OptimizerParameters.
optOpts = rlOptimizerOptions;
optOpts.OptimizerParameters.GradientDecayFactor = 0.95;
Object Functions
rlQAgentOptions Options for Q-learning agent
rlSARSAAgentOptions Options for SARSA agent
rlDQNAgentOptions Options for DQN agent
rlPGAgentOptions Options for PG agent
rlDDPGAgentOptions Options for DDPG agent
rlTD3AgentOptions Options for TD3 agent
rlACAgentOptions Options for AC agent
Examples
Use rlOptimizerOptions to create a default optimizer option object to use for the training of a
critic function approximator.
myCriticOpts = rlOptimizerOptions
myCriticOpts =
rlOptimizerOptions with properties:
LearnRate: 0.0100
GradientThreshold: Inf
GradientThresholdMethod: "l2norm"
L2RegularizationFactor: 1.0000e-04
Algorithm: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]
Using dot notation, change the training algorithm to stochastic gradient descent with momentum and
set the value of the momentum parameter to 0.6.
myCriticOpts.Algorithm = "sgdm";
myCriticOpts.OptimizerParameters.Momentum = 0.6;
Create an AC agent option object, and set its CriticOptimizerOptions property to myCriticOpts.
myAgentOpt = rlACAgentOptions;
myAgentOpt.CriticOptimizerOptions = myCriticOpts;
You can now use myAgentOpt as the last input argument to rlACAgent when creating your AC agent.
Use rlOptimizerOptions to create an optimizer option object to use for the training of an actor
function approximator. Specify a learning rate of 0.2 and set the GradientThresholdMethod to
"absolute-value".
myActorOpts=rlOptimizerOptions(LearnRate=0.2, ...
GradientThresholdMethod="absolute-value")
myActorOpts =
rlOptimizerOptions with properties:
LearnRate: 0.2000
GradientThreshold: Inf
GradientThresholdMethod: "absolute-value"
L2RegularizationFactor: 1.0000e-04
Algorithm: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]
Using dot notation, change the gradient threshold to 10.
myActorOpts.GradientThreshold = 10;
Create an AC agent option object, and set its ActorOptimizerOptions property to myActorOpts.
myAgentOpt = rlACAgentOptions;
myAgentOpt.ActorOptimizerOptions = myActorOpts;
You can now use myAgentOpt as the last input argument to rlACAgent when creating your AC agent.
Version History
Introduced in R2022a
See Also
Functions
rlOptimizer
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlPGAgent
Policy gradient reinforcement learning agent
Description
The policy gradient (PG) algorithm is a model-free, online, on-policy reinforcement learning method.
A PG agent is a policy-based reinforcement learning agent that uses the REINFORCE algorithm to
directly compute an optimal policy which maximizes the long-term reward. The action space can be
either discrete or continuous.
For more information on PG agents and the REINFORCE algorithm, see “Policy Gradient Agents”. For
more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
agent = rlPGAgent(observationInfo,actionInfo)
agent = rlPGAgent(observationInfo,actionInfo,initOpts)
agent = rlPGAgent(actor)
agent = rlPGAgent(actor,critic)
Description
Create Agent from Observation and Action Specifications
agent = rlPGAgent(actor) creates a PG agent with the specified actor network. By default, the
UseBaseline property of the agent is false in this case.
agent = rlPGAgent(actor,critic) creates a PG agent with the specified actor and critic
networks. By default, the UseBaseline option is true in this case.
Specify Agent Options
agent = rlPGAgent( ___ ,agentOptions) creates a PG agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
Input Arguments
actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object
Baseline critic that estimates the discounted long-term reward, specified as an rlValueFunction
object. For more information on creating critic approximators, see “Create Policies and Value
Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.
For a continuous action space, you must specify actionInfo as an rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.
Note This option affects only simulation and deployment; it does not affect training.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified to execute every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
Examples
Create Discrete Policy Gradient Agent from Observation and Action Specifications
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to the pole).
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a policy gradient agent from the environment observation and action specifications.
agent = rlPGAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a policy gradient agent from the environment observation and action specifications.
agent = rlPGAgent(obsInfo,actInfo,initOpts);
Extract the deep neural networks from both the agent actor and critic.
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons.
criticNet.Layers
ans =
11x1 Layer array with layers:
Plot actor and critic networks, and display their number of weights.
plot(layerGraph(actorNet))
summary(actorNet)
Initialized: true
Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features
plot(layerGraph(criticNet))
summary(criticNet)
Initialized: true
Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Train PG Agent with
Baseline to Control Double Integrator System”. The observation from the environment is a vector
containing the position and velocity of a mass. The action is a scalar representing a force, applied to
the mass, having three possible values (-2, 0, or 2 Newton).
env = rlPredefinedEnv("DoubleIntegrator-Discrete");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
Elements: [-2 0 2]
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
For policy gradient agents, the baseline critic estimates a value function; therefore, it must take the
observation signal as input and return a scalar value.
Define the network as an array of layer objects, and get the dimension of the observation space from
the environment specification object.
baselineNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(1)];
baselineNet = dlnetwork(baselineNet);
summary(baselineNet)
Initialized: true
Inputs:
1 'input' 2 features
Create a critic to use as a baseline. Policy gradient agents use an rlValueFunction object to
implement the critic.
baseline = rlValueFunction(baselineNet,obsInfo);
getValue(baseline,{rand(obsInfo.Dimension)})
ans = single
-0.1204
To approximate the policy within the actor, use a deep neural network. For policy gradient agents, the
actor executes a stochastic policy, which for discrete action spaces is implemented by a discrete
categorical actor. In this case the network must take the observation signal as input and return a
probability for each action. Therefore the output layer must have as many elements as the number of
possible actions.
Define the network as an array of layer objects, and get the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'input' 2 features
Create the actor using rlDiscreteCategoricalActor, as well as the observation and action
specifications.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
Create the PG agent using the actor and the baseline critic.
agent = rlPGAgent(actor,baseline)
agent =
rlPGAgent with properties:
Specify options for the agent, including training options for the actor and critic.
agent.AgentOptions.UseBaseline = true;
agent.AgentOptions.DiscountFactor = 0.99;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Store these limits so you can easily access them later.
actInfo.LowerLimit = -2;
actInfo.UpperLimit = 2;
For policy gradient agents, the baseline critic estimates a value function; therefore, it must take the
observation signal as input and return a scalar value. To approximate the value function within the
baseline, use a neural network.
Define the network as an array of layer objects, and get the dimensions of the observation space from
the environment specification object.
baselineNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(1)];
baselineNet = dlnetwork(baselineNet);
summary(baselineNet)
Initialized: true
Inputs:
1 'input' 2 features
Create a critic to use as a baseline. Policy gradient agents use an rlValueFunction object to
implement the critic.
baseline = rlValueFunction(baselineNet,obsInfo);
getValue(baseline,{rand(obsInfo.Dimension)})
ans = single
-0.1204
To approximate the policy within the actor, use a deep neural network as approximation model. For
policy gradient agents, the actor executes a stochastic policy, which for continuous action spaces is
implemented by a continuous Gaussian actor. In this case the network must take the observation
signal as input and return both a mean value and a standard deviation value for each action.
Therefore, it must have two output layers (one for the mean values, the other for the standard
deviation values), each having as many elements as the dimension of the action space.
Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel.
% Input path
inPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="obs_in")
fullyConnectedLayer(32)
reluLayer(Name="ip_out") ];
% Mean path
meanPath = [
fullyConnectedLayer(16,Name="mp_fc1")
reluLayer
fullyConnectedLayer(1)
tanhLayer(Name="tanh"); % range: -1,1
scalingLayer(Name="mp_out", ...
Scale=actInfo.UpperLimit) ]; % range: -2,2
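The standard deviation path and the network assembly are not shown in this excerpt. A minimal sketch follows; the layer names "sp_fc1" and "sp_out" are assumptions chosen to match the output name used when creating the actor below.
% Standard deviation path (sketch)
sdevPath = [
    fullyConnectedLayer(16,Name="sp_fc1")
    reluLayer
    fullyConnectedLayer(1)
    softplusLayer(Name="sp_out") ]; % enforce nonnegativity
% Assemble the network and connect the input path to both output paths
actorNet = layerGraph(inPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,sdevPath);
actorNet = connectLayers(actorNet,"ip_out","mp_fc1");
actorNet = connectLayers(actorNet,"ip_out","sp_fc1");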
% plot network
plot(actorNet)
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'obs_in' 2 features
Create the actor using rlContinuousGaussianActor, together with actorNet, the observation
and action specifications, as well as the names of the network input and output layers.
actor = rlContinuousGaussianActor(actorNet, ...
obsInfo,actInfo, ...
ObservationInputNames="obs_in", ...
ActionMeanOutputNames="mp_out", ...
ActionStandardDeviationOutputNames="sp_out");
Create the PG agent using the actor and the baseline critic.
agent = rlPGAgent(actor,baseline)
agent =
rlPGAgent with properties:
Specify options for the agent, including training options for the actor and critic.
agent.AgentOptions.UseBaseline = true;
agent.AgentOptions.DiscountFactor = 0.99;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
For this example, load the environment used in the example “Train PG Agent with Baseline to Control
Double Integrator System”. The observation from the environment is a vector containing the position
and velocity of a mass. The action is a scalar representing a force, applied to the mass, having three
possible values (-2, 0, or 2 Newton).
env = rlPredefinedEnv("DoubleIntegrator-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a critic to use as a baseline. For policy gradient agents, the baseline critic estimates a value
function, therefore it must take the observation signal as input and return a scalar value. To
approximate the value function within the critic, use a recurrent neural network.
Define the network as an array of layer objects, and get the dimension of the observation space from
the environment specification object. To create a recurrent neural network for the critic, use
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.
baselineNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(32)
reluLayer
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object and display a summary.
baselineNet = dlnetwork(baselineNet);
summary(baselineNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 2 dimensions
Create the critic based on the network approximator model. Policy gradient agents use an
rlValueFunction object to implement the critic.
baseline = rlValueFunction(baselineNet,obsInfo);
To check the baseline critic, return the value of a random observation.
getValue(baseline,{rand(obsInfo.Dimension)})
ans = single
-0.0065
Since the critic has a recurrent network, the actor must have a recurrent network too. Define a
recurrent neural network for the actor. For policy gradient agents, the actor executes a stochastic
policy, which for discrete action spaces is implemented by a discrete categorical actor. In this case
the network must take the observation signal as input and return a probability for each action.
Therefore the output layer must have as many elements as the number of possible actions.
Define the network as an array of layer objects, and get the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(32)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
Convert the network to a dlnetwork object and display a summary.
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 2 dimensions
Create the actor. Policy gradient agents use stochastic actors, which for discrete action spaces are
implemented by rlDiscreteCategoricalActor objects.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options, including training options for the actor and the critic.
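The optimizer option objects referenced below are rlOptimizerOptions objects; the learning rate and
gradient threshold values shown here are illustrative.
baselineOpts = rlOptimizerOptions(LearnRate=5e-3,GradientThreshold=1);
actorOpts = rlOptimizerOptions(LearnRate=5e-3,GradientThreshold=1);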
agentOpts = rlPGAgentOptions(...
'UseBaseline',true, ...
'DiscountFactor', 0.99, ...
'CriticOptimizerOptions',baselineOpts, ...
'ActorOptimizerOptions', actorOpts);
Create a PG agent using the actor, the baseline critic, and the agent options object.
agent = rlPGAgent(actor,baseline,agentOpts);
For PG agents with recurrent neural networks, the training sequence length is the whole episode.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Tips
• For continuous action spaces, the rlPGAgent agent does not enforce the constraints set by the
action specification, so you must enforce action space constraints within the environment.
Version History
Introduced in R2019a
See Also
rlAgentInitializationOptions | rlPGAgentOptions | rlQValueFunction |
rlDiscreteCategoricalActor | rlContinuousGaussianActor | Deep Network Designer
Topics
“Policy Gradient Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlPGAgentOptions
Options for PG agent
Description
Use an rlPGAgentOptions object to specify options for policy gradient (PG) agents. To create a PG
agent, use rlPGAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlPGAgentOptions
opt = rlPGAgentOptions(Name,Value)
Description
Properties
UseBaseline — Use baseline for learning
true (default) | false
Option to use baseline for learning, specified as a logical value. When UseBaseline is true, you
must specify a critic network as the baseline function approximator.
In general, for simpler problems with smaller actor networks, PG agents work better without a
baseline.
Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.
When gradients are computed during training, an additional gradient component is computed for
minimizing this loss function.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlPGAgent Policy gradient reinforcement learning agent
Examples
This example shows how to create and modify a PG agent options object.
opt = rlPGAgentOptions('DiscountFactor',0.9)
opt =
rlPGAgentOptions with properties:
UseBaseline: 1
EntropyLossWeight: 0
ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 1
DiscountFactor: 0.9000
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Version History
Introduced in R2019a
The UseDeterministicExploitation option is no longer recommended. Use the UseExplorationPolicy
property of the agent instead.
• To force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment, previously you would set:
agent.AgentOptions.UseDeterministicExploitation = true;
Now, set:
agent.UseExplorationPolicy = false;
• To allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space, previously
you would set:
agent.AgentOptions.UseDeterministicExploitation = false;
Now, set:
agent.UseExplorationPolicy = true;
See Also
Topics
“Policy Gradient Agents”
rlPPOAgent
Proximal policy optimization reinforcement learning agent
Description
Proximal policy optimization (PPO) is a model-free, online, on-policy, policy gradient reinforcement
learning method. This algorithm alternates between sampling data through environmental interaction
and optimizing a clipped surrogate objective function using stochastic gradient descent. The action
space can be either discrete or continuous.
For more information on PPO agents, see “Proximal Policy Optimization Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.
Creation
Syntax
agent = rlPPOAgent(observationInfo,actionInfo)
agent = rlPPOAgent(observationInfo,actionInfo,initOpts)
agent = rlPPOAgent(actor,critic)
Description
agent = rlPPOAgent(actor,critic) creates a PPO agent with the specified actor and critic,
using the default options for the agent.
agent = rlPPOAgent( ___ ,agentOptions) creates a PPO agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
Input Arguments
actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object
critic — Critic
rlValueFunction object
Critic that estimates the discounted long-term reward, specified as an rlValueFunction object. For
more information on creating critic approximators, see “Create Policies and Value Functions”.
Your critic can use a recurrent neural network as its function approximator. In this case, your actor
must also use a recurrent neural network. For an example, see “Create PPO Agent with Recurrent
Neural Networks” on page 3-240.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.
For a continuous action space, you must specify actionInfo as an rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.
Note This option affects only simulation and deployment; it does not affect training.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
Examples
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a PPO agent from the environment observation and action specifications.
agent = rlPPOAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a PPO actor-critic agent from the environment observation and action specifications.
agent = rlPPOAgent(obsInfo,actInfo,initOpts);
Extract the deep neural networks from both the agent actor and critic.
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons
criticNet.Layers
ans =
11x1 Layer array with layers:
plot(layerGraph(actorNet))
plot(layerGraph(criticNet))
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment interface, and obtain its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For PPO agents, the critic estimates a value function, therefore it must take the observation signal as
input and return a scalar value. Create a deep neural network to be used as approximation model
within the critic. Define the network as an array of layer objects.
criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)
];
Convert the network to a dlnetwork object and display a summary.
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'input' 4 features
Create the critic using criticNet. PPO agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(criticNet,obsInfo);
To check the critic, return the value of a random observation.
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.2479
To approximate the policy within the actor, use a neural network. For PPO agents, the actor executes
a stochastic policy, which for discrete action spaces is implemented by a discrete categorical actor. In
this case the approximator must take the observation signal as input and return a probability for each
action. Therefore the output layer must have as many elements as the number of possible actions.
Define the network as an array of layer objects, getting the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(200)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];
Convert the network to a dlnetwork object and display a summary.
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor using actorNet. PPO agents use an rlDiscreteCategoricalActor object to
implement the actor for discrete action spaces.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
agent = rlPPOAgent(actor,critic)
agent =
rlPPOAgent with properties:
Specify agent options, including training options for the actor and the critic.
agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent against the environment.
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”. The observation from the
environment is a vector containing the position and velocity of a mass. The action is a scalar
representing a force, applied to the mass, ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Set these limits in the action specification so you can easily access them later.
actInfo.LowerLimit=-2;
actInfo.UpperLimit=2;
The actor and critic networks are initialized randomly. Ensure reproducibility by fixing the seed of the
random generator.
rng(0)
For PPO agents, the critic estimates a value function, therefore it must take the observation signal as
input and return a scalar value. To approximate the value function within the critic, use a neural
network. Define the network as an array of layer objects.
criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object and display a summary.
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'input' 2 features
Create the critic using criticNet. PPO agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(criticNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.0899
To approximate the policy within the actor, use a neural network. For PPO agents, the actor executes
a stochastic policy, which for continuous action spaces is implemented by a continuous Gaussian
actor. In this case the network must take the observation signal as input and return both a mean
value and a standard deviation value for each action. Therefore, it must have two output layers (one
for the mean values, the other for the standard deviation values), each having as many elements as
the dimension of the action space.
Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces, and the action range limits from the environment specification objects. Specify a name
for the input and output layers, so you can later explicitly associate them with the appropriate
environment channel.
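A minimal sketch of the three paths and their assembly into a layer graph follows. The layer sizes are
illustrative; the names "comPathIn", "comPathOut", "meanPathIn", and "stdPathIn" match the
connection and summary steps below, while the output names "meanPathOut" and "stdPathOut" are
assumed names reused when creating the actor.
% Common input path
commonPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="comPathIn")
fullyConnectedLayer(100)
reluLayer(Name="comPathOut") ];
% Mean value path (tanh plus scaling keeps the mean within the action range)
meanPath = [
fullyConnectedLayer(32,Name="meanPathIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
tanhLayer
scalingLayer(Name="meanPathOut",Scale=actInfo.UpperLimit) ];
% Standard deviation path (softplus output keeps the values nonnegative)
stdPath = [
fullyConnectedLayer(32,Name="stdPathIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
softplusLayer(Name="stdPathOut") ];
% Assemble the paths into a layer graph
actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);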
% Connect paths
actorNet = connectLayers(actorNet,"comPathOut","meanPathIn/in");
actorNet = connectLayers(actorNet,"comPathOut","stdPathIn/in");
% Plot network
plot(actorNet)
Convert the network to a dlnetwork object and display a summary.
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'comPathIn' 2 features
Create the actor using actorNet. PPO agents use an rlContinuousGaussianActor object to
implement the actor for continuous action spaces. Specify the names of the network input and output
layers (the output layer names here follow the sketch above).
actor = rlContinuousGaussianActor(actorNet,obsInfo,actInfo, ...
ObservationInputNames="comPathIn", ...
ActionMeanOutputNames="meanPathOut", ...
ActionStandardDeviationOutputNames="stdPathOut");
getAction(actor,{rand(obsInfo.Dimension)})
agent = rlPPOAgent(actor,critic)
agent =
rlPPOAgent with properties:
Specify agent options, including training options for the actor and the critic.
agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
For this example, load the predefined environment used for the “Train DQN Agent to Balance Cart-
Pole System” example.
env = rlPredefinedEnv("CartPole-Discrete");
Get observation and action information. This environment has a continuous four-dimensional
observation space (the positions and velocities of both the cart and the pole) and a discrete one-dimensional
action space consisting of the application of two possible forces, -10 N or 10 N.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For PPO agents, the critic estimates a value function, therefore it must take the observation signal as
input and return a scalar value. To approximate the value function within the critic, use a neural
network.
Define the network as an array of layer objects, and get the dimensions of the observation space from
the environment specification object. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.
criticNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(8)
reluLayer
lstmLayer(8)
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object and display a summary.
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
Create the critic using criticNet. PPO agents use an rlValueFunction object to implement
the critic.
critic = rlValueFunction(criticNet,obsInfo);
To check the critic, return the value of a random observation.
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
0.0017
Since the critic has a recurrent network, the actor must have a recurrent network too. For PPO
agents, the actor executes a stochastic policy, which for discrete action spaces is implemented by a
discrete categorical actor. In this case the network must take the observation signal as input and
return a probability for each action. Therefore the output layer must have as many elements as the
number of possible actions.
Define the network as an array of layer objects, and get the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
lstmLayer(8)
fullyConnectedLayer(numel(actInfo.Elements))
softmaxLayer
];
Convert the network to a dlnetwork object and display the number of learnable parameters.
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Number of learnables: 4k
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
Create the actor using actorNet. PPO agents use an rlDiscreteCategoricalActor object
to implement the actor for discrete action spaces.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
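Specify training options for the actor and the critic. The optimizer option objects referenced below
are rlOptimizerOptions objects; the learning rate and gradient threshold values shown here are
illustrative.
actorOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
criticOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);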
agentOptions = rlPPOAgentOptions(...
AdvantageEstimateMethod="finite-horizon", ...
ClipFactor=0.1, ...
CriticOptimizerOptions=criticOptions, ...
ActorOptimizerOptions=actorOptions);
When recurrent neural networks are used, the MiniBatchSize property is the length of the learning
trajectory.
agentOptions.MiniBatchSize
ans = 128
Create the agent using the actor and critic, as well as the agent options object.
agent = rlPPOAgent(actor,critic,agentOptions);
getAction(agent,{rand(obsInfo.Dimension)})
Tips
• For continuous action spaces, this agent does not enforce the constraints set by the action
specification. In this case, you must enforce action space constraints within the environment.
• While tuning the learning rate of the actor network is necessary for PPO agents, it is not
necessary for TRPO agents.
Version History
Introduced in R2019b
See Also
rlAgentInitializationOptions | rlPPOAgentOptions | rlValueFunction |
rlDiscreteCategoricalActor | rlContinuousGaussianActor | Deep Network Designer
Topics
“Proximal Policy Optimization Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlPPOAgentOptions
Options for PPO agent
Description
Use an rlPPOAgentOptions object to specify options for proximal policy optimization (PPO) agents.
To create a PPO agent, use rlPPOAgent.
For more information on PPO agents, see “Proximal Policy Optimization Agents”.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlPPOAgentOptions
opt = rlPPOAgentOptions(Name,Value)
Description
Properties
ExperienceHorizon — Number of steps the agent interacts with the environment before
learning
512 (default) | positive integer
Number of steps the agent interacts with the environment before learning from its experience,
specified as a positive integer.
The ExperienceHorizon value must be greater than or equal to the MiniBatchSize value.
Mini-batch size used for each learning epoch, specified as a positive integer. When the agent uses a
recurrent neural network, MiniBatchSize is treated as the training trajectory length.
The MiniBatchSize value must be less than or equal to the ExperienceHorizon value.
Clip factor for limiting the change in each policy update step, specified as a positive scalar less than
1.
Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.
When gradients are computed during training, an additional gradient component is computed for
minimizing this loss function. For more information, see “Entropy Loss”.
Number of epochs for which the actor and critic networks learn from the current experience set,
specified as a positive integer.
For more information on these methods, see the training algorithm information in “Proximal Policy
Optimization Agents”.
Smoothing factor for generalized advantage estimator, specified as a scalar value between 0 and 1,
inclusive. This option applies only when the AdvantageEstimateMethod option is "gae".
Method for normalizing advantage function values, specified as one of the following:
In some environments, you can improve agent performance by normalizing the advantage function
during training. The agent normalizes the advantage function by subtracting the mean advantage
value and scaling by the standard deviation.
Window size for normalizing advantage function values, specified as a positive integer. Use this
option when the NormalizedAdvantageMethod option is "moving".
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlPPOAgent Proximal policy optimization reinforcement learning agent
Examples
This example shows how to create and modify a PPO agent options object.
opt = rlPPOAgentOptions('ExperienceHorizon',256)
opt =
rlPPOAgentOptions with properties:
ExperienceHorizon: 256
MiniBatchSize: 128
ClipFactor: 0.2000
EntropyLossWeight: 0.0100
NumEpoch: 3
AdvantageEstimateMethod: "gae"
GAEFactor: 0.9500
NormalizedAdvantageMethod: "none"
AdvantageNormalizingWindow: 1000000
ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 1
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Version History
Introduced in R2019b
The UseDeterministicExploitation option is no longer recommended. Use the UseExplorationPolicy
property of the agent instead.
• To force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment, previously you would set:
agent.AgentOptions.UseDeterministicExploitation = true;
Now, set:
agent.UseExplorationPolicy = false;
• To allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space, previously
you would set:
agent.AgentOptions.UseDeterministicExploitation = false;
Now, set:
agent.UseExplorationPolicy = true;
See Also
Topics
“Proximal Policy Optimization Agents”
rlPrioritizedReplayMemory
Replay memory experience buffer with prioritized sampling
Description
An off-policy reinforcement learning agent stores experiences in a circular experience buffer. During
training, the agent samples mini-batches of experiences from the buffer and uses these mini-batches
to update its actor and critic function approximators.
By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object
as their experience buffer. Agents uniformly sample data from this buffer. To perform nonuniform
prioritized sampling [1], which can improve sample efficiency when training your agent, use an
rlPrioritizedReplayMemory object. For more information on prioritized sampling, see
“Algorithms” on page 3-251.
Creation
Syntax
buffer = rlPrioritizedReplayMemory(obsInfo,actInfo)
buffer = rlPrioritizedReplayMemory(obsInfo,actInfo,maxLength)
Description
Input Arguments
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Properties
MaxLength — Maximum buffer length
10000 (default) | positive integer
Priority exponent to control the impact of prioritization during probability computation, specified as a
nonnegative scalar less than or equal to 1.
Initial value of the importance sampling exponent, specified as a nonnegative scalar less than or
equal to 1.
Number of annealing steps for updating the importance sampling exponent, specified as a positive
integer.
Current value of the importance sampling exponent, specified as a nonnegative scalar less than or
equal to 1.
Object Functions
append Append experiences to replay memory buffer
Examples
Create an environment for training the agent. For this example, load a predefined environment.
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
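The example needs an agent with an experience buffer. A minimal sketch, assuming a default DQN
agent created from the environment specifications (any built-in off-policy agent would work):
% Extract specifications and create a default DQN agent
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
agent = rlDQNAgent(obsInfo,actInfo);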
By default, the agent uses a replay memory experience buffer with uniform sampling.
Replace the default experience buffer with a prioritized replay memory buffer.
agent.ExperienceBuffer = rlPrioritizedReplayMemory(obsInfo,actInfo);
Configure the prioritized replay memory options. For example, set the priority exponent to 0.5, the
initial importance sampling exponent to 0.5, and the number of annealing steps for updating the
exponent during training to 1e4.
agent.ExperienceBuffer.NumAnnealingSteps = 1e4;
agent.ExperienceBuffer.PriorityExponent = 0.5;
agent.ExperienceBuffer.InitialImportanceSamplingExponent = 0.5;
Limitations
• Prioritized experience replay does not support agents that use recurrent neural networks.
Algorithms
Prioritized replay memory samples experiences according to experience priorities. For a given
experience, the priority is defined as the absolute value of the associated temporal difference (TD)
error. A larger TD error indicates that the critic network is not well-trained for the corresponding
experience. Therefore, sampling such experiences during critic updates can help efficiently improve
the critic performance, which often improves the sample efficiency of agent training.
When using prioritized replay memory, agents use the following process when sampling a mini-batch
of experiences and updating a critic.
1 Compute the sampling probability P(j) of each experience in the buffer from its priority p_j:
P(j) = p_j^α / ∑_{i=1}^{N} p_i^α
Here, N is the number of experiences in the buffer and α is the priority exponent (PriorityExponent).
2 Sample a mini-batch of experiences according to these probabilities and compute the normalized
importance sampling weights, where β is the importance sampling exponent:
w′(j) = (N ⋅ P(j))^(−β)
w(j) = w′(j) / max_{i ∈ mini-batch} w′(i)
3 Weight the critic loss of each sampled experience by w(j), update the critic, and set the priority of
each sampled experience to the absolute value of its TD error:
p_j = |δ_j|
4 Update the importance sampling exponent β by linearly annealing the exponent value until it
reaches 1:
β ← β + (1 − β_0)/N_S
Here, β_0 is the initial importance sampling exponent (InitialImportanceSamplingExponent) and N_S
is the number of annealing steps (NumAnnealingSteps).
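As a concrete sketch of the probability and weight computation (all values illustrative), in MATLAB:
% Hypothetical priorities (absolute TD errors) for five buffered experiences
p = [0.5 2.0 0.1 1.2 0.7];
priorityExponent = 0.6;                              % alpha
P = p.^priorityExponent / sum(p.^priorityExponent);  % sampling probabilities, sum(P) == 1
isExponent = 0.4;                                    % beta
N = numel(p);
w = (N*P).^(-isExponent);                            % raw importance sampling weights
w = w / max(w);                                      % normalized importance sampling weights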
Version History
Introduced in R2022b
References
[1] Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. "Prioritized Experience Replay."
arXiv:1511.05952 [cs], 25 February 2016. https://arxiv.org/abs/1511.05952.
See Also
rlReplayMemory
rlQAgent
Q-learning reinforcement learning agent
Description
The Q-learning algorithm is a model-free, online, off-policy reinforcement learning method. A Q-
learning agent is a value-based reinforcement learning agent which trains a critic to estimate the
return or future rewards.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
agent = rlQAgent(critic,agentOptions)
Description
Input Arguments
critic — Critic
rlQValueFunction object
Critic, specified as an rlQValueFunction object. For more information on creating critics, see
“Create Policies and Value Functions”.
Properties
AgentOptions — Agent options
rlQAgentOptions object
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create an environment interface. For this example, use the same environment as in the example
“Train Reinforcement Learning Agent in Basic Grid World”.
env = rlPredefinedEnv("BasicGridWorld");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a table approximation model derived from the environment observation and action
specifications.
qTable = rlTable(obsInfo,actInfo);
Create the critic using qTable. Q-learning agents use an rlQValueFunction object to implement the critic.
critic = rlQValueFunction(qTable,obsInfo,actInfo);
Create a Q-learning agent using the specified critic and an epsilon value of 0.05.
opt = rlQAgentOptions;
opt.EpsilonGreedyExploration.Epsilon = 0.05;
agent = rlQAgent(critic,opt)
agent =
rlQAgent with properties:
To check your agent, use getAction to return the action from a random observation.
act = getAction(agent,{randi(numel(obsInfo.Elements))});
act{1}
ans = 1
You can now test and train the agent against the environment.
Version History
Introduced in R2019a
See Also
Functions
rlQAgentOptions | rlQValueFunction
Topics
“Q-Learning Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlQAgentOptions
Options for Q-learning agent
Description
Use an rlQAgentOptions object to specify options for creating Q-learning agents. To create a Q-
learning agent, use rlQAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlQAgentOptions
opt = rlQAgentOptions(Name,Value)
Description
Properties
EpsilonGreedyExploration — Options for epsilon-greedy exploration
EpsilonGreedyExploration object
At the end of each training time step, if Epsilon is greater than EpsilonMin, then it is updated
using the following formula.
Epsilon = Epsilon*(1-EpsilonDecay)
If your agent converges on local optima too quickly, you can promote agent exploration by increasing
Epsilon.
To specify exploration options, use dot notation after creating the rlQAgentOptions object opt. For
example, set the epsilon value to 0.9.
opt.EpsilonGreedyExploration.Epsilon = 0.9;
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlQAgent Q-learning reinforcement learning agent
Examples
This example shows how to create an options object for a Q-Learning agent.
opt = rlQAgentOptions('SampleTime',0.5)
opt =
rlQAgentOptions with properties:
You can modify options using dot notation. For example, set the agent discount factor to 0.95.
opt.DiscountFactor = 0.95;
Version History
Introduced in R2019a
See Also
Topics
“Q-Learning Agents”
rlQValueFunction
Q-Value function approximator object for reinforcement learning agents
Description
This object implements a Q-value function approximator that you can use as a critic for a
reinforcement learning agent. A Q-value function maps an environment state-action pair to a scalar
value representing the predicted discounted cumulative long-term reward when the agent starts from
the given state and executes the given action. A Q-value function critic therefore needs both the
environment state and an action as inputs. After you create an rlQValueFunction critic, use it to
create an agent such as rlQAgent, rlDQNAgent, rlSARSAAgent, rlDDPGAgent, or rlTD3Agent.
For more information on creating representations, see “Create Policies and Value Functions”.
Creation
Syntax
critic = rlQValueFunction(net,observationInfo,actionInfo)
critic = rlQValueFunction(tab,observationInfo,actionInfo)
critic = rlQValueFunction({basisFcn,W0},observationInfo,actionInfo)
Description
Input Arguments
Deep neural network used as the underlying approximator within the critic, specified as one of the
following:
The network must have both the environment observation and action as inputs and a single scalar as
output.
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.
The learnable parameters of the critic are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Q-value table, specified as an rlTable object containing an array with as many rows as the possible
observations and as many columns as the possible actions. The element (s,a) is the expected
cumulative long-term reward for taking action a from observed state s. The elements of this array are
the learnable parameters of the critic.
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the critic is the scalar c = W'*B, where W is a weight vector containing the learnable parameters,
and B is the column vector returned by the custom basis function.
Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo and act is an input with the same
data type and dimension as the environment action channel defined in actionInfo.
For an example on how to use a basis function to create a Q-value function critic with a mixed
continuous and discrete observation space, see “Create Mixed Observation Space Q-Value Function
Critic from Custom Basis Function” on page 3-273.
Example: @(obs1,obs2,act) [act(2)*obs1(1)^2; abs(obs2(5)+act(1))]
Initial value of the basis function weights W, specified as a column vector having the same length as
the vector returned by the basis function.
Name-Value Pair Arguments
Network input layer name corresponding to the environment action channel, specified as a string
array or a cell array of character vectors. The function assigns the environment action channel
specified in actionInfo to the specified network input layer. Therefore, the specified network input
layer must have the same data type and dimensions as defined in actionInfo.
Note The function does not use the name or the description (if any) of the action channel specified in
actionInfo.
This name-value argument is supported only when the approximation model is a deep neural network.
Example: ActionInputNames="myNetOutput_Force"
Network input layer names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. The function assigns, in sequential order, each
environment observation channel defined in observationInfo to each specified network input layer.
Therefore, each specified network input layer must have the same data type and dimensions as the
corresponding channel defined in observationInfo.
Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
This name-value argument is supported only when the approximation model is a deep neural network.
Example: ObservationInputNames={"NetInput1_airspeed","NetInput2_altitude"}
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations on a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning
agent
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlQAgent Q-learning reinforcement learning agent
rlSARSAAgent SARSA reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and
actions
getMaxQValue Obtain maximum estimated value over all possible actions from a Q-value
function critic with discrete action space, given environment observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
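A minimal way to create these two specifications, assuming column-vector channels:
% Continuous 4-D observation space and continuous 2-D action space
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);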
To approximate the Q-value function within the critic, use a deep neural network.
The network must have two inputs, one for the observation and one for the action. The observation
input must accept a four-element vector (the observation vector defined by obsInfo). The action
input must accept a two-element vector (the action vector defined by actInfo). The output of the
network must be a scalar, representing the expected cumulative long-term reward when the agent
starts from the given observation and takes the given action.
You can also obtain the number of observations from the obsInfo specification (regardless of
whether the observation space is a column vector, row vector, or matrix,
prod(obsInfo.Dimension) is its total number of dimensions, in this case four). Similarly,
prod(actInfo.Dimension) is the number of dimensions of the action space, in this case two.
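A minimal sketch of such a network follows, assuming an observation path, an action path, and a
common path joined by a concatenation layer named "cct". The layer sizes are illustrative; the
unnamed input layers receive the default names 'input' and 'input_1' shown in the summary below.
% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(16)
reluLayer(Name="obsout") ];
% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension))
fullyConnectedLayer(16)
reluLayer(Name="actout") ];
% Common path: concatenate the two paths and output a single value
comPath = [
concatenationLayer(1,2,Name="cct")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(1) ];
% Assemble the paths into a layer graph
net = layerGraph(obsPath);
net = addLayers(net,actPath);
net = addLayers(net,comPath);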
% Connect layers
net = connectLayers(net,"obsout","cct/in1");
net = connectLayers(net,"actout","cct/in2");
% Plot network
plot(net)
% Convert to dlnetwork and summarize properties
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'input' 4 features
2 'input_1' 2 features
Create the critic with rlQValueFunction, using the network as well as the observations and action
specification objects. When using this syntax, the network input layers are automatically associated
with the components of the observation and action signals according to the dimension specifications
in obsInfo and actInfo.
critic = rlQValueFunction(net,obsInfo,actInfo)
critic =
rlQValueFunction with properties:
To check your critic, use the getValue function to return the value of a random observation and
action, given the current network weights.
v = getValue(critic, ...
{rand(obsInfo.Dimension)}, ...
{rand(actInfo.Dimension)})
v = single
-1.1006
You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent).
Create Q-Value Function Critic from Deep Neural Network Specifying Layer Names
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
To approximate the Q-value function within the critic, use a deep neural network.
The network must have two inputs, one for the observation and one for the action. The observation
input (here called netObsInput) must accept a four-element vector (the observation vector defined
by obsInfo). The action input (here called netActInput) must accept a two-element vector (the
action vector defined by actInfo). The output of the network must be a scalar, representing the
expected cumulative long-term reward when the agent starts from the given observation and takes
the given action.
You can also obtain the number of observations from the obsInfo specification object (regardless of
whether the observation space is a column vector, row vector, or matrix,
prod(obsInfo.Dimension) is its number of dimensions, in this case four). Similarly,
prod(actInfo.Dimension) is the number of dimensions of the action space, in this case two.
To create the neural network paths, use vectors of layer objects. Name the network input layers for
the observation and action netObsInput and netActInput, respectively.
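A minimal sketch of these paths follows; the layer sizes are illustrative, and the paths are joined by a
concatenation layer named "cct" as in the connection step below.
% Observation path with a named input layer
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netObsInput")
fullyConnectedLayer(16)
reluLayer(Name="obsout") ];
% Action path with a named input layer
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="netActInput")
fullyConnectedLayer(16)
reluLayer(Name="actout") ];
% Common path: concatenate the two paths and output a single value
comPath = [
concatenationLayer(1,2,Name="cct")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(1) ];
% Assemble the paths into a layer graph
net = layerGraph(obsPath);
net = addLayers(net,actPath);
net = addLayers(net,comPath);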
% Connect layers
net = connectLayers(net,"obsout","cct/in1");
net = connectLayers(net,"actout","cct/in2");
% Plot network
plot(net)
net = dlnetwork(net);
% Summarize properties
summary(net);
Initialized: true
Inputs:
1 'netObsInput' 4 features
2 'netActInput' 2 features
Create the critic with rlQValueFunction, using the network, the observations and action
specification objects, and the names of the network input layers to be associated with the observation
and action from the environment.
critic = rlQValueFunction(net,...
obsInfo,actInfo, ...
ObservationInputNames="netObsInput",...
ActionInputNames="netActInput")
critic =
rlQValueFunction with properties:
To check your critic, use the getValue function to return the value of a random observation and
action, given the current network weights.
v = getValue(critic, ...
{rand(obsInfo.Dimension)}, ...
{rand(actInfo.Dimension)})
v = single
-1.1006
You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent).
Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set with four possible values.
obsInfo = rlFiniteSetSpec([7 5 3 1]);
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set with two possible values.
actInfo = rlFiniteSetSpec([4 8]);
Create a table to approximate the value function within the critic. rlTable creates a value table
object from the observation and action specifications objects.
qTable = rlTable(obsInfo,actInfo);
The table stores a value (representing the expected cumulative long term reward) for each possible
observation-action pair. Each row corresponds to an observation and each column corresponds to an
action. You can access the table using the Table property of the qTable object. The initial value of
each element is zero.
qTable.Table
ans = 4×2
0 0
0 0
0 0
0 0
You can initialize the table to any value, in this case an array containing the integers from 1 through 8.
qTable.Table=reshape(1:8,4,2)
qTable =
rlTable with properties:
Create the critic using the table as well as the observations and action specification objects.
critic = rlQValueFunction(qTable,obsInfo,actInfo)
critic =
rlQValueFunction with properties:
To check your critic, use the getValue function to return the value of a given observation and action,
using the current table entries.
v = getValue(critic,{5},{8})
v = 6
You can now use the critic (along with an actor) to create a discrete action space agent
relying on a Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations and
actions respectively defined by obsInfo and actInfo.
The output of the critic is the scalar W'*myBasisFcn(myobs,myact), where W is a weight column
vector which must have the same size of the custom basis function output. This output is the expected
cumulative long-term reward when the agent starts from the given observation and takes the given
action. The elements of W are the learnable parameters.
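For instance, a hypothetical four-element basis function consistent with this description (the specific
basis used to produce the value shown later is not reproduced here):
% Hypothetical basis function for illustration (returns a four-element column vector)
myBasisFcn = @(myobs,myact) [myobs(1)^2; myobs(2)+myobs(3); myact(1)*myobs(1); abs(myact(2))];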
W0 = [1;4;4;2];
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second and third arguments are, respectively, the
observation and action specification objects.
critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo)
critic =
rlQValueFunction with properties:
To check your critic, use getValue to return the value of a given observation-action pair, using the
current parameter vector.
v = 252.3926
You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent).
Create Mixed Observation Space Q-Value Function Critic from Custom Basis Function
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels: the first carries a vector over a continuous two-dimensional space, and
the second carries a two-element vector that can assume only four possible values.
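For example, you could define such a mixed observation space as follows; the four admissible values
of the discrete channel are placeholders chosen for illustration.
% First channel: continuous two-dimensional vector
% Second channel: two-element vector restricted to four values
obsInfo = [rlNumericSpec([2 1]) ...
           rlFiniteSetSpec({[1;0],[0;1],[-1;0],[0;-1]})];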
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a discrete set
consisting of three possible actions, 1, 2, and 3.
actInfo = rlFiniteSetSpec({1,2,3});
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations and
the action respectively defined by obsInfo and actInfo. Note that the selected action, as defined,
has only one element, while each observation channel has two elements.
The output of the critic is the scalar W'*myBasisFcn(obsA,obsB,act), where W is a weight column
vector that must have the same size as the custom basis function output. This output is the expected
cumulative long-term reward when the agent starts from the given observation and takes the action
specified as the last input. The elements of W are the learnable parameters.
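Purely as an illustration (the original function is not shown), a basis function with a four-element
output could be defined as follows.
% Hypothetical basis function: obsA and obsB are the two observation
% channels (two elements each), act is the scalar action
myBasisFcn = @(obsA,obsB,act) [ ...
    obsA(1) + act; ...
    obsB(2)^2; ...
    obsA(2)*obsB(1); ...
    abs(obsB(2) - act)];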
W0 = ones(4,1);
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second and third arguments are, respectively, the
observation and action specification objects.
critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo)
critic =
rlQValueFunction with properties:
To check your critic, use the getValue function to return the value of a given observation-action pair,
using the current parameter vector.
v = -0.9000
Note that the critic does not enforce the set constraint for the discrete set elements.
v = -21.0000
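In general, for a critic with two observation channels, getValue takes one cell element per
observation channel followed by the action, for example (placeholder values):
v = getValue(critic,{[0.5;0.3],[1;0]},{2});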
You can now use the critic (along with an actor) to create an agent with a discrete action
space relying on a Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).
Version History
Introduced in R2022a
See Also
Functions
rlValueFunction | rlVectorQValueFunction | rlTable | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlQValueRepresentation
(Not recommended) Q-Value function critic representation for reinforcement learning agents
Description
This object implements a Q-value function approximator to be used as a critic within a reinforcement
learning agent. A Q-value function maps an observation-action pair to a scalar value representing
the total long-term reward that the agent is expected to accumulate when it starts from the given
observation and executes the given action. Q-value function critics therefore
need both observations and actions as inputs. After you create an rlQValueRepresentation critic,
use it to create an agent relying on a Q-value function critic, such as an rlQAgent, rlDQNAgent,
rlSARSAAgent, rlDDPGAgent, or rlTD3Agent. For more information on creating representations,
see “Create Policies and Value Functions”.
Creation
Syntax
critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName,'Action',actName)
critic = rlQValueRepresentation(tab,observationInfo,actionInfo)
critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)
critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName)
critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)
Description
Scalar Output Q-Value Critic
critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName,'Action',actName) creates the Q-value function critic. net is the deep neural
network used as an approximator, and must have both observations and action as inputs, and a single
scalar output. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively
to the inputs observationInfo and actionInfo, containing the observations and action
specifications. obsName must contain the names of the input layers of net that are associated with
the observation specifications. The action name actName must be the name of the input layer of net
that is associated with the action specifications.
critic = rlQValueRepresentation(tab,observationInfo,actionInfo) creates the Q-value
function critic with discrete observation and action spaces from the Q-value table tab. tab is an
rlTable object containing a table with as many rows as the possible observations and as
many columns as the possible actions. This syntax sets the ObservationInfo and ActionInfo properties
of critic respectively to the inputs observationInfo and actionInfo, which must be
rlFiniteSetSpec objects containing the specifications for the discrete observations and action
spaces, respectively.
critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)
creates a Q-value function based critic using a custom basis function as underlying approximator.
The first input argument is a two-element cell array in which the first element contains the handle
basisFcn to a custom basis function, and the second element contains the initial weight vector W0.
Here the basis function must have both observations and action as inputs and W0 must be a column
vector. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively to the
inputs observationInfo and actionInfo.
Multi-Output Discrete Action Space Q-Value Critic
critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName) creates the multi-output Q-value function critic for a discrete action space. net is the
deep neural network used as an approximator, and must have only the observations as input and a
single output layer having as many elements as the number of possible discrete actions. This syntax
sets the ObservationInfo and ActionInfo properties of critic respectively to the inputs
observationInfo and actionInfo, containing the observations and action specifications. Here,
actionInfo must be an rlFiniteSetSpec object containing the specifications for the discrete
action space. The observation names obsName must be the names of the input layers of net.
critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)
creates the multi-output Q-value function critic for a discrete action space using a custom basis
function as underlying approximator. The first input argument is a two-element cell array in which the first
element contains the handle basisFcn to a custom basis function, and the second element contains
the initial weight matrix W0. Here the basis function must have only the observations as inputs, and
W0 must have as many columns as the number of possible actions. This syntax sets the
ObservationInfo and ActionInfo properties of critic respectively to the inputs observationInfo
and actionInfo.
Options
critic = rlQValueRepresentation( ___ ,options) creates the Q-value function critic
using the additional option set options, which is an rlRepresentationOptions object. This
syntax sets the Options property of critic to the options input argument. You can use this syntax
with any of the previous input-argument combinations.
Input Arguments
Deep neural network used as the underlying approximator within the critic, specified as one of the
following:
• SeriesNetwork object
• dlnetwork object
For single output critics, net must have both observations and actions as inputs, and a scalar output,
representing the expected cumulative long-term reward when the agent starts from the given
observation and takes the given action. For multi-output discrete action space critics, net must have
only the observations as input and a single output layer having as many elements as the number of
possible discrete actions. Each output element represents the expected cumulative long-term reward
when the agent starts from the given observation and takes the corresponding action. The learnable
parameters of the critic are the weights of the deep neural network.
The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names listed in obsName.
The network output layer must have the same data type and dimension as the signal defined in
ActionInfo. Its name must be the action name specified in actName.
For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the observation input layers in net.
Example: {'my_obs'}
Action name, specified as a single-element cell array that contains a string or character vector. It
must be the name of the action input layer of net.
Example: {'my_act'}
Q-value table, specified as an rlTable object containing an array with as many rows as the possible
observations and as many columns as the possible actions. The element (s,a) is the expected
cumulative long-term reward for taking action a from observed state s. The elements of this array are
the learnable parameters of the critic.
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the critic is c = W'*B, where W is a weight vector or matrix containing the learnable parameters,
and B is the column vector returned by the custom basis function.
For a single-output Q-value critic, c is a scalar representing the expected cumulative long term
reward when the agent starts from the given observation and takes the given action. In this case,
your basis function must have the following signature.
B = myBasisFunction(obs1,obs2,...,obsN,act)
For a multiple-output Q-value critic with a discrete action space, c is a vector in which each element
is the expected cumulative long term reward when the agent starts from the given observation and
takes the action corresponding to the position of the considered element. In this case, your basis
function must have the following signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are observations in the same order and with the same data type and dimensions
as the signals defined in observationInfo and act has the same data type and dimensions as the
action specifications in actionInfo.
Example: @(obs1,obs2,act) [act(2)*obs1(1)^2; abs(obs2(5)+act(1))]
Initial value of the basis function weights, W. For a single-output Q-value critic, W is a column vector
having the same length as the vector returned by the basis function. For a multiple-output Q-value
critic with a discrete action space, W is a matrix which must have as many rows as the length of the
basis function output, and as many columns as the number of possible actions.
Properties
Options — Representation options
rlRepresentationOptions object
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.
Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlQAgent Q-learning reinforcement learning agent
rlSARSAAgent SARSA reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and actions
getMaxQValue Obtain maximum estimated value over all possible actions from a Q-value function
critic with discrete action space, given environment observations
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
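For instance, based on the dimensions above:
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);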
Create a deep neural network to approximate the Q-value function. The network must have two
inputs, one for the observation and one for the action. The observation input (here called myobs)
must accept a four-element vector (the observation vector defined by obsInfo). The action input
(here called myact) must accept a two-element vector (the action vector defined by actInfo). The
output of the network must be a scalar, representing the expected cumulative long-term reward when
the agent starts from the given observation and takes the given action.
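A sketch of the three paths referenced by the connectLayers calls below might look as follows; the
input layer names ('myobs' and 'myact') come from the description above, while the intermediate
layer sizes are illustrative.
% Observation path (input: four-element observation)
obsPath = [featureInputLayer(4,'Normalization','none','Name','myobs')
           fullyConnectedLayer(16,'Name','obsout')];
% Action path (input: two-element action)
actPath = [featureInputLayer(2,'Normalization','none','Name','myact')
           fullyConnectedLayer(16,'Name','actout')];
% Common path: add the two paths and output a scalar value
comPath = [additionLayer(2,'Name','add')
           fullyConnectedLayer(1,'Name','output')];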
net = addLayers(layerGraph(obsPath),actPath);
net = addLayers(net,comPath);
% connect layers
net = connectLayers(net,'obsout','add/in1');
net = connectLayers(net,'actout','add/in2');
Create the critic with rlQValueRepresentation, using the network, the observations and action
specification objects, as well as the names of the network input layers.
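With the input layer names used above ('myobs' and 'myact'), the call takes this form:
critic = rlQValueRepresentation(net,obsInfo,actInfo,...
    'Observation',{'myobs'},'Action',{'myact'})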
critic =
rlQValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a random observation and
action, using the current network weights.
v = getValue(critic,{rand(4,1)},{rand(2,1)})
v = single
0.1102
You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).
This example shows how to create a multi-output Q-value function critic for a discrete action space
using a deep neural network approximator.
This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).
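For instance:
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);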
Create a deep neural network approximator to approximate the Q-value function within the critic.
The input of the network (here called myobs) must accept a four-element vector, as defined by
obsInfo. The output must be a single output layer having as many elements as the number of
possible discrete actions (three in this case, as defined by actInfo).
net = [featureInputLayer(4,...
'Normalization','none','Name','myobs')
fullyConnectedLayer(3,'Name','value')];
Create the critic using the network, the observations specification object, and the name of the
network input layer.
critic = rlQValueRepresentation(net,obsInfo,actInfo,...
'Observation',{'myobs'})
critic =
rlQValueRepresentation with properties:
To check your critic, use the getValue function to return the values of a random observation, using
the current network weights. There is one value for each of the three possible actions.
v = getValue(critic,{rand(4,1)})
0.7232
0.8177
-0.2212
You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).
Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set with 4 possible values.
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example define the
action space as a finite set with 2 possible values.
Create a table to approximate the value function within the critic. rlTable creates a value table
object from the observation and action specifications objects.
qTable = rlTable(obsInfo,actInfo);
The table stores a value (representing the expected cumulative long term reward) for each possible
observation-action pair. Each row corresponds to an observation and each column corresponds to an
action. You can access the table using the Table property of the qTable object. The initial value of
each element is zero.
qTable.Table
ans = 4×2
0 0
0 0
0 0
0 0
You can initialize the table to any value, in this case, an array containing the integers from 1 through
8.
qTable.Table=reshape(1:8,4,2)
qTable =
rlTable with properties:
Create the critic using the table as well as the observations and action specification objects.
critic = rlQValueRepresentation(qTable,obsInfo,actInfo)
critic =
rlQValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a given observation and action,
using the current table entries.
v = getValue(critic,{5},{8})
v = 6
You can now use the critic (along with an actor) to create a discrete action space agent
relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing 2 doubles.
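For instance, matching the dimensions above:
obsInfo = rlNumericSpec([3 1]);
actInfo = rlNumericSpec([2 1]);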
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations and
actions respectively defined by obsInfo and actInfo.
The output of the critic is the scalar W'*myBasisFcn(myobs,myact), where W is a weight column
vector that must have the same size as the custom basis function output. This output is the expected
cumulative long-term reward when the agent starts from the given observation and takes the given
action. The elements of W are the learnable parameters.
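As an illustration only, a basis function with a four-element output consistent with the weight vector
W0 below could be:
% Hypothetical basis function of the observation and the action
myBasisFcn = @(myobs,myact) [ ...
    myobs(1)^2 + myact(1); ...
    myobs(2) + myact(2); ...
    myobs(3)*myact(1); ...
    abs(myobs(2)*myact(2))];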
W0 = [1;4;4;2];
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second and third arguments are, respectively, the
observation and action specification objects.
critic = rlQValueRepresentation({myBasisFcn,W0},...
obsInfo,actInfo)
critic =
rlQValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a given observation-action pair,
using the current parameter vector.
v =
1×1 dlarray
252.3926
You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).
This example shows how to create a multi-output Q-value function critic for a discrete action space
using a custom basis function approximator.
This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous two-dimensional space, so that a single observation is a column vector containing two
doubles.
obsInfo = rlNumericSpec([2 1]);
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of 3 possible values (named 7, 5, and 3 in this case).
actInfo = rlFiniteSetSpec([7 5 3]);
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.
myBasisFcn = @(myobs) [myobs(2)^2; ...
myobs(1); ...
exp(myobs(2)); ...
abs(myobs(1))]
The output of the critic is the vector c = W'*myBasisFcn(myobs), where W is a weight matrix
which must have as many rows as the length of the basis function output, and as many columns as the
number of possible actions.
Each element of c is the expected cumulative long term reward when the agent starts from the given
observation and takes the action corresponding to the position of the considered element. The
elements of W are the learnable parameters.
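Define an initial parameter matrix of the required size (four rows, one column per action); random
values are used here as a placeholder.
W0 = rand(4,3);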
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.
critic = rlQValueRepresentation({myBasisFcn,W0},...
obsInfo,actInfo)
critic =
rlQValueRepresentation with properties:
To check your critic, use the getValue function to return the values of a random observation, using
the current parameter matrix. Note that there is one value for each of the three possible actions.
v = getValue(critic,{rand(2,1)})
v =
3x1 dlarray
2.1395
1.2183
2.3342
You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);
Create a recurrent deep neural network for your critic. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.
criticNetwork = [
sequenceInputLayer(numObs,...
'Normalization','none','Name','state')
fullyConnectedLayer(50, 'Name', 'CriticStateFC1')
reluLayer('Name','CriticRelu1')
lstmLayer(20,'OutputMode','sequence',...
'Name','CriticLSTM');
fullyConnectedLayer(20,'Name','CriticStateFC2')
reluLayer('Name','CriticRelu2')
fullyConnectedLayer(numDiscreteAct,...
'Name','output')];
Create a representation for your critic using the recurrent neural network.
criticOptions = rlRepresentationOptions(...
'LearnRate',1e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,...
obsInfo,actInfo,...
'Observation','state',criticOptions);
Version History
Introduced in R2020a
The following table shows some typical uses of rlQValueRepresentation to create neural
network-based critics, and how to update your code with one of the new Q-value approximator
objects instead.
The following table shows some typical uses of rlQValueRepresentation to create table-based
critics with discrete observation and action spaces, and how to update your code with one of the new
Q-value approximator objects instead.
The following table shows some typical uses of rlQValueRepresentation to create critics which
use a (linear in the learnable parameters) custom basis function, and how to update your code with
one of the new Q-value approximator objects instead. In these function calls, the first input argument
is a two-element cell array containing both the handle to the custom basis function and the initial
weight vector or matrix.
See Also
Functions
rlQValueFunction | rlRepresentationOptions | getActionInfo | getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlReplayMemory
Replay memory experience buffer
Description
An off-policy reinforcement learning agent stores experiences in a circular experience buffer. During
training, the agent samples mini-batches of experiences from the buffer and uses these mini-batches
to update its actor and critic function approximators.
By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object
as their experience buffer. Agents uniformly sample data from this buffer. To perform nonuniform
prioritized sampling, use an rlPrioritizedReplayMemory object.
When you create a custom off-policy reinforcement learning agent, you can create an experience
buffer by using an rlReplayMemory object.
Creation
Syntax
buffer = rlReplayMemory(obsInfo,actInfo)
buffer = rlReplayMemory(obsInfo,actInfo,maxLength)
Description
Input Arguments
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Properties
MaxLength — Maximum buffer length
10000 (default) | positive integer
Object Functions
append Append experiences to replay memory buffer
sample Sample experiences from replay memory buffer
resize Resize replay memory experience buffer
allExperiences Return all experiences in replay memory buffer
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
Examples
Define observation specifications for the environment. For this example, assume that the environment
has a single observation channel with three continuous signals in specified ranges.
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with two continuous signals in specified ranges.
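For example (the specific limits here are assumptions consistent with the experiences created below):
obsInfo = rlNumericSpec([3 1],...
    LowerLimit=0,...
    UpperLimit=[1;5;10]);
actInfo = rlNumericSpec([2 1],...
    LowerLimit=0,...
    UpperLimit=[5;10]);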
buffer = rlReplayMemory(obsInfo,actInfo,20000);
Append a single experience to the buffer using a structure. Each experience contains the following
elements: current observation, action, next observation, reward, and is-done.
For this example, create an experience with random observation, action, and reward values. Indicate
that this experience is not a terminal condition by setting the IsDone value to 0.
exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;
append(buffer,exp);
You can also append a batch of experiences to the experience buffer using a structure array. For this
example, append a sequence of 100 random experiences, with the final experience representing a
terminal condition.
for i = 1:100
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;
append(buffer,expBatch);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.
miniBatch = sample(buffer,50);
You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive
experiences with a discount factor of 0.95.
horizonSample = sample(buffer,1,...
NStepHorizon=10,...
DiscountFactor=0.95);
In the returned horizon sample:
• Observation and Action are the observation and action from the first experience in the
horizon.
• NextObservation and IsDone are the next observation and termination signal from the final
experience in the horizon.
• Reward is the cumulative reward across the horizon using the specified discount factor.
You can also sample a sequence of consecutive experiences. In this case, the structure fields contain
arrays with values for all sampled experiences.
sequenceSample = sample(buffer,1,...
SequenceLength=20);
Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.
obsContinuous = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[1;5]);
obsDiscrete = rlFiniteSetSpec([1 2 3]);
obsInfo = [obsContinuous obsDiscrete];
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.
actInfo = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[5;10]);
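Create an experience buffer and a sample experience to append; the experience values here are
placeholders consistent with the specifications above.
% Create a replay memory buffer with the default maximum length
buffer = rlReplayMemory(obsInfo,actInfo);
% Experience with one element per observation channel
exp.Observation = {rand(2,1), 2};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {rand(2,1), 3};
exp.Reward = 10*rand(1);
exp.IsDone = 0;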
append(buffer,exp);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.
miniBatch = sample(buffer,10);
Create an environment for training the agent. For this example, load a predefined environment.
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
agent = rlDQNAgent(obsInfo,actInfo);
By default, the agent uses an experience buffer with a maximum size of 10,000.
agent.ExperienceBuffer
ans =
rlReplayMemory with properties:
MaxLength: 10000
Length: 0
Resize the experience buffer to increase its maximum length to 20,000 experiences.
resize(agent.ExperienceBuffer,20000)
agent.ExperienceBuffer
ans =
rlReplayMemory with properties:
MaxLength: 20000
Length: 0
Version History
Introduced in R2022a
See Also
rlPrioritizedReplayMemory
rlRepresentationOptions
(Not recommended) Options set for reinforcement learning agent representations (critics and actors)
Description
Use an rlRepresentationOptions object to specify an options set for critics
(rlValueRepresentation, rlQValueRepresentation) and actors
(rlDeterministicActorRepresentation, rlStochasticActorRepresentation).
Creation
Syntax
repOpts = rlRepresentationOptions
repOpts = rlRepresentationOptions(Name,Value)
Description
Properties
LearnRate — Learning rate for the representation
0.01 (default) | positive scalar
Learning rate for the representation, specified as a positive scalar. If the learning rate is too low, then
training takes a long time. If the learning rate is too high, then training might reach a suboptimal
result or diverge.
Example: 'LearnRate',0.025
Optimizer for training the network of the representation, specified as one of the following values.
• "adam" — Use the Adam optimizer. You can specify the decay rates of the gradient and squared
gradient moving averages using the GradientDecayFactor and
SquaredGradientDecayFactor fields of the OptimizerParameters option.
• "sgdm" — Use the stochastic gradient descent with momentum (SGDM) optimizer. You can specify
the momentum value using the Momentum field of the OptimizerParameters option.
• "rmsprop" — Use the RMSProp optimizer. You can specify the decay rate of the squared gradient
moving average using the SquaredGradientDecayFactor fields of the OptimizerParameters
option.
For more information about these optimizers, see “Stochastic Gradient Descent” in the Algorithms
section of trainingOptions in Deep Learning Toolbox.
Example: 'Optimizer',"sgdm"
Applicable parameters for the optimizer, specified as an OptimizerParameters object with the
following parameters.
Parameter Description
Momentum Contribution of previous step, specified as a
scalar from 0 to 1. A value of 0 means no
contribution from the previous step. A value of 1
means maximal contribution.
To change the default values, create an rlRepresentationOptions set and use dot notation to
access and change the properties of OptimizerParameters.
repOpts = rlRepresentationOptions;
repOpts.OptimizerParameters.GradientDecayFactor = 0.95;
Threshold value for the representation gradient, specified as Inf or a positive scalar. If the gradient
exceeds this value, the gradient is clipped as specified by the GradientThresholdMethod option.
Clipping the gradient limits how much the network parameters change in a training iteration.
Example: 'GradientThreshold',1
Gradient threshold method used to clip gradient values that exceed the gradient threshold, specified
as one of the following values.
For more information, see “Gradient Clipping” in the Algorithms section of trainingOptions in
Deep Learning Toolbox.
Example: 'GradientThresholdMethod',"absolute-value"
Factor for L2 regularization (weight decay), specified as a nonnegative scalar. For more information,
see “L2 Regularization” in the Algorithms section of trainingOptions in Deep Learning Toolbox.
To avoid overfitting when using a representation with many parameters, consider increasing the
L2RegularizationFactor option.
Example: 'L2RegularizationFactor',0.0005
Computation device used to perform deep neural network operations such as gradient computation,
parameter update and prediction during training. It is specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round off errors.
These errors can produce different results compared to performing the same operations on a CPU.
Note that if you want to use parallel processing to speed up training, you do not need to set
UseDevice. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs, see “Train Agents Using Parallel Computing and GPUs”.
Example: 'UseDevice',"gpu"
Object Functions
rlValueRepresentation (Not recommended) Value function critic representation for
reinforcement learning agents
rlQValueRepresentation (Not recommended) Q-Value function critic representation for
reinforcement learning agents
rlDeterministicActorRepresentation (Not recommended) Deterministic actor representation for
reinforcement learning agents
rlStochasticActorRepresentation (Not recommended) Stochastic actor representation for
reinforcement learning agents
Examples
Create an options set for creating a critic or actor representation for a reinforcement learning agent.
Set the learning rate for the representation to 0.05, and set the gradient threshold to 1. You can set
the options using Name,Value pairs when you create the options set. Any options that you do not
explicitly set have their default values.
repOpts = rlRepresentationOptions('LearnRate',5e-2,...
'GradientThreshold',1)
repOpts =
rlRepresentationOptions with properties:
LearnRate: 0.0500
GradientThreshold: 1
GradientThresholdMethod: "l2norm"
L2RegularizationFactor: 1.0000e-04
UseDevice: "cpu"
Optimizer: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]
Alternatively, create a default options set and use dot notation to change some of the values.
repOpts = rlRepresentationOptions;
repOpts.LearnRate = 5e-2;
repOpts.GradientThreshold = 1
repOpts =
rlRepresentationOptions with properties:
LearnRate: 0.0500
GradientThreshold: 1
GradientThresholdMethod: "l2norm"
L2RegularizationFactor: 1.0000e-04
UseDevice: "cpu"
Optimizer: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]
If you want to change the properties of the OptimizerParameters option, use dot notation to
access them.
repOpts.OptimizerParameters.Epsilon = 1e-7;
repOpts.OptimizerParameters
ans =
OptimizerParameters with properties:
Version History
Introduced in R2019a
Specifically, you can create an agent options object and set its CriticOptimizerOptions and
ActorOptimizerOptions properties to suitable rlOptimizerOptions objects. Then you pass the
agent options object to the function that creates the agent. This workflow is shown in the following
table.
agent = rlACAgent(actor,critic,agentOpts)
Alternatively, you can create the agent and then use dot notation to access the optimization options
for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;.
See Also
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlSACAgent
Soft actor-critic reinforcement learning agent
Description
The soft actor-critic (SAC) algorithm is a model-free, online, off-policy, actor-critic reinforcement
learning method. The SAC algorithm computes an optimal policy that maximizes both the long-term
expected reward and the entropy of the policy. The policy entropy is a measure of policy uncertainty
given the state. A higher entropy value promotes more exploration. Maximizing both the reward and
the entropy balances exploration and exploitation of the environment. The action space can only be
continuous.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
agent = rlSACAgent(observationInfo,actionInfo)
agent = rlSACAgent(observationInfo,actionInfo,initOptions)
agent = rlSACAgent(actor,critics)
Description
agent = rlSACAgent(actor,critics) creates a SAC agent with the specified actor and critic
networks and default agent options.
agent = rlSACAgent( ___ ,agentOptions) sets the AgentOptions property for any of the
previous syntaxes.
Input Arguments
actor — Actor
rlContinuousGaussianActor object
critics — Critic
rlQValueFunction object | two-element row vector of rlQValueFunction objects
For a SAC agent, each critic must be a single-output rlQValueFunction object that takes both the
action and observations as inputs.
For more information on creating critics, see “Create Policies and Value Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
Action specification for a continuous action space, specified as an rlNumericSpec object defining
properties such as dimensions, data type and name of the action signals.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
If you create a SAC agent with default actor and critic that use recurrent neural networks, the default
value of AgentOptions.SequenceLength is 32.
Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer, where S is the current observation, A is the action taken, R is
the corresponding reward, S' is the next observation, and D is the is-done signal.
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.
Note This option affects only simulation and deployment; it does not affect training.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create environment and obtain observation and action specifications. For this example, load the
environment used in the example “Train DDPG Agent to Control Double Integrator System”. The
observation from the environment is a vector containing the position and velocity of a mass. The
action is a scalar representing a force, applied to the mass, ranging continuously from -2 to 2
Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a SAC agent from the environment observation and action specifications.
agent = rlSACAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension)})
You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force, applied to the mass,
ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a SAC agent from the environment observation and action specifications using the
initialization options.
agent = rlSACAgent(obsInfo,actInfo,initOpts);
Extract the deep neural network from the actor.
actorNet = getModel(getActor(agent));
Extract the deep neural networks from the two critics. Note that getModel(critics) only returns
the first critic network.
critics = getCritic(agent);
criticNet1 = getModel(critics(1));
criticNet2 = getModel(critics(2));
Display the layers of the first critic network, and verify that each hidden fully connected layer has 128
neurons.
criticNet1.Layers
ans =
9x1 Layer array with layers:
Plot the networks of the actor and of the second critic, and display the number of weights.
plot(layerGraph(actorNet))
summary(actorNet)
Initialized: true
Inputs:
1 'input_1' 2 features
plot(layerGraph(criticNet2))
summary(criticNet2)
Initialized: true
Inputs:
1 'input_1' 2 features
2 'input_2' 1 features
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension)})
You can now test and train the agent within the environment.
Create an environment and obtain observation and action specifications. For this example, load the
environment used in the example “Train DDPG Agent to Control Double Integrator System”. The
observation from the environment is a vector containing the position and velocity of a mass. The
action is a scalar representing a force, applied to the mass, ranging continuously from -2 to 2
Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
A SAC agent uses two Q-value function critics. To approximate each Q-value function, use a neural
network. The network for a single-output Q-value function critic must have two input layers, one for
the observation and the other for the action, and return a scalar value representing the expected
cumulative long-term reward following from the given observation and action.
Define each network path as an array of layer objects, and the dimensions of the observation and
action spaces from the environment specification objects.
% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="obsPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="obsPathOut")
];
% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="actPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="actPathOut")
];
% Common path
commonPath = [
concatenationLayer(1,2,Name="concat")
reluLayer
fullyConnectedLayer(1)
];
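Assemble the three paths into a layer graph before connecting them; this follows the same pattern as
the other network-construction examples in this section.
% Assemble the layer graph from the three paths
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);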
% Connect layers
criticNet = connectLayers(criticNet,"obsPathOut","concat/in1");
criticNet = connectLayers(criticNet,"actPathOut","concat/in2");
To initialize the network weights differently for the two critics, create two different dlnetwork
objects. You must do this because the agent constructor function does not accept two identical
critics.
criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);
Initialized: true
Inputs:
1 'obsPathIn' 2 features
2 'actPathIn' 1 features
Create the two critics using rlQValueFunction, using the two networks with different weights.
Alternatively, if you use exactly the same network with the same weights, you must explicitly initialize
the network each time (to make sure weights are initialized differently) before passing it to
rlQValueFunction. To do so, use initialize.
critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo, ...
ActionInputNames="actPathIn",ObservationInputNames="obsPathIn");
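Create the second critic from the second network in the same way, then check the first critic with a
random observation and action (the call mirrors the one shown below for critic2).
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo, ...
    ActionInputNames="actPathIn",ObservationInputNames="obsPathIn");
getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})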
ans = single
-0.1330
getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.1526
To approximate the policy within the actor, use a deep neural network. Since SAC agents use a
continuous Gaussian actor, the network must take the observation signal as input and return both a
mean value and a standard deviation value for each action. Therefore it must have two output layers
(one for the mean values, the other for the standard deviation values), each having as many elements
as the dimension of the action space.
Do not add a tanhLayer or scalingLayer in the mean output path. The SAC agent internally
transforms the unbounded Gaussian distribution to the bounded distribution to compute the
probability density function and entropy properly.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input and output
layers, so you can later explicitly associate them with the appropriate channel.
% Define common input path
commonPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netObsIn")
fullyConnectedLayer(400)
reluLayer(Name="CommonRelu")];
% Connect layers
actorNet = connectLayers(actorNet,"CommonRelu","meanIn/in");
actorNet = connectLayers(actorNet,"CommonRelu","stdIn/in");
Initialized: true
Inputs:
1 'netObsIn' 2 features
Create the actor using actorNet, the observation and action specification objects, and the names of
the input and output layers.
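Assuming the output layer names chosen in the sketch above ('mean' and 'std'), a possible call is:
actor = rlContinuousGaussianActor(actorNet,obsInfo,actInfo, ...
    ObservationInputNames="netObsIn", ...
    ActionMeanOutputNames="mean", ...
    ActionStandardDeviationOutputNames="std");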
getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options, including training options for actor and critics.
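The criticOptions and actorOptions objects referenced below are rlOptimizerOptions objects; for
example (the values are illustrative):
criticOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
actorOptions  = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);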
agentOptions = rlSACAgentOptions;
agentOptions.SampleTime = env.Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.MiniBatchSize = 32;
agentOptions.CriticOptimizerOptions = criticOptions;
agentOptions.ActorOptimizerOptions = actorOptions;
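Create the SAC agent from the actor, the two critics, and the agent options.
agent = rlSACAgent(actor,[critic1 critic2],agentOptions);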
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension)})
You can now test and train the agent within the environment.
For this example, load the environment used in the example “Train DDPG Agent to Control Double
Integrator System”. The observation from the environment is a vector containing the position and
velocity of a mass. The action is a scalar representing a force, applied to the mass, ranging
continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
A SAC agent uses two Q-value function critics. To approximate each Q-value function, use a neural
network. The network for a single-output Q-value function critic must have two input layers, one for
the observation and the other for the action, and return a scalar value representing the expected
cumulative long-term reward following from the given observation and action.
Define each network path as an array of layer objects, and the dimensions of the observation and
action spaces from the environment specification objects. To create a recurrent neural network, use
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.
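One possible set of paths, using the layer names referenced by the connectLayers calls below
('obsOut', 'actOut', 'cat') and the input names shown in the network summary ('obsIn', 'actIn'), is
sketched here; the layer sizes are illustrative.
% Observation path (recurrent input)
obsPath = [
    sequenceInputLayer(prod(obsInfo.Dimension),Name="obsIn")
    fullyConnectedLayer(40)
    reluLayer
    fullyConnectedLayer(30,Name="obsOut")];
% Action path
actPath = [
    sequenceInputLayer(prod(actInfo.Dimension),Name="actIn")
    fullyConnectedLayer(30,Name="actOut")];
% Common path with an LSTM layer, ending in a scalar output
commonPath = [
    concatenationLayer(1,2,Name="cat")
    lstmLayer(16)
    reluLayer
    fullyConnectedLayer(1)];
% Assemble the layer graph from the three paths
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);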
% Connect paths
criticNet = connectLayers(criticNet,"obsOut","cat/in1");
criticNet = connectLayers(criticNet,"actOut","cat/in2");
To initialize the network weights differently for the two critics, create two different dlnetwork
objects. You must do this because the agent constructor function does not accept two identical
critics.
criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);
summary(criticNet1)
Initialized: true
Inputs:
1 'obsIn' Sequence input with 2 dimensions
2 'actIn' Sequence input with 1 dimensions
Create the critic using rlQValueFunction. Use the same network structure for both critics. The
SAC agent initializes the two networks using different default parameters.
critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo);
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo);
getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.0020
getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
0.0510
To approximate the policy within the actor, use a deep neural network. Since the critic has a
recurrent network, the actor must have a recurrent network too. The network must have two output
layers (one for the mean values, the other for the standard deviation values), each having as many
elements as the dimension of the action space.
Do not add a tanhLayer or scalingLayer in the mean output path. The SAC agent internally
transforms the unbounded Gaussian distribution to the bounded distribution to compute the
probability density function and entropy properly.
Define each network path as an array of layer objects and specify a name for the input and output
layers, so you can later explicitly associate them with the appropriate channel.
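A common input path consistent with the connections below (input layer named 'obsIn', output layer
named 'CommonOut') could be sketched as follows; the layer sizes are illustrative.
% Common recurrent input path
commonPath = [
    sequenceInputLayer(prod(obsInfo.Dimension),Name="obsIn")
    fullyConnectedLayer(400)
    lstmLayer(8)
    reluLayer(Name="CommonOut")];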
meanPath = [
fullyConnectedLayer(300,Name="MeanIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension),Name="Mean")
];
stdPath = [
fullyConnectedLayer(300,Name="StdIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
softplusLayer(Name="StandardDeviation")];
actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);
actorNet = connectLayers(actorNet,"CommonOut","MeanIn/in");
actorNet = connectLayers(actorNet,"CommonOut","StdIn/in");
Initialized: true
Inputs:
1 'obsIn' Sequence input with 2 dimensions
getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options. To use a recurrent neural network, you must specify a SequenceLength
greater than 1.
agentOptions = rlSACAgentOptions;
agentOptions.SampleTime = env.Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.SequenceLength = 32;
agentOptions.MiniBatchSize = 32;
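Create the agent from the actor, the two critics, and the options, following the same pattern as in the
previous example.
agent = rlSACAgent(actor,[critic1 critic2],agentOptions);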
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Version History
Introduced in R2020b
See Also
rlAgentInitializationOptions | rlSACAgentOptions | rlQValueFunction |
rlContinuousGaussianActor | initialize | Deep Network Designer
Topics
“Soft Actor-Critic Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlSACAgentOptions
Options for SAC agent
Description
Use an rlSACAgentOptions object to specify options for soft actor-critic (SAC) agents. To create a
SAC agent, use rlSACAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlSACAgentOptions
opt = rlSACAgentOptions(Name,Value)
Description
opt = rlSACAgentOptions creates an options object for use as an argument when creating a SAC
agent using all default options. You can modify the object properties using dot notation.
Properties
EntropyWeightOptions — Entropy tuning options
EntropyWeightOptions object
Optimizer learning rate, specified as a nonnegative scalar. If LearnRate is zero, the EntropyWeight
value is fixed during training and the TargetEntropy value is ignored.
Target entropy value for tuning entropy weight, specified as a scalar. A higher target entropy value
encourages more exploration.
If you do not specify TargetEntropy, the agent uses –A as the target value, where A is the number
of actions.
• "adam" — Use the Adam optimizer. You can specify the decay rates of the gradient and squared
gradient moving averages using the GradientDecayFactor and
SquaredGradientDecayFactor fields of the OptimizerParameters option.
• "sgdm" — Use the stochastic gradient descent with momentum (SGDM) optimizer. You can specify
the momentum value using the Momentum field of the OptimizerParameters option.
• "rmsprop" — Use the RMSProp optimizer. You can specify the decay rate of the squared gradient
moving average using the SquaredGradientDecayFactor fields of the OptimizerParameters
option.
For more information about these optimizers, see “Stochastic Gradient Descent” in Deep Learning
Toolbox.
Threshold value for the entropy gradient, specified as Inf or a positive scalar. If the gradient exceeds
this value, the gradient is clipped.
Applicable parameters for the optimizer, specified as an OptimizerParameters object with the
following parameters. The default parameter values work well for most problems.
To change the default values, access the properties of OptimizerParameters using dot notation.
opt = rlSACAgentOptions;
opt.EntropyWeightOptions.OptimizerParameters.GradientDecayFactor = 0.95;
Number of steps between actor policy updates, specified as a positive integer. For more information,
see “Training Algorithm”.
Number of steps between critic updates, specified as a positive integer. For more information, see
“Training Algorithm”.
Number of actions to take before updating actor and critics, specified as a positive integer. By
default, the NumWarmStartSteps value is equal to the MiniBatchSize value.
Number of gradient steps to take when updating actor and critics, specified as a positive integer.
Smoothing factor for target critic updates, specified as a positive scalar less than or equal to 1. For
more information, see “Target Update Methods”.
Number of steps between target critic updates, specified as a positive integer. For more information,
see “Target Update Methods”.
Option for clearing the experience buffer before training, specified as a logical value.
Maximum batch-training trajectory length when using a recurrent neural network, specified as a
positive integer. This value must be greater than 1 when using a recurrent neural network and 1
otherwise.
Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the actor and critics. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.
NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer
Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see [1], Chapter 7.
Note that if parallel training is enabled (that is, if an rlTrainingOptions object with the UseParallel property set to true is passed to train), then NumStepsToLookAhead must be set to 1; otherwise, an error is generated. This guarantees that experiences are stored contiguously.
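As a minimal sketch of a configuration consistent with this constraint (all other option values are left at their defaults):

% Sketch: parallel training requires single-step lookahead
agentOptions = rlSACAgentOptions(NumStepsToLookAhead=1);
trainOpts = rlTrainingOptions(UseParallel=true);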
Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlSACAgent Soft actor-critic reinforcement learning agent
Examples
Create an rlSACAgentOptions object that specifies the discount factor.

opt = rlSACAgentOptions('DiscountFactor',0.95)
opt =
rlSACAgentOptions with properties:
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
For SAC agents, configure the entropy weight optimizer using the options in
EntropyWeightOptions. For example, set the target entropy value to –5.
opt.EntropyWeightOptions.TargetEntropy = -5;
Version History
Introduced in R2020b
The UseDeterministicExploitation agent option is being replaced by the UseExplorationPolicy property of the agent.

Previously, you set UseDeterministicExploitation as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy deterministic policy for simulation and deployment.

agent.AgentOptions.UseDeterministicExploitation = true;

• Allow the agent to select its action by sampling its probability distribution for simulation and policy deployment, thereby using a stochastic policy that explores the observation space.

agent.AgentOptions.UseDeterministicExploitation = false;

Now, set UseExplorationPolicy as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy deterministic policy for simulation and deployment.

agent.UseExplorationPolicy = false;

• Allow the agent to select its action by sampling its probability distribution for simulation and policy deployment, thereby using a stochastic policy that explores the observation space.

agent.UseExplorationPolicy = true;
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.
See Also
rlSACAgent
Topics
“Soft Actor-Critic Agents”
rlSARSAAgent
SARSA reinforcement learning agent
Description
The SARSA algorithm is a model-free, online, on-policy reinforcement learning method. A SARSA
agent is a value-based reinforcement learning agent which trains a critic to estimate the return or
future rewards.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
agent = rlSARSAAgent(critic,agentOptions)
Description
Input Arguments
critic — Critic
rlQValueFunction object
Critic, specified as an rlQValueFunction object. For more information on creating critics, see
“Create Policies and Value Functions”.
Properties
AgentOptions — Agent options
rlSARSAAgentOptions object
Option to use exploration policy when selecting actions, specified as one of the following logical values.
• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The initial value of SampleTime matches the value specified in
AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create or load an environment interface. For this example load the Basic Grid World environment
interface also used in the example “Train Reinforcement Learning Agent in Basic Grid World”.
env = rlPredefinedEnv("BasicGridWorld");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a table approximation model derived from the environment observation and action
specifications.
qTable = rlTable(obsInfo,actInfo);
Create the critic using qTable. SARSA agents use an rlQValueFunction object to implement the critic.
critic = rlQValueFunction(qTable,obsInfo,actInfo);
Create a SARSA agent using the specified critic and an epsilon value of 0.05.
opt = rlSARSAAgentOptions;
opt.EpsilonGreedyExploration.Epsilon = 0.05;
agent = rlSARSAAgent(critic,opt)
agent =
rlSARSAAgent with properties:
To check your agent, use getAction to return the action from a random observation.
act = getAction(agent,{randi(numel(obsInfo.Elements))});
act{1}
ans = 1
You can now test and train the agent against the environment.
Version History
Introduced in R2019a
See Also
rlSARSAAgentOptions
Topics
“SARSA Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlSARSAAgentOptions
Options for SARSA agent
Description
Use an rlSARSAAgentOptions object to specify options for creating SARSA agents. To create a
SARSA agent, use rlSARSAAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlSARSAAgentOptions
opt = rlSARSAAgentOptions(Name,Value)
Description
Properties
EpsilonGreedyExploration — Options for epsilon-greedy exploration
EpsilonGreedyExploration object
At the end of each training time step, if Epsilon is greater than EpsilonMin, then it is updated
using the following formula.
Epsilon = Epsilon*(1-EpsilonDecay)
If your agent converges on local optima too quickly, you can promote agent exploration by increasing
Epsilon.
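As an illustration of the decay formula above, the following sketch (with illustrative values for Epsilon, EpsilonMin, and EpsilonDecay) shows how the exploration probability shrinks over training steps:

% Illustrative values; actual values come from the EpsilonGreedyExploration options
Epsilon = 1; EpsilonMin = 0.01; EpsilonDecay = 0.005;
for k = 1:1000
    if Epsilon > EpsilonMin
        Epsilon = Epsilon*(1-EpsilonDecay);
    end
end
Epsilon   % decays each step until it falls below EpsilonMin, then stays constant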
To specify exploration options, use dot notation after creating the rlSARSAAgentOptions object
opt. For example, set the epsilon value to 0.9.
opt.EpsilonGreedyExploration.Epsilon = 0.9;
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlSARSAAgent SARSA reinforcement learning agent
Examples
Create an rlSARSAAgentOptions object that specifies the agent sample time.

opt = rlSARSAAgentOptions('SampleTime',0.5)
opt =
rlSARSAAgentOptions with properties:
You can modify options using dot notation. For example, set the agent discount factor to 0.95.
opt.DiscountFactor = 0.95;
Version History
Introduced in R2019a
See Also
Topics
“SARSA Agents”
rlSimulationOptions
Options for simulating a reinforcement learning agent within an environment
Description
Use an rlSimulationOptions object to specify simulation options for simulating a reinforcement
learning agent within an environment. To perform the simulation, use sim.
For more information on agent training and simulation, see “Train Reinforcement Learning Agents”.
Creation
Syntax
simOpts = rlSimulationOptions
opt = rlSimulationOptions(Name,Value)
Description
Properties
MaxSteps — Number of steps to run the simulation
500 (default) | positive integer
Number of steps to run the simulation, specified as the comma-separated pair consisting of
'MaxSteps' and a positive integer. In general, you define episode termination conditions in the
environment. This value is the maximum number of steps to run in the simulation if those termination
conditions are not met.
Example: 'MaxSteps',1000
Example: 'NumSimulations',10
Stop simulation when an error occurs, specified as "off" or "on". When this option is "off", errors
are captured and returned in the SimulationInfo output of sim, and simulation continues.
Flag for using parallel simulation, specified as a logical. Setting this option to true configures the
simulation to use parallel processing to simulate the environment, thereby enabling usage of multiple
cores, processors, computer clusters or cloud resources to speed up simulation. To specify options for
parallel simulation, use the ParallelizationOptions property.
Note that if you want to speed up deep neural network calculations (such as gradient computation, parameter update, and prediction) using a local GPU, you do not need to set UseParallel to true.
Instead, when creating your actor or critic representation, use an rlRepresentationOptions
object in which the UseDevice option is set to "gpu".
Using parallel computing or the GPU requires Parallel Computing Toolbox software. Using computer
clusters or cloud resources additionally requires MATLAB Parallel Server™.
For more information about training using multicore processors and GPUs, see “Train Agents Using
Parallel Computing and GPUs”.
Example: 'UseParallel',true
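For example, a minimal sketch of running several simulations in parallel (the environment env and the agent agent are assumed to exist already):

% Sketch: run four simulations in parallel
simOpts = rlSimulationOptions('UseParallel',true,'NumSimulations',4);
experiences = sim(env,agent,simOpts);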
The ParallelizationOptions property contains a ParallelSimulation object with the following properties, which you can modify using dot notation after creating the rlSimulationOptions object.
• –1 — Assign a unique random seed to each worker. The value of the seed is the worker ID.
• –2 — Do not assign a random seed to the workers.
• Vector — Manually specify the random seed for each worker. The number of elements in the vector must match the number of workers.
Send model and workspace variables to parallel workers, specified as "on" or "off". When the
option is "on", the host sends variables used in models and defined in the base MATLAB workspace
to the workers.
Additional files to attach to the parallel pool, specified as a string or string array.
Function to run before simulation starts, specified as a handle to a function having no input
arguments. This function is run once per worker before simulation begins. Write this function to
perform any processing that you need prior to simulation.
Function to run after simulation ends, specified as a handle to a function having no input arguments.
You can write this function to clean up the workspace or perform other processing after simulation
terminates.
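As a sketch of how these hooks might be used, assuming the corresponding fields of the ParallelizationOptions object are named SetupFcn and CleanupFcn (the field names are not shown in the text above, so treat them as an assumption):

% Sketch: per-worker setup and cleanup hooks for parallel simulation
simOpts = rlSimulationOptions('UseParallel',true);
simOpts.ParallelizationOptions.SetupFcn = @() rng(0);          % seed each worker before simulation
simOpts.ParallelizationOptions.CleanupFcn = @() disp("done");  % report when a worker finishes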
Object Functions
sim Simulate trained reinforcement learning agents within specified environment
Examples
Create an options set for simulating a reinforcement learning environment. Set the number of steps
to simulate to 1000, and configure the options to run three simulations.
You can set the options using Name,Value pairs when you create the options set. Any options that you
do not explicitly set have their default values.
simOpts = rlSimulationOptions(...
'MaxSteps',1000,...
'NumSimulations',3)
simOpts =
rlSimulationOptions with properties:
MaxSteps: 1000
NumSimulations: 3
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelSimulation]
Alternatively, create a default options set and use dot notation to change some of the values.
simOpts = rlSimulationOptions;
simOpts.MaxSteps = 1000;
simOpts.NumSimulations = 3;
simOpts
simOpts =
rlSimulationOptions with properties:
MaxSteps: 1000
NumSimulations: 3
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelSimulation]
Version History
Introduced in R2019a
See Also
Topics
“Reinforcement Learning Agents”
rlStochasticActorPolicy
Policy object to generate stochastic actions for custom training loops and application deployment
Description
This object implements a stochastic policy, which returns stochastic actions given an input
observation, according to a probability distribution. You can create an rlStochasticActorPolicy
object from an rlDiscreteCategoricalActor or rlContinuousGaussianActor, or extract it
from an rlPGAgent, rlACAgent, rlPPOAgent, rlTRPOAgent, or rlSACAgent. You can then train
the policy object using a custom training loop or deploy it for your application using
generatePolicyBlock or generatePolicyFunction. If UseMaxLikelihoodAction is set to true, the policy is deterministic and therefore does not explore. For more information on policies and value functions, see “Create Policies and Value Functions”.
Creation
Syntax
policy = rlStochasticActorPolicy(actor)
Description
Properties
Actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object
Option to always use the maximum likelihood action, specified as a logical value: either false (default; the action is sampled from the probability distribution, which helps exploration) or true (the maximum likelihood action is always used). When the maximum likelihood action is always used, the policy is deterministic and therefore does not explore.
Example: false
Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.
Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2
Object Functions
generatePolicyBlock Generate Simulink block that evaluates policy of an agent or policy object
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
Examples
Create observation and action specification objects. For this example, define a continuous four-
dimensional observation space and a discrete action space having two possible actions.
Create a discrete categorical actor. This actor must accept an observation as input and return an
output vector in which each element represents the probability of taking the corresponding action.
To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation space and the number
of possible actions from the environment specification objects.
layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor using model, and the observation and action specifications.
actor = rlDiscreteCategoricalActor(model,obsInfo,actInfo)
actor =
rlDiscreteCategoricalActor with properties:
To return the probability distribution of the possible actions as a function of a random observation,
and given the current network weights, use evaluate.
prb = evaluate(actor,{rand(obsInfo.Dimension)});
prb{1}
0.5850
0.4150
Create the policy object from the actor.

policy = rlStochasticActorPolicy(actor)

policy = 
  rlStochasticActorPolicy with properties:
You can access the policy options using dot notation. For example, set the option to always use the
maximum likelihood action, thereby making the policy deterministic.
policy.UseMaxLikelihoodAction = true
policy =
rlStochasticActorPolicy with properties:
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = -1
You can now train the policy with a custom training loop and then deploy it to your application.
Version History
Introduced in R2022a
See Also
Functions
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlAdditiveNoisePolicy | rlStochasticActorPolicy | rlDiscreteCategoricalActor |
rlContinuousGaussianActor | rlPGAgent | rlACAgent | rlSACAgent | rlPPOAgent |
rlTRPOAgent | generatePolicyBlock | generatePolicyFunction
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”
rlStochasticActorRepresentation
(Not recommended) Stochastic actor representation for reinforcement learning agents
Description
This object implements a function approximator to be used as a stochastic actor within a
reinforcement learning agent. A stochastic actor takes the observations as inputs and returns a
random action, thereby implementing a stochastic policy with a specific probability distribution. After
you create an rlStochasticActorRepresentation object, use it to create a suitable agent, such
as an rlACAgent or rlPGAgent agent. For more information on creating representations, see
“Create Policies and Value Functions”.
Creation
Syntax
discActor = rlStochasticActorRepresentation(net,observationInfo,
discActionInfo,'Observation',obsName)
discActor = rlStochasticActorRepresentation({basisFcn,W0},observationInfo,
actionInfo)
discActor = rlStochasticActorRepresentation( ___ ,options)
contActor = rlStochasticActorRepresentation(net,observationInfo,
contActionInfo,'Observation',obsName)
contActor = rlStochasticActorRepresentation( ___ ,options)
Description
Discrete Action Space Stochastic Actor
discActor = rlStochasticActorRepresentation(net,observationInfo,
discActionInfo,'Observation',obsName) creates a stochastic actor with a discrete action
space, using the deep neural network net as function approximator. Here, the output layer of net
must have as many elements as the number of possible discrete actions. This syntax sets the
ObservationInfo and ActionInfo properties of discActor to the inputs observationInfo and
discActionInfo, respectively. obsName must contain the names of the input layers of net.
discActor = rlStochasticActorRepresentation({basisFcn,W0},observationInfo,
actionInfo) creates a discrete space stochastic actor using a custom basis function as underlying
approximator. The first input argument is a two-element cell array in which the first element contains the
handle basisFcn to a custom basis function, and the second element contains the initial weight
matrix W0. This syntax sets the ObservationInfo and ActionInfo properties of discActor to the inputs
observationInfo and actionInfo, respectively.
contActor = rlStochasticActorRepresentation(net,observationInfo,
contActionInfo,'Observation',obsName) creates a Gaussian stochastic actor with a
continuous action space using the deep neural network net as function approximator. Here, the
output layer of net must have twice as many elements as the number of dimensions of the continuous
action space. This syntax sets the ObservationInfo and ActionInfo properties of contActor to the
inputs observationInfo and contActionInfo respectively. obsName must contain the names of
the input layers of net.
Note contActor does not enforce constraints set by the action specification. Therefore, when using this actor, you must enforce action space constraints within the environment.
Input Arguments
Deep neural network used as the underlying approximator within the actor, specified as one of the
following:
For a discrete action space stochastic actor, net must have the observations as input and a single
output layer having as many elements as the number of possible discrete actions. Each element
represents the probability (which must be nonnegative) of executing the corresponding action.
For a continuous action space stochastic actor, net must have the observations as input and a single
output layer having twice as many elements as the number of dimensions of the continuous action
space. The elements of the output vector represent all the mean values followed by all the standard
deviations (which must be nonnegative) of the Gaussian distributions for the dimensions of the action
space.
Note The fact that standard deviations must be nonnegative while mean values must fall within the
output range means that the network must have two separate paths. The first path must produce an
estimation for the mean values, so any output nonlinearity must be scaled so that its output falls in
the desired range. The second path must produce an estimation for the standard deviations, so you
must use a softplus or ReLU layer to enforce nonnegativity.
The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names specified in obsName. The network output layer must have the same data type and
dimension as the signal defined in ActionInfo.
For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the input layers in net.
Example: {'my_obs'}
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the actor is the vector a = softmax(W'*B), where W is a weight matrix and B is the column
vector returned by the custom basis function. Each element of a represents the probability of taking
the corresponding action. The learnable parameters of the actor are the elements of W.
When creating a stochastic actor representation, your basis function must have the following
signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here obs1 to obsN are observations in the same order and with the same data type and dimensions as the signals defined in observationInfo.
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]
Initial value of the basis function weights, W, specified as a matrix. It must have as many rows as the
length of the basis function output, and as many columns as the number of possible actions.
Properties
Options — Representation options
rlRepresentationOptions object
Note TRPO agents use only the Options.UseDevice representation options and ignore the other
training and learning rate options.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually.
For custom basis function representations, the action signal must be a scalar, a column vector, or a
discrete action.
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment observations
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as consisting of
three values, -10, 0, and 10.
Create a deep neural network approximator for the actor. The input of the network (here called
state) must accept a four-element vector (the observation vector just defined by obsInfo), and its
output (here called actionProb) must be a three-element vector. Each element of the output vector
must be between 0 and 1 since it represents the probability of executing each of the three possible
actions (as defined by actInfo). Using softmax as the output layer enforces this requirement.
net = [ featureInputLayer(4,'Normalization','none',...
'Name','state')
fullyConnectedLayer(3,'Name','fc')
softmaxLayer('Name','actionProb') ];
Create the actor with rlStochasticActorRepresentation, using the network, the observations
and action specification objects, as well as the names of the network input layer.
actor = rlStochasticActorRepresentation(net,obsInfo,actInfo,...
'Observation','state')
actor =
rlStochasticActorRepresentation with properties:
To validate your actor, use getAction to return a random action from the observation vector [1 1 1 1], using the current network weights.

act = getAction(actor,{[1 1 1 1]'});
act{1}

ans = 10
You can now use the actor to create a suitable agent, such as an rlACAgent, or rlPGAgent agent.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous six-dimensional space, so that a single observation is a column vector containing 6
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing 2 doubles both between
-10 and 10.
Create a deep neural network approximator for the actor. The observation input (here called myobs)
must accept a six-dimensional vector (the observation vector just defined by obsInfo). The output
(here called myact) must be a four-dimensional vector (twice the number of dimensions defined by
actInfo). The elements of the output vector represent, in sequence, all the means and all the
standard deviations of every action.
The fact that standard deviations must be non-negative while mean values must fall within the output
range means that the network must have two separate paths. The first path is for the mean values,
and any output nonlinearity must be scaled so that it can produce values in the output range. The
second path is for the standard deviations, and you can use a softplus or relu layer to enforce non-
negativity.
actorOpts = rlRepresentationOptions('LearnRate',8e-3,'GradientThreshold',1);
Create the actor with rlStochasticActorRepresentation, using the network, the observations
and action specification objects, the names of the network input layer and the options object.
actor =
rlStochasticActorRepresentation with properties:
To check your actor, use getAction to return a random action from the observation vector
ones(6,1), using the current network weights.
act = getAction(actor,{ones(6,1)});
act{1}
-0.0763
9.6860
You can now use the actor to create a suitable agent (such as an rlACAgent, rlPGAgent, or
rlPPOAgent agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a continuous two-dimensional space, so that a single observation is a column vector containing 2 doubles.
The stochastic actor based on a custom basis function does not support continuous action spaces.
Therefore, create a discrete action space specification object (or alternatively use getActionInfo to
extract the specification object from an environment with a discrete action space). For this example,
define the action space as a finite set consisting of 3 possible values (named 7, 5, and 3 in this case).
Create a custom basis function. Each element is a function of the observations defined by obsInfo.
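The definition of the basis function itself is not shown here. A minimal sketch that maps the two-element observation to a four-element feature vector (the specific features chosen are illustrative assumptions, not the ones used in the original example) is:

% Sketch: map a 2-by-1 observation to a 4-by-1 feature vector
myBasisFcn = @(myobs) [myobs(1);
                       myobs(2);
                       myobs(1)*myobs(2);
                       myobs(1)^2];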
The output of the actor is the action, among the ones defined in actInfo, corresponding to the
element of softmax(W'*myBasisFcn(myobs)) which has the highest value. W is a weight matrix,
containing the learnable parameters, which must have as many rows as the length of the basis
function output, and as many columns as the number of possible actions.
W0 = rand(4,3);
Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.
actor = rlStochasticActorRepresentation({myBasisFcn,W0},obsInfo,actInfo)
actor =
rlStochasticActorRepresentation with properties:
To check your actor use the getAction function to return one of the three possible actions,
depending on a given random observation and on the current parameter matrix.
v = getAction(actor,{rand(2,1)})
You can now use the actor (along with a critic) to create a suitable discrete action space agent.
For this example, you create a stochastic actor with a discrete action space using a recurrent neural
network. You can also use a recurrent neural network for a continuous stochastic actor using the
same method.
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);
Create a recurrent deep neural network for the actor. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.
actorNetwork = [
sequenceInputLayer(numObs,'Normalization','none','Name','state')
fullyConnectedLayer(8,'Name','fc')
reluLayer('Name','relu')
lstmLayer(8,'OutputMode','sequence','Name','lstm')
fullyConnectedLayer(numDiscreteAct,'Name','output')
softmaxLayer('Name','actionProb')];
Version History
Introduced in R2020a
See Also
Functions
rlDiscreteCategoricalActor | rlContinuousGaussianActor | rlRepresentationOptions
| getActionInfo | getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlTable
Value table or Q table
Description
Value tables and Q tables are one way to represent critic networks for reinforcement learning. Value
tables store rewards for a finite set of observations. Q tables store rewards for corresponding finite
observation-action pairs.
Creation
Syntax
T = rlTable(obsinfo)
T = rlTable(obsinfo,actinfo)
Description
Input Arguments
Properties
Table — Reward table
array
• For a value table, the array contains NO rows, where NO is the number of finite observation values.
• For a Q table, the array contains NO rows and NA columns, where NA is the number of possible finite actions.
Object Functions
rlValueFunction Value function approximator object for reinforcement learning agents
rlQValueFunction Q-Value function approximator object for reinforcement learning agents
rlVectorQValueFunction Vector Q-value function approximator for reinforcement learning agents
Examples
This example shows how to use rlTable to create a value table. You can use such a table to
represent the critic of an actor-critic agent with a finite observation space.
env = rlPredefinedEnv("BasicGridWorld");
obsInfo = getObservationInfo(env)
obsInfo =
rlFiniteSetSpec with properties:
vTable = rlTable(obsInfo)
vTable =
rlTable with properties:
Create a Q Table
This example shows how to use rlTable to create a Q table. Such a table could be used to represent
the actor or critic of an agent with finite observation and action spaces.
Create an environment interface, and obtain its observation and action specifications.
env = rlMDPEnv(createMDP(8,["up";"down"]));
obsInfo = getObservationInfo(env)
obsInfo =
rlFiniteSetSpec with properties:
Dimension: [1 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
qTable = rlTable(obsInfo,actInfo)
qTable =
rlTable with properties:
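As a sketch of how you might inspect or seed the table entries using the Table property described above (the values assigned here are illustrative, not part of the original example):

% The Q table has 8 rows (one per MDP state) and 2 columns (one per action)
size(qTable.Table)
qTable.Table(1,:) = [0.5 -0.5];   % illustrative initial values for state 1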
Version History
Introduced in R2019a
See Also
Topics
“Create Policies and Value Functions”
rlTD3Agent
Twin-delayed deep deterministic policy gradient reinforcement learning agent
Description
The twin-delayed deep deterministic policy gradient (TD3) algorithm is an actor-critic, model-free, online, off-policy reinforcement learning method which computes an optimal policy that maximizes the long-term reward. The action space can only be continuous.
• Twin-delayed deep deterministic policy gradient (TD3) agent with two Q-value functions. This
agent prevents overestimation of the value function by learning two Q value functions and using
the minimum values for policy updates.
• Delayed deep deterministic policy gradient (delayed DDPG) agent with a single Q value function.
This agent is a DDPG agent with target policy smoothing and delayed policy and target updates.
For more information, see “Twin-Delayed Deep Deterministic Policy Gradient Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.
Creation
Syntax
agent = rlTD3Agent(observationInfo,actionInfo)
agent = rlTD3Agent(observationInfo,actionInfo,initOpts)
agent = rlTD3Agent(actor,critics,agentOptions)
Description
Create Agent from Observation and Action Specifications
agent = rlTD3Agent( ___ ,agentOptions) creates a TD3 agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
Input Arguments
actor — Actor
rlContinuousDeterministicActor object
• rlQValueFunction object — Create a delayed DDPG agent with a single Q value function. This
agent is a DDPG agent with target policy smoothing and delayed policy and target updates.
• Two-element row vector of rlQValueFunction objects — Create a TD3 agent with two critic
value functions. The two critic networks must be unique rlQValueFunction objects with the
same observation and action specifications. The critics can either have different structures or the
same structure but with different initial parameters.
For more information on creating critics, see “Create Policies and Value Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
Since a TD3 agent operates in a continuous action space, you must specify actionInfo as an
rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlNumericSpec.
If you create a TD3 agent with default actor and critic that use recurrent neural networks, the default
value of AgentOptions.SequenceLength is 32.
Option to use exploration policy when selecting actions, specified as one of the following logical values.
• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer. Here:
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force, applied to the mass,
ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a TD3 agent from the environment observation and action specifications.
agent = rlTD3Agent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.
% rng(0)
Create a TD3 agent from the environment observation and action specifications.

agent = rlTD3Agent(obsInfo,actInfo,initOpts);

Extract the deep neural network from the actor.

actorNet = getModel(getActor(agent));
Extract the deep neural networks from the two critics. Note that getModel(critics) only returns
the first critic network.
critics = getCritic(agent);
criticNet1 = getModel(critics(1));
criticNet2 = getModel(critics(2));
Display the layers of the first critic network, and verify that each hidden fully connected layer has 128
neurons.
criticNet1.Layers
ans =
13x1 Layer array with layers:
Plot the networks of the actor and of the second critic, and display the number of weights.
plot(layerGraph(actorNet))
summary(actorNet)
Initialized: true
Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features
plot(layerGraph(criticNet2))
summary(criticNet2)
Initialized: true
Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features
3 'input_3' 1 features
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force ranging continuously from
-2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
A TD3 agent uses two Q-value function critics. To approximate each Q-value function, use a neural
network. The network for a single-output Q-value function critic must have two input layers, one for
the observation and the other for the action, and return a scalar value representing the expected
cumulative long-term reward following from the given observation and action.
Define each network path as an array of layer objects, and get the dimensions of the observation and action spaces from the environment specification objects.
% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="obsPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="obsPathOut")
];
% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="actPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="actPathOut")
];
% Common path
commonPath = [
concatenationLayer(1,2,Name="concat")
reluLayer
fullyConnectedLayer(1)
];
% Assemble the layerGraph network and add the paths
criticNet = layerGraph;
criticNet = addLayers(criticNet,obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);

% Connect layers
criticNet = connectLayers(criticNet,"obsPathOut","concat/in1");
criticNet = connectLayers(criticNet,"actPathOut","concat/in2");
To initialize the network weights differently for the two critics, create two different dlnetwork objects. You must do this because the agent constructor function does not accept two identical critics.
criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);
summary(criticNet1)
Initialized: true
Inputs:
1 'obsPathIn' 2 features
2 'actPathIn' 1 features
Create the two critics using rlQValueFunction, using the two networks with different weights.
Alternatively, if you use exactly the same network with the same weights, you must explicitly initialize
the network each time (to make sure weights are initialized differently) before passing it to
rlQValueFunction. To do so, use initialize.
critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo);
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo);
Check the critics using a random observation and a random action.

getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
    -0.1330
getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.1526
Create a neural network to be used as approximation model within the actor. For TD3 agents, the actor executes a deterministic policy, which is implemented by a continuous deterministic actor. In this case the network must take the observation signal as input and return an action. Therefore the output layer must have as many elements as the number of dimensions of the action space.

Define the network as an array of layer objects, and get the dimension of the observation space and the number of dimensions of the action space from the environment specification objects.
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(400)
reluLayer
fullyConnectedLayer(300)
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
tanhLayer
];
Convert the network to a dlnetwork object and display a summary.

actorNet = dlnetwork(actorNet);
summary(actorNet)

   Initialized: true
   Inputs:
      1   'input'   2 features
Create the actor using actorNet. TD3 agents use an rlContinuousDeterministicActor object
to implement the actor.
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);
To check your actor, use getAction to return the action from a random observation.

getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options, including training options for actor and critics.
agentOptions = rlTD3AgentOptions;
agentOptions.DiscountFactor = 0.99;
agentOptions.TargetSmoothFactor = 5e-3;
agentOptions.TargetPolicySmoothModel.Variance = 0.2;
agentOptions.TargetPolicySmoothModel.LowerLimit = -0.5;
agentOptions.TargetPolicySmoothModel.UpperLimit = 0.5;
agentOptions.CriticOptimizerOptions = criticOptions;
agentOptions.ActorOptimizerOptions = actorOptions;
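The definitions of criticOptions and actorOptions, and the construction of the two-critic agent that is checked below, are not shown above. A minimal sketch of those steps, assuming rlOptimizerOptions objects with illustrative learning rates and gradient thresholds (the options must be created before the assignments above), is:

% Sketch: optimizer options for the actor and the critics (values are illustrative)
criticOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
actorOptions  = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);

% Sketch: create the TD3 agent using the actor, the two critics, and the options
agent = rlTD3Agent(actor,[critic1 critic2],agentOptions);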
You can also create an rlTD3Agent object with a single critic. In this case, the object represents a
DDPG agent with target policy smoothing and delayed policy and target updates.
delayedDDPGAgent = rlTD3Agent(actor,critic1,agentOptions);
To check your agents, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
getAction(delayedDDPGAgent,{rand(obsInfo.Dimension)})
You can now test and train either agent within the environment.
For this example, load the environment used in the example “Train DDPG Agent to Control Double
Integrator System”. The observation from the environment is a vector containing the position and
velocity of a mass. The action is a scalar representing a force ranging continuously from -2 to 2
Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
A TD3 agent uses two Q-value function critics. To approximate each Q-value function, use a deep
recurrent neural network. The network for a single-output Q-value function critic must have two
input layers, one for the observation and the other for the action, and return a scalar value
representing the expected cumulative long-term reward following from the given observation and
action.
Define each network path as an array of layer objects, and get the dimensions of the observation and action spaces from the environment specification objects. To create a recurrent neural network, use a sequenceInputLayer as the input layer and include an lstmLayer as one of the other network layers.
criticNet = layerGraph;
criticNet = addLayers(criticNet,obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);
% Connect paths
criticNet = connectLayers(criticNet,"obsOut","cat/in1");
criticNet = connectLayers(criticNet,"actOut","cat/in2");
To initialize the network weights differently for the two critics, create two different dlnetwork objects. You must do this because the agent constructor function does not accept two identical critics.
criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);
summary(criticNet1)
Initialized: true
Inputs:
1 'obsIn' Sequence input with 2 dimensions (CTB)
2 'actIn' Sequence input with 1 dimensions (CTB)
Create the critic using rlQValueFunction. Use the same network structure for both critics. The
TD3 agent initializes the two networks using different default parameters.
critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo);
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo);
getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.0060
getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
0.0481
Since the critics have recurrent networks, the actor must use a recurrent network as its approximation model too. For TD3 agents, the actor executes a deterministic policy, which is implemented by a continuous deterministic actor. In this case the network must take the observation signal as input and return an action. Therefore the output layer must have as many elements as the number of dimensions of the action space.

Define the network as an array of layer objects, and get the dimension of the observation space and the number of dimensions of the action space from the environment specification objects.
actorNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(400)
lstmLayer(8)
reluLayer
fullyConnectedLayer(300,'Name','ActorFC2')
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
tanhLayer
];
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 2 dimensions (CTB)
Create the actor using actorNet. TD3 agents use an rlContinuousDeterministicActor object
to implement the actor.
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);
To check your actor, use getAction to return the action from a random observation.

getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options, including training options for actor and critics. To use a TD3 agent with
recurrent neural networks, you must specify a SequenceLength greater than 1.
agentOptions = rlTD3AgentOptions;
agentOptions.DiscountFactor = 0.99;
agentOptions.SequenceLength = 32;
agentOptions.TargetSmoothFactor = 5e-3;
agentOptions.TargetPolicySmoothModel.Variance = 0.2;
agentOptions.TargetPolicySmoothModel.LowerLimit = -0.5;
agentOptions.TargetPolicySmoothModel.UpperLimit = 0.5;
agentOptions.CriticOptimizerOptions = criticOptions;
agentOptions.ActorOptimizerOptions = actorOptions;
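As in the previous example, the definitions of criticOptions and actorOptions and the construction of the two-critic agent are not shown. A minimal sketch under the same assumptions (illustrative rlOptimizerOptions settings; the options must be created before the assignments above) is:

% Sketch: optimizer options and agent creation (values are illustrative)
criticOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
actorOptions  = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
agent = rlTD3Agent(actor,[critic1 critic2],agentOptions);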
You can also create an rlTD3Agent object with a single critic. In this case, the object represents a
DDPG agent with target policy smoothing and delayed policy and target updates.
delayedDDPGAgent = rlTD3Agent(actor,critic1,agentOptions);
To check your agents, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
getAction(delayedDDPGAgent,{rand(obsInfo.Dimension)})
You can now test and train either agent within the environment.
Version History
Introduced in R2020a
See Also
rlAgentInitializationOptions | rlTD3AgentOptions | rlQValueFunction |
rlContinuousDeterministicActor | Deep Network Designer
Topics
“Twin-Delayed Deep Deterministic Policy Gradient Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
“Train Biped Robot to Walk Using Reinforcement Learning Agents”
rlTD3AgentOptions
Options for TD3 agent
Description
Use an rlTD3AgentOptions object to specify options for twin-delayed deep deterministic policy
gradient (TD3) agents. To create a TD3 agent, use rlTD3Agent.
For more information, see “Twin-Delayed Deep Deterministic Policy Gradient Agents”.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlTD3AgentOptions
opt = rlTD3AgentOptions(Name,Value)
Description
opt = rlTD3AgentOptions creates an options object for use as an argument when creating a TD3
agent using all default options. You can modify the object properties using dot notation.
Properties
ExplorationModel — Exploration noise model options
GaussianActionNoise object (default) | OrnsteinUhlenbeckActionNoise object
For an agent with multiple actions, if the actions have different ranges and units, it is likely that each
action requires different noise model parameters. If the actions have similar ranges and units, you
can set the noise parameters for all actions to the same value.
For example, for an agent with two actions, set the standard deviation of each action to a different
value while using the same decay rate for both standard deviations.
opt = rlTD3AgentOptions;
opt.ExplorationModel.StandardDeviation = [0.1 0.2];
opt.ExplorationModel.StandardDeviationDecayRate = 1e-4;
opt = rlTD3AgentOptions;
opt.ExplorationModel = rl.option.OrnsteinUhlenbeckActionNoise;
opt.ExplorationModel.StandardDeviation = 0.05;
Target smoothing noise model options, specified as a GaussianActionNoise object. This model adds noise to the target action, which keeps the policy from exploiting actions with incorrectly high Q-value estimates. For more information on noise models, see “Noise Models” on page 3-365.
For an agent with multiple actions, if the actions have different ranges and units, it is likely that each
action requires different smoothing noise model parameters. If the actions have similar ranges and
units, you can set the noise parameters for all actions to the same value.
For example, for an agent with two actions, set the standard deviation of each action to a different
value while using the same decay rate for both standard deviations.
opt = rlTD3AgentOptions;
opt.TargetPolicySmoothModel.StandardDeviation = [0.1 0.2];
opt.TargetPolicySmoothModel.StandardDeviationDecayRate = 1e-4;
Smoothing factor for target actor and critic updates, specified as a positive scalar less than or equal
to 1. For more information, see “Target Update Methods”.
Number of steps between target actor and critic updates, specified as a positive integer. For more
information, see “Target Update Methods”.
Option for clearing the experience buffer before training, specified as a logical value.
Maximum batch-training trajectory length when using a recurrent neural network, specified as a
positive integer. This value must be greater than 1 when using a recurrent neural network and 1
otherwise.
Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the critic properties. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.
NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer
Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see [1], Chapter 7.
Note that if parallel training is enabled (that is, if an rlTrainingOptions object with the UseParallel property set to true is passed to train), then NumStepsToLookAhead must be set to 1; otherwise, an error is generated. This guarantees that experiences are stored contiguously.
Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
Examples
Create an rlTD3AgentOptions object, specifying the mini-batch size.
opt = rlTD3AgentOptions('MiniBatchSize',48)
opt =
rlTD3AgentOptions with properties:
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Algorithms
Noise Models
Gaussian Action Noise
At each time step k, the Gaussian noise v is sampled as shown in the following code.
w = Mean + randn(ActionSize).*StandardDeviation(k);
v(k+1) = min(max(w,LowerLimit),UpperLimit);
Where the initial value v(1) is defined by the InitialAction parameter. At each sample time step,
the standard deviation decays as shown in the following code.
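A minimal sketch of the decay update, assuming a geometric decay floored at StandardDeviationMin (consistent with the half-life formula given later in this section):
% Assumed decay update for the noise standard deviation
decayedStandardDeviation = StandardDeviation.*(1 - StandardDeviationDecayRate);
StandardDeviation = max(decayedStandardDeviation,StandardDeviationMin);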
Ornstein-Uhlenbeck Action Noise
At each sample time step k, the noise value v(k) is updated using the following formula, where Ts is the agent sample time, and the initial value v(1) is defined by the InitialAction parameter.
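A sketch of one common discretized form of this update, assuming the Mean and MeanAttractionConstant properties of the OrnsteinUhlenbeckActionNoise object:
% Assumed Ornstein-Uhlenbeck update
v(k+1) = v(k) + MeanAttractionConstant.*(Mean - v(k)).*Ts ...
       + StandardDeviation(k).*randn(size(Mean)).*sqrt(Ts);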
At each sample time step, the standard deviation decays in the same way as shown above for the Gaussian action noise model.
You can calculate how many samples it will take for the standard deviation to be halved using this
simple formula.
halflife = log(0.5)/log(1-StandardDeviationDecayRate);
For continuous action signals, it is important to set the noise standard deviation appropriately to
encourage exploration. It is common to set StandardDeviation*sqrt(Ts) to a value between 1%
and 10% of your action range.
If your agent converges on local optima too quickly, promote agent exploration by increasing the
amount of noise; that is, by increasing the standard deviation. Also, to increase exploration, you can
reduce the StandardDeviationDecayRate.
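For instance, a rough sizing sketch, assuming action limits of -2 and 2 and a sample time of 0.05 seconds:
Ts = 0.05;                          % assumed agent sample time
actionRange = 2 - (-2);             % assumed action limits
sigma = 0.05*actionRange/sqrt(Ts);  % so that sigma*sqrt(Ts) is 5% of the range
opt = rlTD3AgentOptions(SampleTime=Ts);
opt.ExplorationModel.StandardDeviation = sigma;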
Version History
Introduced in R2020a
The properties defining the probability distribution of the Gaussian action noise model have changed.
This noise model is used by TD3 agents for exploration and target policy smoothing.
When a GaussianActionNoise noise object saved from a previous MATLAB release is loaded, the
value of VarianceDecayRate is copied to StandardDeviationDecayRate, while the square root
of the values of Variance and VarianceMin are copied to StandardDeviation and
StandardDeviationMin, respectively.
The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the Gaussian action noise model, use the new
property names instead.
Update Code
This table shows how to update your code to use the new property names for rlTD3AgentOptions
object td3opt.
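For illustration, assuming an exploration model previously configured with Variance = 0.05 and VarianceDecayRate = 1e-4, the equivalent assignments with the new property names are:
% Old (not recommended) property names, shown for comparison:
% td3opt.ExplorationModel.Variance = 0.05;
% td3opt.ExplorationModel.VarianceDecayRate = 1e-4;
% New property names:
td3opt.ExplorationModel.StandardDeviation = sqrt(0.05);
td3opt.ExplorationModel.StandardDeviationDecayRate = 1e-4;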
The properties defining the probability distribution of the Ornstein-Uhlenbeck (OU) noise model have
been renamed. TD3 agents use OU noise for exploration.
The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the OU noise model, use the new property
names instead.
Update Code
This table shows how to update your code to use the new property names for rlTD3AgentOptions
object td3opt.
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.
See Also
Topics
“Twin-Delayed Deep Deterministic Policy Gradient Agents”
rlTrainingOptions
Options for training reinforcement learning agents
Description
Use an rlTrainingOptions object to specify training options for an agent. To train an agent, use
train.
For more information on training agents, see “Train Reinforcement Learning Agents”.
Creation
Syntax
trainOpts = rlTrainingOptions
opt = rlTrainingOptions(Name,Value)
Description
Properties
MaxEpisodes — Maximum number of episodes to train the agent
500 (default) | positive integer
Maximum number of episodes to train the agent, specified as a positive integer. Regardless of other
criteria for termination, training terminates after MaxEpisodes.
Example: 'MaxEpisodes',1000
Maximum number of steps to run per episode, specified as a positive integer. In general, you define
episode termination conditions in the environment. This value is the maximum number of steps to run
in the episode if other termination conditions are not met.
Example: 'MaxStepsPerEpisode',1000
Window length for averaging the scores, rewards, and number of steps for each agent, specified as a
scalar or vector.
If the training environment is a multi-agent Simulink environment, specify a scalar to apply the same window length to all agents. To use a different window length for each agent, specify a vector. In this case, the order of the elements in the vector corresponds to the order of the agents used during environment creation.
Example: 'ScoreAveragingWindowLength',10
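For instance, a sketch for an assumed two-agent Simulink environment:
trainOpts = rlTrainingOptions;
trainOpts.ScoreAveragingWindowLength = [10 20];  % one window length per agent (assumed two agents)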
• "AverageSteps" — Stop training when the running average number of steps per episode equals
or exceeds the critical value specified by the option StopTrainingValue. The average is
computed using the window 'ScoreAveragingWindowLength'.
• "AverageReward" — Stop training when the running average reward equals or exceeds the
critical value.
• "EpisodeReward" — Stop training when the reward in the current episode equals or exceeds the
critical value.
• "GlobalStepCount" — Stop training when the total number of steps in all episodes (the total
number of times the agent is invoked) equals or exceeds the critical value.
• "EpisodeCount" — Stop training when the number of training episodes equals or exceeds the
critical value.
Example: 'StopTrainingCriteria',"AverageReward"
If the training environment is a multi-agent Simulink environment, specify a scalar to apply the same
termination criterion to all agents. To use a different termination criterion for each agent, specify
StopTrainingValue as a vector. In this case, the order of the elements in the vector corresponds to
the order of the agents used during environment creation.
For a given agent, training ends when the termination condition specified by the StopTrainingCriteria option equals or exceeds this value. For the other agents, the training continues until:
• The remaining agents reach their own termination criteria.
• The maximum number of episodes is reached.
• You stop training manually (for example, using the Stop Training button in Episode Manager).
Condition for saving agents during training, specified as "none" (do not save any agents), "EpisodeReward", "AverageSteps", "AverageReward", "GlobalStepCount", or "EpisodeCount". These criteria are analogous to the corresponding StopTrainingCriteria values.
Set this option to store candidate agents that perform well according to the criteria you specify. When
you set this option to a value other than "none", the software sets the SaveAgentValue option to
500. You can change that value to specify the condition for saving the agent.
For instance, suppose you want to store for further testing any agent that yields an episode reward
that equals or exceeds 100. To do so, set SaveAgentCriteria to "EpisodeReward" and set the
SaveAgentValue option to 100. When an episode reward equals or exceeds 100, train saves the
corresponding agent in a MAT file in the folder specified by the SaveAgentDirectory option. The
MAT file is called AgentK.mat, where K is the number of the corresponding episode. The agent is
stored within that MAT file as saved_agent.
Example: 'SaveAgentCriteria',"EpisodeReward"
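A sketch of this saving configuration (the threshold value of 100 follows the description above; the folder name is an assumption):
opt = rlTrainingOptions( ...
    SaveAgentCriteria="EpisodeReward", ...
    SaveAgentValue=100, ...
    SaveAgentDirectory="savedAgents");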
Critical value of the condition for saving agents, specified as a scalar or a vector.
If the training environment is a multi-agent Simulink environment, specify a scalar to apply the same
saving criterion to each agent. To save the agents when one meets a particular criterion, specify
SaveAgentValue as a vector. In this case, the order of the elements in the vector corresponds to the
order of the agents used when creating the environment. When a criterion for saving an agent is met, all agents are saved in the same MAT file.
When you specify a condition for saving candidate agents using SaveAgentCriteria, the software
sets this value to 500. Change the value to specify the condition for saving the agent. See the
SaveAgentCriteria option for more details.
Example: 'SaveAgentValue',100
Folder for saved agents, specified as a string or character vector. The folder name can contain a full
or relative path. When an episode occurs that satisfies the condition specified by the
SaveAgentCriteria and SaveAgentValue options, the software saves the agents in a MAT file in
this folder. If the folder does not exist, train creates it. When SaveAgentCriteria is "none", this
option is ignored and train does not create a folder.
Example: 'SaveAgentDirectory', pwd + "\run1\Agents"
Flag for using parallel training, specified as a logical. Setting this option to true configures
training to use parallel processing to simulate the environment, thereby enabling usage of multiple
cores, processors, computer clusters or cloud resources to speed up training. To specify options for
parallel training, use the ParallelizationOptions property.
When UseParallel is true, then for DQN, DDPG, TD3, and SAC agents the NumStepsToLookAhead property of the corresponding agent options object must be set to 1; otherwise an error is generated. This guarantees that experiences are stored contiguously. When AC agents are trained in parallel, a warning is generated if the StepsUntilDataIsSent property of the ParallelizationOptions object is set to a different value than the NumStepsToLookAhead property of the AC agent options object.
Note that if you want to speed up deep neural network calculations (such as gradient computation,
parameter update and prediction) using a local GPU, you do not need to set UseParallel to true.
Instead, when creating your actor or critic representation, use an rlRepresentationOptions
object in which the UseDevice option is set to "gpu". Using parallel computing or the GPU requires
Parallel Computing Toolbox software. Using computer clusters or cloud resources additionally
requires MATLAB Parallel Server. For more information about training using multicore processors
and GPUs, see “Train Agents Using Parallel Computing and GPUs”.
Example: 'UseParallel',true
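For example, a minimal sketch of the GPU alternative described in the preceding paragraph:
repOpts = rlRepresentationOptions(UseDevice="gpu");
% Pass repOpts when creating the actor or critic representation.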
The ParallelTraining object has the following properties, which you can modify using dot
notation after creating the rlTrainingOptions object.
• "sync" — Use parpool to run synchronous training on the available workers. In this case,
workers pause execution until all workers are finished. The host updates the actor and critic
parameters based on the results from all the workers and sends the updated parameters to all
workers. Note that synchronous training is required for gradient-based parallelization; that is, when DataToSendFromWorkers is set to "gradients", Mode must be set to "sync".
• "async" — Use parpool to run asynchronous training on the available workers. In this case,
workers send their data back to the host as soon as they finish and receive updated parameters
from the host. The workers then continue with their task.
Random seed initialization for the workers, specified as one of the following:
• –1 — Assign a unique random seed to each worker. The value of the seed is the worker ID.
• –2 — Do not assign a random seed to the workers.
• Vector — Manually specify the random seed for each worker. The number of elements in the
vector must match the number of workers.
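For example, a sketch assigning explicit seeds to an assumed pool of four workers:
trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.WorkerRandomSeeds = [1 2 3 4];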
Option to send model and workspace variables to parallel workers, specified as "on" or "off". When
the option is "on", the host sends variables used in models and defined in the base MATLAB
workspace to the workers.
Additional files to attach to the parallel pool, specified as a string or string array.
Function to run before training starts, specified as a handle to a function having no input arguments.
This function is run once per worker before training begins. Write this function to perform any
processing that you need prior to training.
Function to run after training ends, specified as a handle to a function having no input arguments.
You can write this function to clean up the workspace or perform other processing after training
terminates.
Display training progress on the command line, specified as the logical values false (0) or true (1).
Set to true to write information from each training episode to the MATLAB command line during
training.
Option to stop training when an error occurs during an episode, specified as "on" or "off". When
this option is "off", errors are captured and returned in the SimulationInfo output of train, and
training continues to the next episode.
Object Functions
train Train reinforcement learning agents within a specified environment
Examples
Create an options set for training a reinforcement learning agent. Set the maximum number of
episodes and the maximum number of steps per episode to 1000. Configure the options to stop
training when the average reward equals or exceeds 480, and turn on both the command-line display
and Reinforcement Learning Episode Manager for displaying training results. You can set the options
using name-value pair arguments when you create the options set. Any options that you do not
explicitly set have their default values.
trainOpts = rlTrainingOptions(...
'MaxEpisodes',1000,...
'MaxStepsPerEpisode',1000,...
'StopTrainingCriteria',"AverageReward",...
'StopTrainingValue',480,...
'Verbose',true,...
'Plots',"training-progress")
trainOpts =
rlTrainingOptions with properties:
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 5
StopTrainingCriteria: "AverageReward"
StopTrainingValue: 480
SaveAgentCriteria: "none"
SaveAgentValue: "none"
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelTraining]
Alternatively, create a default options set and use dot notation to change some of the values.
trainOpts = rlTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 1000;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 480;
trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";
trainOpts
trainOpts =
rlTrainingOptions with properties:
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 5
StopTrainingCriteria: "AverageReward"
StopTrainingValue: 480
SaveAgentCriteria: "none"
SaveAgentValue: "none"
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelTraining]
You can now use trainOpts as an input argument to the train command.
To turn on parallel computing for training a reinforcement learning agent, set the UseParallel
training option to true.
trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions
ans =
ParallelTraining with properties:
Mode: "async"
WorkerRandomSeeds: -1
TransferBaseWorkspaceVariables: "on"
AttachedFiles: []
SetupFcn: []
CleanupFcn: []
You can now use trainOpts as an input argument to the train command to perform training with
parallel computing.
To train an agent using the asynchronous advantage actor-critic (A3C) method, you must set the
agent and parallel training options appropriately.
When creating the AC agent, set the NumStepsToLookAhead value to be greater than 1. Common
values are 64 and 128.
agentOpts = rlACAgentOptions(NumStepsToLookAhead=64);
Use agentOpts when creating your agent. Alternatively, create your agent first and then modify its
options, including the actor and critic options later using dot notation.
trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.Mode = "async";
Configure the workers to return gradient data to the host. Also, set the number of steps before the
workers send data back to the host to match the number of steps to look ahead.
trainOpts.ParallelizationOptions.DataToSendFromWorkers = ...
"gradients";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = ...
agentOpts.NumStepsToLookAhead;
For an example on asynchronous advantage actor-critic agent training, see “Train AC Agent to
Balance Cart-Pole System Using Parallel Computing”.
Version History
Introduced in R2019a
See Also
train | rlMultiAgentTrainingOptions
Topics
“Train Reinforcement Learning Agents”
rlTRPOAgent
Trust region policy optimization reinforcement learning agent
Description
Trust region policy optimization (TRPO) is a model-free, online, on-policy, policy gradient
reinforcement learning method. This algorithm prevents significant performance drops compared to
standard policy gradient methods by keeping the updated policy within a trust region close to the
current policy. The action space can be either discrete or continuous.
For more information on TRPO agents, see “Trust Region Policy Optimization Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.
Creation
Syntax
agent = rlTRPOAgent(observationInfo,actionInfo)
agent = rlTRPOAgent(observationInfo,actionInfo,initOpts)
agent = rlTRPOAgent(actor,critic)
Description
agent = rlTRPOAgent(actor,critic) creates a TRPO agent with the specified actor and critic,
using the default options for the agent.
agent = rlTRPOAgent( ___ ,agentOptions) creates a TRPO agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
Input Arguments
actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object
Actor that implements the policy, specified as an rlDiscreteCategoricalActor object (for discrete action spaces) or an rlContinuousGaussianActor object (for continuous action spaces). For more information on creating actor approximators, see "Create Policies and Value Functions".
critic — Critic
rlValueFunction object
Critic that estimates the discounted long-term reward, specified as an rlValueFunction object. For
more information on creating critic approximators, see “Create Policies and Value Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.
For a continuous action space, you must specify actionInfo as an rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
Option to use exploration policy when selecting actions, specified as one of the following logical values.
• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.
Note This option affects only simulation and deployment; it does not affect training.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
Examples
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.
% rng(0)
Create a TRPO agent from the environment observation and action specifications.
agent = rlTRPOAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create an agent initialization options object, specifying that each hidden fully connected layer in the
network must have 128 neurons.
initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.
% rng(0)
Create a TRPO agent from the environment observation and action specifications using the specified
initialization options.
agent = rlTRPOAgent(obsInfo,actInfo,initOpts);
Extract the deep neural networks from both the agent actor and critic.
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
You can verify that the networks have 128 units in their hidden fully connected layers. For example,
display the layers of the critic network.
criticNet.Layers
ans =
11x1 Layer array with layers:
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment interface, and obtain its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For TRPO agents, the critic estimates a value function, therefore it must take the observation signal
as input and return a scalar value. Create a deep neural network to be used as approximation model
within the critic. Define the network as an array of layer objects.
criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)
];
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'input' 4 features
Create the critic using criticNet. TRPO agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(criticNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.2479
To approximate the policy within the actor use a neural network. For TRPO agents, the actor executes
a stochastic policy, which for discrete action spaces is implemented by a discrete categorical actor. In
this case the approximator must take the observation signal as input and return a probability for each
action. Therefore the output layer must have as many elements as the number of possible actions.
Define the network as an array of layer objects, getting the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(200)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor using actorNet. TRPO agents use an rlDiscreteCategoricalActor object to implement the actor for discrete action spaces.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
agent = rlTRPOAgent(actor,critic)
agent =
rlTRPOAgent with properties:
Specify agent options, including training options for the actor and the critic.
agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent against the environment.
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”. The observation from the
environment is a vector containing the position and velocity of a mass. The action is a scalar
representing a force applied to the mass, ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Store these limits so you can easily access them later.
actInfo.LowerLimit=-2;
actInfo.UpperLimit=2;
The actor and critic networks are initialized randomly. You can ensure reproducibility by fixing the
seed of the random generator.
rng(0)
Create a deep neural network to be used as approximation model within the critic. For TRPO agents,
the critic estimates a value function, therefore it must take the observation signal as input and return
a scalar value.
criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)];
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'input' 2 features
Create the critic using criticNet. TRPO agents use an rlValueFunction object to implement the critic.
critic = rlValueFunction(criticNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.0899
To approximate the policy within the actor, use a neural network. For TRPO agents, the actor
executes a stochastic policy, which for continuous action spaces is implemented by a continuous
Gaussian actor. In this case the network must take the observation signal as input and return both a
mean value and a standard deviation value for each action. Therefore it must have two output layers
(one for the mean values the other for the standard deviation values), each having as many elements
as the dimension of the action space.
Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces, and the action range limits from the environment specification objects. Specify a name
for the input and output layers, so you can later explicitly associate them with the appropriate
environment channel.
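A minimal sketch of such a network, with layer names chosen to match the connection commands that follow (the layer sizes and exact path structure are assumptions, not the original listing):
% Sketch of the three network paths (layer sizes are assumptions); the
% layer names match the connectLayers commands below.
commonPath = [
    featureInputLayer(prod(obsInfo.Dimension),Name="comPathIn")
    fullyConnectedLayer(100)
    reluLayer
    fullyConnectedLayer(1,Name="comPathOut") ];
meanPath = [
    fullyConnectedLayer(32,Name="meanPathIn")
    reluLayer
    fullyConnectedLayer(prod(actInfo.Dimension))
    tanhLayer
    scalingLayer(Name="meanPathOut",Scale=actInfo.UpperLimit) ];
stdPath = [
    fullyConnectedLayer(32,Name="stdPathIn")
    reluLayer
    fullyConnectedLayer(prod(actInfo.Dimension))
    softplusLayer(Name="stdPathOut") ];
% Assemble the layer graph before connecting the paths
actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);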
% Connect paths
actorNet = connectLayers(actorNet,"comPathOut","meanPathIn/in");
actorNet = connectLayers(actorNet,"comPathOut","stdPathIn/in");
% Plot network
plot(actorNet)
% Convert the layer graph to a dlnetwork object and display a summary
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'comPathIn' 2 features
Create the actor using actorNet. TRPO agents use an rlContinuousGaussianActor object to
implement the actor for continuous action spaces.
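A sketch of the construction call, assuming the layer names used above and the documented name-value arguments of rlContinuousGaussianActor:
actor = rlContinuousGaussianActor(actorNet,obsInfo,actInfo, ...
    ObservationInputNames="comPathIn", ...
    ActionMeanOutputNames="meanPathOut", ...
    ActionStandardDeviationOutputNames="stdPathOut");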
getAction(actor,{rand(obsInfo.Dimension)})
agent = rlTRPOAgent(actor,critic)
agent =
rlTRPOAgent with properties:
Specify agent options, including training options for the actor and the critic.
agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Tips
• For continuous action spaces, this agent does not enforce the constraints set by the action
specification. In this case, you must enforce action space constraints within the environment.
• While tuning the learning rate of the actor network is necessary for PPO agents, it is not
necessary for TRPO agents.
• For high-dimensional observations, such as for images, it is recommended to use PPO, SAC, or
TD3 agents.
Version History
Introduced in R2021b
See Also
rlTRPOAgentOptions | rlValueFunction | rlDiscreteCategoricalActor |
rlContinuousGaussianActor | Deep Network Designer
Topics
“Trust Region Policy Optimization Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlTRPOAgentOptions
Options for TRPO agent
Description
Use an rlTRPOAgentOptions object to specify options for trust region policy optimization (TRPO)
agents. To create a TRPO agent, use rlTRPOAgent.
For more information on TRPO agents, see “Trust Region Policy Optimization Agents”.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlTRPOAgentOptions
opt = rlTRPOAgentOptions(Name,Value)
Description
Properties
ExperienceHorizon — Number of steps the agent interacts with the environment before
learning
512 (default) | positive integer
Number of steps the agent interacts with the environment before learning from its experience,
specified as a positive integer.
The ExperienceHorizon value must be greater than or equal to the MiniBatchSize value.
Mini-batch size used for each learning epoch, specified as a positive integer. When the agent uses a
recurrent neural network, MiniBatchSize is treated as the training trajectory length.
The MiniBatchSize value must be less than or equal to the ExperienceHorizon value.
Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.
When gradients are computed during training, an additional gradient component is computed for
minimizing the entropy loss. For more information, see “Entropy Loss”.
Number of epochs for which the actor and critic networks learn from the current experience set,
specified as a positive integer.
Method used to estimate advantage values, specified as "gae" (generalized advantage estimator) or "finite-horizon". For more information on these methods, see the training algorithm information in "Proximal Policy Optimization Agents".
Smoothing factor for generalized advantage estimator, specified as a scalar value between 0 and 1, inclusive. This option applies only when the AdvantageEstimateMethod option is "gae".
Conjugate gradient damping factor for numerical stability, specified as a nonnegative scalar.
Upper limit for the Kullback-Leibler (KL) divergence between the old policy and the current policy,
specified as a positive scalar.
Maximum number of iterations for conjugate gradient descent, specified as a positive integer.
Conjugate gradient residual tolerance, specified as a positive scalar. Once the residual for the
conjugate gradient algorithm is below this tolerance, the algorithm stops.
Method for normalizing advantage function values, specified as "none" (do not normalize), "current" (normalize using the current mini-batch of experiences), or "moving" (normalize using a moving window of recent experiences).
In some environments, you can improve agent performance by normalizing the advantage function
during training. The agent normalizes the advantage function by subtracting the mean advantage
value and scaling by the standard deviation.
Window size for normalizing advantage function values, specified as a positive integer. Use this
option when the NormalizedAdvantageMethod option is "moving".
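For example, a sketch enabling moving-window normalization (the window size is an assumption):
opt = rlTRPOAgentOptions;
opt.NormalizedAdvantageMethod = "moving";
opt.AdvantageNormalizingWindow = 500000;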
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlTRPOAgent Trust region policy optimization reinforcement learning agent
Examples
Create an rlTRPOAgentOptions object, specifying the discount factor.
opt = rlTRPOAgentOptions('DiscountFactor',0.9)
opt =
rlTRPOAgentOptions with properties:
ExperienceHorizon: 512
MiniBatchSize: 128
EntropyLossWeight: 0.0100
NumEpoch: 3
AdvantageEstimateMethod: "gae"
GAEFactor: 0.9500
ConjugateGradientDamping: 0.1000
KLDivergenceLimit: 0.0100
NumIterationsConjugateGradient: 10
NumIterationsLineSearch: 10
ConjugateGradientResidualTolerance: 1.0000e-08
NormalizedAdvantageMethod: "none"
AdvantageNormalizingWindow: 1000000
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 1
DiscountFactor: 0.9000
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.1.
opt.SampleTime = 0.1;
Version History
Introduced in R2021b
The UseDeterministicExploitation agent option is no longer recommended; use the agent UseExplorationPolicy property instead. Previously, you set UseDeterministicExploitation as follows.
• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.
agent.AgentOptions.UseDeterministicExploitation = true;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.
agent.AgentOptions.UseDeterministicExploitation = false;
Now, set the UseExplorationPolicy property of the agent instead.
• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.
agent.UseExplorationPolicy = false;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.
agent.UseExplorationPolicy = true;
See Also
Topics
“Trust Region Policy Optimization Agents”
rlValueFunction
Value function approximator object for reinforcement learning agents
Description
This object implements a value function approximator object that you can use as a critic for a
reinforcement learning agent. A value function maps an environment state to a scalar value. The
output represents the predicted discounted cumulative long-term reward when the agent starts from
the given state and takes the best possible action. After you create an rlValueFunction critic, use
it to create an agent such as an rlACAgent, rlPGAgent, or rlPPOAgent agent. For an example of
this workflow, see “Create Actor and Critic Representations” on page 3-412. For more information on
creating value functions, see “Create Policies and Value Functions”.
Creation
Syntax
critic = rlValueFunction(net,observationInfo)
critic = rlValueFunction(net,ObservationInputNames=netObsNames)
critic = rlValueFunction(tab,observationInfo)
critic = rlValueFunction({basisFcn,W0},observationInfo)
Description
critic = rlValueFunction({basisFcn,W0},observationInfo) creates the value function object critic whose underlying approximator is a custom basis function. The first input argument is a two-element cell array whose first element is the handle basisFcn to a custom basis function and
whose second element is the initial weight vector W0. The function sets the ObservationInfo
property of critic to the observationInfo input argument.
Input Arguments
Deep neural network used as the underlying approximator within the critic, specified as one of the
following:
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.
The network must have the environment observation channels as inputs and a single scalar as output.
The learnable parameters of the critic are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of character vectors. When you use this argument, the function assigns, in sequential order, each environment observation channel specified in observationInfo to each network input layer specified by the
corresponding name in the string array netObsNames. Therefore, the network input layers, ordered
as the names in netObsNames, must have the same data type and dimensions as the observation
specifications, as ordered in observationInfo.
Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: {"NetInput1_airspeed","NetInput2_altitude"}
Value table, specified as an rlTable object containing a column vector with length equal to the
number of possible observations from the environment. Each element is the predicted discounted
cumulative long-term reward when the agent starts from the given observation and takes the best
possible action. The elements of this vector are the learnable parameters of the representation.
Custom basis function, specified as a function handle to a user-defined function. The user defined
function can either be an anonymous function or a function on the MATLAB path. The output of the
critic is the scalar c = W'*B, where W is a weight vector containing the learnable parameters and B
is the column vector returned by the custom basis function.
Your basis function must have the following signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo.
For an example on how to use a basis function to create a value function critic with a mixed
continuous and discrete observation space, see “Create Mixed Observation Space Value Function
Critic from Custom Basis Function” on page 3-407.
Example: @(obs1,obs2,obs3) [obs3(1)*obs1(1)^2; abs(obs2(5)+obs1(2))]
Initial value of the basis function weights W, specified as a column vector having the same length as
the vector returned by the basis function.
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors. These errors can produce different results compared to performing the same operations on a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and
actions
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);
Create a deep neural network to approximate the value function within the critic, as a column vector
of layer objects. The network input layer must accept a four-element vector (the observation vector
defined by obsInfo), and the output must be a scalar (the value, representing the expected
cumulative long-term reward when the agent starts from the given observation).
You can also obtain the number of observations from the obsInfo specification (regardless of
whether the observation space is a column vector, row vector, or matrix,
prod(obsInfo.Dimension) is its total number of dimensions, in this case equal to 4).
net = [ featureInputLayer(prod(obsInfo.Dimension));
fullyConnectedLayer(10);
reluLayer;
fullyConnectedLayer(1,Name="value")];
dlnet = dlnetwork(net);
You can plot the network using plot and display its main characteristics, like the number of weights,
using summary.
plot(dlnet)
summary(dlnet)
Initialized: true
Number of learnables: 61
Inputs:
1 'input' 4 features
Create the critic using the network and the observation specification object. When you use this syntax
the network input layer is automatically associated with the environment observation according to
the dimension specifications in obsInfo.
critic = rlValueFunction(dlnet,obsInfo)
critic =
rlValueFunction with properties:
To check your critic, use getValue to return the value of a random observation, using the current
network weights.
v = getValue(critic,{rand(obsInfo.Dimension)})
v = single
0.5196
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).
Create an actor and a critic that you can use to define a reinforcement learning agent such as an
Actor Critic (AC) agent. For this example, create actor and critic for an agent that can be trained
against the cart-pole environment described in “Train AC Agent to Balance Cart-Pole System”.
First, create the environment. Then, extract the observation and action specifications from the
environment. You need these specifications to define the agent and critic.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
A state-value-function critic, such as those used for AC or PG agents, has the current observation as
input and the state value, a scalar, as output. For this example, to approximate the value function
within the critic, create a deep neural network with one output (the value) and four inputs (the
environment observation signals x, xdot, theta, and thetadot).
Create the network as a column vector of layer objects. You can obtain the number of observations
from the obsInfo specification (regardless of whether the observation space is a column vector, row
vector, or matrix, prod(obsInfo.Dimension) is its total number of dimensions). Name the network
input layer criticNetInput.
criticNetwork = [
featureInputLayer(prod(obsInfo.Dimension),...
Name="criticNetInput");
fullyConnectedLayer(10);
reluLayer;
fullyConnectedLayer(1,Name="CriticFC")];
criticNetwork = dlnetwork(criticNetwork);
summary(criticNetwork)
Initialized: true
Number of learnables: 61
Inputs:
1 'criticNetInput' 4 features
Create the critic using the specified neural network. Also, specify the observation information for the critic. Set the observation input name to criticNetInput, which is the name of the criticNetwork input layer.
critic = rlValueFunction(criticNetwork,obsInfo,...
ObservationInputNames={'criticNetInput'})
critic =
rlValueFunction with properties:
Check your critic using getValue to return the value of a random observation, given the current
network weights.
v = getValue(critic,{rand(obsInfo.Dimension)})
v = single
0.5196
Specify the critic optimization options using rlOptimizerOptions. These options control the learning of the critic network parameters. For this example, set the learning rate to 0.01 and the gradient threshold to 1.
criticOpts = rlOptimizerOptions(LearnRate=0.01,GradientThreshold=1);
An AC agent decides which action to take given observations using a policy which is represented by
an actor. For an actor, the inputs are the environment observations, and the output depends on
whether the action space is discrete or continuous. The actor in this example has two possible
discrete actions, –10 or 10. To create the actor, use a deep neural network that can output these two
values given the same observation input as the critic.
Create the network using a row vector of two layer objects. You can obtain the number of actions
from the actInfo specification. Name the network output actorNetOutput.
actorNetwork = [
featureInputLayer( ...
prod(obsInfo.Dimension),...
Name="actorNetInput")
fullyConnectedLayer( ...
numel(actInfo.Elements), ...
Name="actorNetOutput")];
actorNetwork = dlnetwork(actorNetwork);
summary(actorNetwork)
Initialized: true
Number of learnables: 10
Inputs:
1 'actorNetInput' 4 features
Create the actor using rlDiscreteCategoricalActor together with the observation and action
specifications, and the name of the network input layer to be associated with the environment
observation channel.
actor = rlDiscreteCategoricalActor(actorNetwork,obsInfo,actInfo,...
ObservationInputNames={'actorNetInput'})
actor =
rlDiscreteCategoricalActor with properties:
To check your actor, use getAction to return a random action from a given observation, using the
current network weights.
a = getAction(actor,{rand(obsInfo.Dimension)})
Specify the actor optimization options using rlOptimizerOptions. These options control the learning of the actor network parameters. For this example, set the learning rate to 0.05 and the gradient threshold to 1.
actorOpts = rlOptimizerOptions(LearnRate=0.05,GradientThreshold=1);
Create an AC agent using the actor and critic. Use the optimizer options objects previously created
for both actor and critic.
agentOpts = rlACAgentOptions(...
NumStepsToLookAhead=32,...
DiscountFactor=0.99,...
CriticOptimizerOptions=criticOpts,...
ActorOptimizerOptions=actorOpts);
agent = rlACAgent(actor,critic,agentOpts)
agent =
rlACAgent with properties:
To check your agent, use getAction to return a random action from a given observation, using the
current actor and critic network weights.
act = getAction(agent,{rand(obsInfo.Dimension)})
For additional examples showing how to create actors and critics for different agent types, see:
Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set consisting of four possible values: 1, 3, 4 and 7.
obsInfo = rlFiniteSetSpec([1 3 4 7]);
vTable = rlTable(obsInfo);
The table is a column vector in which each entry stores the predicted cumulative long-term reward
for each possible observation as defined by obsInfo. You can access the table using the Table
property of the vTable object. The initial value of each element is zero.
vTable.Table
ans = 4×1
0
0
0
0
You can also initialize the table to any value, in this case, an array containing all the integers from 1
to 4.
vTable.Table = reshape(1:4,4,1)
vTable =
rlTable with properties:
Create the critic using the table and the observation specification object.
critic = rlValueFunction(vTable,obsInfo)
critic =
rlValueFunction with properties:
To check your critic, use the getValue function to return the value of a given observation, using the
current table entries.
v = getValue(critic,{7})
v = 4
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.
The output of the critic is the scalar W'*myBasisFcn(myobs), where W is a weight column vector
which must have the same size as the custom basis function output. This output is the expected
cumulative long term reward when the agent starts from the given observation and takes the best
possible action. The elements of W are the learnable parameters.
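For reference, one observation specification and basis function consistent with the weight vector and the value printed below (the exact form is an assumption):
obsInfo = rlNumericSpec([4 1]);
% Assumed basis function; returns a three-element column vector
myBasisFcn = @(myobs) [myobs(2)^2; myobs(3)+exp(myobs(1)); abs(myobs(4))];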
W0 = [3;5;2];
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second argument is the observation specification object.
critic = rlValueFunction({myBasisFcn,W0},obsInfo)
critic =
rlValueFunction with properties:
To check your critic, use the getValue function to return the value of a given observation, using the
current parameter vector.
v = getValue(critic,{[2 4 6 8]'})
v = 130.9453
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
To approximate the value function within the critic, create a recurrent deep neural network as a row vector of layer objects. Use a sequenceInputLayer as the input layer (obsInfo.Dimension(1) is the dimension of the observation space) and include at least one lstmLayer.
myNet = [
sequenceInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(8, Name="fc")
reluLayer(Name="relu")
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(1,Name="output")];
dlCriticNet = dlnetwork(myNet);
summary(dlCriticNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
critic = rlValueFunction(dlCriticNet,obsInfo)
critic =
rlValueFunction with properties:
To check your critic, use the getValue function to return the value of a random observation, using
the current network weights.
v = getValue(critic,{rand(obsInfo.Dimension)})
v = single
0.0017
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).
Create Mixed Observation Space Value Function Critic from Custom Basis Function
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations (in
this case a single number) defined by obsInfo.
The output of the critic is the scalar W'*myBasisFcn(obsA,obsB), where W is a weight column
vector which must have the same size of the custom basis function output. This output is the expected
cumulative long term reward when the agent starts from the given observation and takes the best
possible action. The elements of W are the learnable parameters.
W0 = ones(4,1);
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second argument is the observation specification object.
critic = rlValueFunction({myBasisFcn,W0},obsInfo)
critic =
rlValueFunction with properties:
To check your critic, use the getValue function to return the value of a given observation, using the
current parameter vector.
v = 60
Note that the critic does not enforce the set constraint for the discrete set element.
v = 12
You can now use the critic (along with an actor) to create an agent relying on a discrete value function critic (such as rlACAgent or rlPGAgent).
Version History
Introduced in R2022a
See Also
Functions
rlQValueFunction | rlVectorQValueFunction | rlTable | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlValueRepresentation
(Not recommended) Value function critic representation for reinforcement learning agents
Description
This object implements a value function approximator to be used as a critic within a reinforcement
learning agent. A value function is a function that maps an observation to a scalar value. The output
represents the expected total long-term reward when the agent starts from the given observation and
takes the best possible action. Value function critics therefore only need observations (but not
actions) as inputs. After you create an rlValueRepresentation critic, use it to create an agent
relying on a value function critic, such as an rlACAgent, rlPGAgent, or rlPPOAgent. For an
example of this workflow, see “Create Actor and Critic Representations” on page 3-412. For more
information on creating representations, see “Create Policies and Value Functions”.
Creation
Syntax
critic = rlValueRepresentation(net,observationInfo,'Observation',obsName)
critic = rlValueRepresentation(tab,observationInfo)
critic = rlValueRepresentation({basisFcn,W0},observationInfo)
critic = rlValueRepresentation( ___ ,options)
Description
critic = rlValueRepresentation(net,observationInfo,'Observation',obsName)
creates the value function based critic from the deep neural network net. This syntax sets the
ObservationInfo property of critic to the input observationInfo. obsName must contain the
names of the input layers of net.
critic = rlValueRepresentation( ___ ,options) creates the value function based critic
using the additional option set options, which is an rlRepresentationOptions object. This
syntax sets the Options property of critic to the options input argument. You can use this syntax
with any of the previous input-argument combinations.
Input Arguments
Deep neural network used as the underlying approximator within the critic, specified as one of the
following:
The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names listed in obsName.
For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the input layers in net. These network layers must be in the same order and
with the same data type and dimensions as the signals defined in ObservationInfo.
Example: {'my_obs'}
Value table, specified as an rlTable object containing a column vector with length equal to the number of possible observations. Element i is the expected cumulative long-term reward when the agent starts from the i-th observation and takes the best possible action. The elements of this vector are the learnable parameters of the representation.
Custom basis function, specified as a function handle to a user-defined function. The user defined
function can either be an anonymous function or a function on the MATLAB path. The output of the
critic is c = W'*B, where W is a weight vector and B is the column vector returned by the custom
basis function. c is the expected cumulative long term reward when the agent starts from the given
observation and takes the best possible action. The learnable parameters of this representation are
the elements of W.
When creating a value function critic representation, your basis function must have the following
signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here obs1 to obsN are observations in the same order and with the same data type and dimensions
as the signals defined in ObservationInfo.
Example: @(obs1,obs2,obs3) [obs3(1)*obs1(1)^2; abs(obs2(5)+obs1(2))]
Initial value of the basis function weights, W, specified as a column vector having the same length as
the vector returned by the basis function.
Properties
Options — Representation options
rlRepresentationOptions object
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and actions
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing 4
doubles.
obsInfo = rlNumericSpec([4 1]);
Create a deep neural network to approximate the value function within the critic. The input of the
network (here called myobs) must accept a four-element vector (the observation vector defined by
obsInfo), and the output must be a scalar (the value, representing the expected cumulative long-
term reward when the agent starts from the given observation).
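The network definition itself is not shown in this example. The following is a minimal sketch (the hidden layer size is an assumption); the input layer is named 'myobs' to match the observation name used when creating the critic below.
% Sketch of a value network with input layer 'myobs' (layer sizes assumed)
net = [featureInputLayer(4,'Normalization','none','Name','myobs')
    fullyConnectedLayer(8,'Name','fc')
    reluLayer('Name','relu')
    fullyConnectedLayer(1,'Name','value')];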
Create the critic using the network, observation specification object, and name of the network input
layer.
critic = rlValueRepresentation(net,obsInfo,'Observation',{'myobs'})
critic =
rlValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a random observation, using
the current network weights.
v = getValue(critic,{rand(4,1)})
v = single
0.7904
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).
Create an actor representation and a critic representation that you can use to define a reinforcement
learning agent such as an Actor Critic (AC) agent.
For this example, create actor and critic representations for an agent that can be trained against the
cart-pole environment described in “Train AC Agent to Balance Cart-Pole System”. First, create the
environment. Then, extract the observation and action specifications from the environment. You need
these specifications to define the agent and critic representations.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For a state-value-function critic such as those used for AC or PG agents, the inputs are the
observations and the output should be a scalar value, the state value. For this example, create the
critic representation using a deep neural network with one output, and with observation signals
corresponding to x, xdot, theta, and thetadot as described in “Train AC Agent to Balance Cart-
Pole System”. You can obtain the number of observations from the obsInfo specification. Name the
network layer input 'observation'.
numObservation = obsInfo.Dimension(1);
criticNetwork = [
featureInputLayer(numObservation,'Normalization','none','Name','observation')
fullyConnectedLayer(1,'Name','CriticFC')];
Specify options for the critic representation using rlRepresentationOptions. These options
control the learning of the critic network parameters. For this example, set the learning rate to 0.05
and the gradient threshold to 1.
repOpts = rlRepresentationOptions('LearnRate',5e-2,'GradientThreshold',1);
Create the critic representation using the specified neural network and options. Also, specify the
action and observation information for the critic. Set the observation name to 'observation',
which is the name of the criticNetwork input layer.
critic = rlValueRepresentation(criticNetwork,obsInfo,'Observation',{'observation'},repOpts)
critic =
rlValueRepresentation with properties:
Similarly, create a network for the actor. An AC agent decides which action to take given observations
using an actor representation. For an actor, the inputs are the observations, and the output depends
on whether the action space is discrete or continuous. For the actor of this example, there are two
possible discrete actions, –10 or 10. To create the actor, use a deep neural network with the same
observation input as the critic, that can output these two values. You can obtain the number of
actions from the actInfo specification. Name the output 'action'.
numAction = numel(actInfo.Elements);
actorNetwork = [
featureInputLayer(numObservation,'Normalization','none','Name','observation')
fullyConnectedLayer(numAction,'Name','action')];
Create the actor representation using the observation name and specification and the same
representation options.
actor = rlStochasticActorRepresentation(actorNetwork,obsInfo,actInfo,...
'Observation',{'observation'},repOpts)
actor =
rlStochasticActorRepresentation with properties:
agentOpts = rlACAgentOptions(...
'NumStepsToLookAhead',32,...
'DiscountFactor',0.99);
agent = rlACAgent(actor,critic,agentOpts)
agent =
rlACAgent with properties:
For additional examples showing how to create actor and critic representations for different agent
types, see:
Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set consisting of 4 possible values.
obsInfo = rlFiniteSetSpec([1 3 5 7]);
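The command that creates the value table is not shown here; a minimal sketch using rlTable, which creates one table row per possible observation, is the following.
% Create the value table from the observation specification (implied by the text below)
vTable = rlTable(obsInfo);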
The table is a column vector in which each entry stores the expected cumulative long-term reward for
each possible observation as defined by obsInfo. You can access the table using the Table property
of the vTable object. The initial value of each element is zero.
vTable.Table
ans = 4×1
0
0
0
0
You can also initialize the table to any value, in this case, an array containing all the integers from 1
to 4.
vTable.Table = reshape(1:4,4,1)
vTable =
rlTable with properties:
Create the critic using the table and the observation specification object.
critic = rlValueRepresentation(vTable,obsInfo)
critic =
rlValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a given observation, using the
current table entries.
v = getValue(critic,{7})
v = 4
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing 4
doubles.
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.
The output of the critic is the scalar W'*myBasisFcn(myobs), where W is a weight column vector
which must have the same size of the custom basis function output. This output is the expected
cumulative long term reward when the agent starts from the given observation and takes the best
possible action. The elements of W are the learnable parameters.
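The basis function definition is not shown in this example. The following anonymous function is consistent with the three-element initial weight vector W0 and with the value displayed at the end of this example.
% Custom basis function returning a three-element column vector
myBasisFcn = @(myobs) [myobs(2)^2; myobs(3)+exp(myobs(1)); abs(myobs(4))];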
W0 = [3;5;2];
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second argument is the observation specification object.
critic = rlValueRepresentation({myBasisFcn,W0},obsInfo)
critic =
rlValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a given observation, using the
current parameter vector.
v = getValue(critic,{[2 4 6 8]'})
v =
1x1 dlarray
130.9453
You can now use the critic (along with an actor) to create an agent relying on a value function critic (such as rlACAgent or rlPGAgent).
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);
Create a recurrent deep neural network for the critic. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.
criticNetwork = [
sequenceInputLayer(numObs,'Normalization','none','Name','state')
fullyConnectedLayer(8, 'Name','fc')
reluLayer('Name','relu')
lstmLayer(8,'OutputMode','sequence','Name','lstm')
fullyConnectedLayer(1,'Name','output')];
criticOptions = rlRepresentationOptions('LearnRate',1e-2,'GradientThreshold',1);
critic = rlValueRepresentation(criticNetwork,obsInfo,...
'Observation','state',criticOptions);
Version History
Introduced in R2020a
The following table shows some typical uses of rlValueRepresentation, and how to update your
code with rlValueFunction instead. Each table entry is related to different approximator objects,
the first one uses a neural network, the second one uses a table, the third one uses a basis function.
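As a minimal sketch of the conversion, assuming a network whose observation input layer is named 'obs' and the same obsInfo, tab, basisFcn, and W0 as before, typical calls map as follows.
% Not recommended: rlValueRepresentation
critic = rlValueRepresentation(net,obsInfo,'Observation',{'obs'});
critic = rlValueRepresentation(tab,obsInfo);
critic = rlValueRepresentation({basisFcn,W0},obsInfo);

% Recommended: rlValueFunction
critic = rlValueFunction(net,obsInfo,ObservationInputNames="obs");
critic = rlValueFunction(tab,obsInfo);
critic = rlValueFunction({basisFcn,W0},obsInfo);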
See Also
Functions
rlValueFunction | getActionInfo | getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
3-417
3 Objects
rlVectorQValueFunction
Vector Q-value function approximator for reinforcement learning agents
Description
This object implements a vector Q-value function approximator that you can use as a critic with a
discrete action space for a reinforcement learning agent. A vector Q-value function is a function that
maps an environment state to a vector in which each element represents the predicted discounted
cumulative long-term reward when the agent starts from the given state and executes the action
corresponding to the element number. A Q-value function critic therefore needs only the environment
state as input. After you create an rlVectorQValueFunction critic, use it to create an agent such
as rlQAgent, rlDQNAgent, rlSARSAAgent, rlDDPGAgent, or rlTD3Agent. For more information
on creating representations, see “Create Policies and Value Functions”.
Creation
Syntax
critic = rlVectorQValueFunction(net,observationInfo,actionInfo)
critic = rlVectorQValueFunction(net,observationInfo,actionInfo,ObservationInputNames=netObsNames)
critic = rlVectorQValueFunction({basisFcn,W0},observationInfo,actionInfo)
Description
critic = rlVectorQValueFunction(net,observationInfo,actionInfo,ObservationInputNames=netObsNames) specifies the names of the network input layers to be associated with the
environment observation channels. The function assigns, in sequential order, each environment
observation channel specified in observationInfo to the layer specified by the corresponding name
in the string array netObsNames. Therefore, the network input layers, ordered as the names in
netObsNames, must have the same data type and dimensions as the observation channels, as ordered
in observationInfo.
critic = rlVectorQValueFunction({basisFcn,W0},observationInfo,actionInfo)
creates the multi-output Q-value function critic with a discrete action space using a custom basis
function as underlying approximator. The first input argument is a two-element cell array whose first
element is the handle basisFcn to a custom basis function and whose second element is the initial
weight matrix W0. Here the basis function must have only the observations as inputs, and W0 must
have as many columns as the number of possible actions. The function sets the ObservationInfo and
ActionInfo properties of critic to the input arguments observationInfo and actionInfo,
respectively.
Input Arguments
Deep neural network used as the underlying approximator within the critic, specified as one of the following:
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.
The network must have only the observation channels as inputs and a single output layer having as
many elements as the number of possible discrete actions. Each element of the output vector
approximates the value of executing the corresponding action starting from the currently observed
state.
The learnable parameters of the critic are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of character vectors. When you use this argument, the function assigns, in sequential order, each environment observation channel specified in observationInfo to the network input layer with the corresponding name in netObsNames.
Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: {"NetInput1_airspeed","NetInput2_altitude"}
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the critic is the vector c = W'*B, where W is a matrix containing the learnable parameters, and B
is the column vector returned by the custom basis function. Each element of c approximates the value of executing the corresponding action from the observed state.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
channels defined in observationInfo.
Example: @(obs1,obs2) [obs2(2)*obs1(1)^2; abs(obs2(5))]
Initial value of the basis function weights W, specified as a matrix having as many rows as the length
of the basis function output vector and as many columns as the number of possible actions.
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
ActionInfo — Action specifications
rlFiniteSetSpec object
Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the function
does not use the name of the action channel specified in actionInfo.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors. These errors can produce different results compared to performing the same operations on a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
Object Functions
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlQAgent Q-learning reinforcement learning agent
rlSARSAAgent SARSA reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and
actions
getMaxQValue Obtain maximum estimated value over all possible actions from a Q-value
function critic with discrete action space, given environment observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
Examples
This example shows how to create a vector Q-value function critic for a discrete action space using a
deep neural network approximator.
This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).
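The specification definitions themselves are not shown here; a minimal sketch consistent with the description is the following.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);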
To approximate the Q-value function within the critic, use a deep neural network. The input of the
network must accept a four-element vector, as defined by obsInfo. The output must be a single
output layer having as many elements as the number of possible discrete actions (three in this case,
as defined by actInfo).
net = [featureInputLayer(4)
fullyConnectedLayer(3)];
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 15
Inputs:
1 'input' 4 features
Create the critic using the network, as well as the observation and action specification objects. The
network input layers are automatically associated with the components of the observation signals
according to the dimension specifications in obsInfo.
critic = rlVectorQValueFunction(net,obsInfo,actInfo)
critic =
rlVectorQValueFunction with properties:
To check your critic, use getValue to return the values of a random observation, using the current
network weights. There is one value for each of the three possible actions.
v = getValue(critic,{rand(obsInfo.Dimension)})
v = 3×1
    0.7232
    0.8177
   -0.2212
You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).
Create Multi-Output Q-Value Function Critic from Deep Neural Network Specifying Layer
Names
A vector Q-value function critic takes only the observation as input and produces as output a vector
with as many elements as the possible actions. Each element represents the expected cumulative
long term reward when the agent starts from the given observation and takes the action
corresponding to the position of the element in the output vector.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).
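As in the previous example, the specification commands are not shown; a sketch consistent with the description is the following.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);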
To approximate the Q-value function within the critic, use a deep neural network. The input of the
network must accept a four-element vector, as defined by obsInfo. The output must be a single
output layer having as many elements as the number of possible discrete actions (three in this case,
as defined by actInfo).
Create the network as an array of layer objects. Name the network input netObsIn (so you can later
explicitly associate it with the observation input channel).
net = [featureInputLayer(4,Name="netObsIn")
fullyConnectedLayer(3,Name="value")];
Convert the network to a dlnetwork object and display the number of its learnable parameters.
net = dlnetwork(net)
net =
dlnetwork with properties:
summary(net)
Initialized: true
Number of learnables: 15
Inputs:
1 'netObsIn' 4 features
Create the critic using the network, the observations specification object, and the name of the
network input layer. The specified network input layer, netObsIn, is associated with the environment
observation, and therefore must have the same data type and dimension as the observation channel
specified in obsInfo.
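The creation command itself does not appear here; based on the syntax described earlier, it is presumably a call such as the following.
critic = rlVectorQValueFunction(net,obsInfo,actInfo, ...
    ObservationInputNames="netObsIn")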
critic =
rlVectorQValueFunction with properties:
To check your critic, use the getValue function to return the values of a random observation, using
the current network weights. The function returns one value for each of the three possible actions.
v = getValue(critic,{rand(obsInfo.Dimension)})
v = 3×1
    0.7232
    0.8177
   -0.2212
You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).
This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels: the first is a two-by-two continuous matrix and the second is a scalar that can assume only two values, 0 and 1.
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible vectors, [1 2], [3 4], and [5 6].
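The specification definitions are not shown here; a minimal sketch consistent with the description is the following.
obsInfo = [rlNumericSpec([2 2]) rlFiniteSetSpec([0 1])];
actInfo = rlFiniteSetSpec({[1 2],[3 4],[5 6]});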
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.
The output of the critic is the vector c = W'*myBasisFcn(obsA,obsB), where W is a weight matrix
which must have as many rows as the length of the basis function output and as many columns as the
number of possible actions.
Each element of c is the expected cumulative long term reward when the agent starts from the given
observation and takes the action corresponding to the position of the considered element. The
elements of W are the learnable parameters.
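The basis function definition is not shown here. The following is a hypothetical definition consistent with the two observation channels and the four-row weight matrix W0; because W0 is random, the values displayed below come from the original example and do not necessarily match this sketch.
% Hypothetical basis function of both observation channels (four-element output)
myBasisFcn = @(obsA,obsB) [obsA(1,1)+obsB; obsA(2,1)-obsB; obsA(1,2)^2; abs(obsA(2,2))];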
W0 = rand(4,3);
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.
critic = rlVectorQValueFunction({myBasisFcn,W0},obsInfo,actInfo)
critic =
rlVectorQValueFunction with properties:
To check your critic, use the getValue function to return the values of a random observation, using
the current parameter matrix. The function returns one value for each of the three possible actions.
v = getValue(critic,{rand(2,2),0})
v = 3×1
1.3192
0.8420
1.5053
Note that the critic does not enforce the set constraint for the discrete set elements.
v = getValue(critic,{rand(2,2),-1})
v = 3×1
2.7890
1.8375
3.0855
You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).
Version History
Introduced in R2022a
See Also
Functions
rlValueFunction | rlQValueFunction | getActionInfo | getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
3-426
scalingLayer
Scaling layer for actor or critic network
Description
A scaling layer linearly scales and biases an input array U, giving an output Y = Scale.*U + Bias.
You can incorporate this layer into the deep neural networks you define for actors or critics in
reinforcement learning agents. This layer is useful for scaling and shifting the outputs of nonlinear
layers, such as tanhLayer and sigmoidLayer.
For instance, a tanhLayer gives bounded output that falls between –1 and 1. If your actor network
output has different bounds (as defined in the actor specification), you can include a ScalingLayer
as an output to scale and shift the actor network output appropriately.
Creation
Syntax
sLayer = scalingLayer
sLayer = scalingLayer(Name,Value)
Description
sLayer = scalingLayer(Name,Value) sets properties on page 3-427 using name-value pairs. For
example, scalingLayer('Scale',0.5) creates a scaling layer that scales its input by 0.5. Enclose
each property name in quotes.
Properties
Name — Name of layer
'scaling' (default) | character vector
Name of layer, specified as a character vector. To include a layer in a layer graph, you must specify a
nonempty unique layer name. If you train a series network with this layer and Name is set to '', then
the software automatically assigns a name to the layer at training time.
Description of layer, specified as a character vector. When you create the scaling layer, you can use
this property to give it a description that helps you identify its purpose.
Element-wise scale on the input to the scaling layer, specified as one of the following:
• Scalar — Specify the same scale factor for all elements of the input array.
• Array with the same dimensions as the input array — Specify different scale factors for each
element of the input array.
The scaling layer takes an input U and generates the output Y = Scale.*U + Bias.
Element-wise bias on the input to the scaling layer, specified as one of the following:
• Scalar — Specify the same bias for all elements of the input array.
• Array with the same dimensions as the input array — Specify a different bias for each element of
the input array.
The scaling layer takes an input U and generates the output Y = Scale.*U + Bias.
Examples
Create a scaling layer that converts an input array U to the output array Y = 0.1.*U - 0.4.
sLayer = scalingLayer('Scale',0.1,'Bias',-0.4)
sLayer =
ScalingLayer with properties:
Name: 'scaling'
Scale: 0.1000
Bias: -0.4000
Learnable Parameters
No properties.
State Parameters
No properties.
Confirm that the scaling layer scales and offsets an input array as expected.
predict(sLayer,[10,20,30])
ans = 1×3
    0.6000    1.6000    2.6000
You can incorporate sLayer into an actor network or critic network for reinforcement learning.
Assume that the layer preceding the scalingLayer is a tanhLayer with three outputs aligned along the first dimension, and that you want to apply a different scale factor and bias to each output using a scalingLayer.
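The scale and bias vectors are not defined in the text above; the following values are consistent with the result displayed below.
% Per-element scale and bias for the three tanhLayer outputs
scale = [2.5; 0.4; 10];
bias  = [5; 0; -50];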
sLayer = scalingLayer('Scale',scale,'Bias',bias);
Confirm that the scaling layer applies the correct scale and bias values to an array with the expected
dimensions.
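The prediction call is likewise not shown; with the values above, applying the layer to a three-element input of tens reproduces the displayed result.
predict(sLayer,[10;10;10])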
ans = 3×1
30
4
50
Version History
Introduced in R2019a
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
See Also
quadraticLayer | softplusLayer
Topics
“Train DDPG Agent to Swing Up and Balance Pendulum”
“Create Policies and Value Functions”
3-429
3 Objects
SimulinkEnvWithAgent
Reinforcement learning environment with a dynamic model implemented in Simulink
Description
The SimulinkEnvWithAgent object represents a reinforcement learning environment that uses a
dynamic model implemented in Simulink. The environment object acts as an interface such that when
you call sim or train, these functions in turn call the Simulink model to generate experiences for
the agents.
Creation
To create a SimulinkEnvWithAgent object, use one of the following functions.
• rlSimulinkEnv — Create an environment using a Simulink model with at least one RL Agent
block.
• createIntegratedEnv — Use a reference model as a reinforcement learning environment.
• rlPredefinedEnv — Create a predefined reinforcement learning environment.
Properties
Model — Simulink model name
string | character vector
Simulink model name, specified as a string or character vector. The specified model must contain one
or more RL Agent blocks.
If Model contains a single RL Agent block for training, then AgentBlock is a string containing the
block path.
If Model contains multiple RL Agent blocks for training, then AgentBlock is a string array, where
each element contains the path of one agent block.
Model can contain RL Agent blocks whose path is not included in AgentBlock. Such agent blocks
behave as part of the environment and select actions based on their current policies. When you call
sim or train, the experiences of these agents are not returned and their policies are not updated.
The agent blocks can be inside of a model reference. For more information on configuring an agent
block for reinforcement learning, see RL Agent.
Reset behavior for the environment, specified as a function handle or anonymous function handle.
The function must have a single Simulink.SimulationInput input argument and a single
Simulink.SimulationInput output argument.
The reset function sets the initial state of the Simulink environment. For example, you can create a
reset function that randomizes certain block states such that each training episode begins from
different initial conditions.
If you have an existing reset function myResetFunction on the MATLAB path, set ResetFcn using a
handle to the function.
env.ResetFcn = @(in)myResetFunction(in);
If your reset behavior is simple, you can implement it using an anonymous function handle. For
example, the following code sets the variable x0 to a random value.
env.ResetFcn = @(in) setVariable(in,'x0',rand());
The sim function calls the reset function to reset the environment at the start of each simulation, and
the train function calls it at the start of each training episode.
Option to toggle fast restart, specified as either "on" or "off". Fast restart allows you to perform
iterative simulations without compiling a model or terminating the simulation each time.
For more information on fast restart, see “How Fast Restart Improves Iterative Simulations”
(Simulink).
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified environment
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
Examples
Create a Simulink environment using the trained agent and corresponding Simulink model from the
“Create Simulink Environment and Train Agent” example.
Create an environment for the rlwatertank model, which contains an RL Agent block. Since the
agent used by the block is already in the workspace, you do not need to pass the observation and
action specifications to create the environment.
env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent')
env =
SimulinkEnvWithAgent with properties:
Model : rlwatertank
AgentBlock : rlwatertank/RL Agent
ResetFcn : []
UseFastRestart : on
Validate the environment by performing a short simulation for two sample times.
validateEnvironment(env)
You can now train and simulate the agent within the environment by using train and sim,
respectively.
For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.
mdl = 'rlSimplePendulumModel';
open_system(mdl)
Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.
The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.
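The creation command is not shown here; given the properties displayed below, it is presumably the following.
obsInfo = rlNumericSpec([3 1])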
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: [0×0 string]
Description: [0×0 string]
Dimension: [3 1]
DataType: "double"
The action is a scalar expressing the torque and can be one of three possible values, -2 Nm, 0 Nm and
2 Nm.
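Again, the creation command is not shown; given the description, it is presumably the following.
actInfo = rlFiniteSetSpec([-2 0 2])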
actInfo =
rlFiniteSetSpec with properties:
You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';
Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
env.ResetFcn = @(in) setVariable(in,'theta0',randn,'Workspace',mdl)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on
Create an environment for the Simulink model from the example “Train Multiple Agents to Perform
Collaborative Task”.
Create an environment for the rlCollaborativeTask model, which has two agent blocks. Since the
agents used by the two blocks (agentA and agentB) are already in the workspace, you do not need
to pass their observation and action specifications to create the environment.
env = rlSimulinkEnv( ...
'rlCollaborativeTask', ...
["rlCollaborativeTask/Agent A","rlCollaborativeTask/Agent B"])
env =
SimulinkEnvWithAgent with properties:
Model : rlCollaborativeTask
AgentBlock : [
rlCollaborativeTask/Agent A
rlCollaborativeTask/Agent B
]
ResetFcn : []
UseFastRestart : on
You can now simulate or train the agents within the environment using sim or train, respectively.
This example shows how to use createIntegratedEnv to create an environment object starting from a Simulink model that implements the system with which the agent interacts. Such a system is often referred to as the plant, open-loop system, or reference system, while the whole (integrated) system including the agent is often referred to as the closed-loop system.
For this example, use the flying robot model described in “Train DDPG Agent to Control Flying
Robot” as the reference (open-loop) system.
% sample time
Ts = 0.4;
Create the Simulink model myIntegratedEnv containing the flying robot model connected in a
closed loop to the agent block. The function also returns the reinforcement learning environment
object env to be used for training.
env = createIntegratedEnv('rlFlyingRobotEnv','myIntegratedEnv')
env =
SimulinkEnvWithAgent with properties:
Model : myIntegratedEnv
AgentBlock : myIntegratedEnv/RL Agent
ResetFcn : []
UseFastRestart : on
The function can also return the block path to the RL Agent block in the new integrated model, as
well as the observation and action specifications for the reference model.
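The call that returns these additional outputs is not shown here; based on the createIntegratedEnv syntax, it is presumably a call such as the following.
[env,agentBlk,observationInfo,actionInfo] = ...
    createIntegratedEnv('rlFlyingRobotEnv','myIntegratedEnv')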
agentBlk =
'myIntegratedEnv/RL Agent'
observationInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "observation"
Description: [0x0 string]
Dimension: [7 1]
DataType: "double"
actionInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "action"
Description: [0x0 string]
Dimension: [2 1]
DataType: "double"
Returning the block path and specifications is useful in cases in which you need to modify
descriptions, limits, or names in observationInfo and actionInfo. After modifying the
specifications, you can then create an environment from the integrated model IntegratedEnv using
the rlSimulinkEnv function.
Version History
Introduced in R2019a
See Also
Functions
rlSimulinkEnv | rlPredefinedEnv | train | sim | rlNumericSpec | rlFiniteSetSpec
Blocks
RL Agent
Topics
“Create Simulink Reinforcement Learning Environments”
3-436
softplusLayer
Softplus layer for actor or critic network
Description
A softplus layer applies the softplus activation function Y = log(1 + e^X), which ensures that the output
is always positive. This activation function is a smooth continuous version of reluLayer. You can
incorporate this layer into the deep neural networks you define for actors in reinforcement learning
agents. This layer is useful for creating continuous Gaussian policy deep neural networks, for which
the standard deviation output must be positive.
Creation
Syntax
sLayer = softplusLayer
sLayer = softplusLayer(Name,Value)
Description
Properties
Name — Name of layer
'softplus' (default) | character vector
Name of layer, specified as a character vector. To include a layer in a layer graph, you must specify a
nonempty unique layer name. If you train a series network with this layer and Name is set to '', then
the software automatically assigns a name to the layer at training time.
Description of layer, specified as a character vector. When you create the softplus layer, you can use
this property to give it a description that helps you identify its purpose.
Examples
sLayer = softplusLayer;
You can specify the name of the softplus layer. For example, if the softplus layer represents the
standard deviation of a Gaussian policy deep neural network, you can specify an appropriate name.
sLayer = softplusLayer('Name','stddev')
sLayer =
SoftplusLayer with properties:
Name: 'stddev'
Learnable Parameters
No properties.
State Parameters
No properties.
You can incorporate sLayer into an actor network for reinforcement learning.
Version History
Introduced in R2020a
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
See Also
quadraticLayer | scalingLayer
Topics
“Create Policies and Value Functions”
3-438
4
Blocks
4 Blocks
RL Agent
Reinforcement learning agent
Library: Reinforcement Learning Toolbox
Description
Use the RL Agent block to simulate and train a reinforcement learning agent in Simulink. You
associate the block with an agent stored in the MATLAB workspace or a data dictionary, such as an
rlACAgent or rlDDPGAgent object. You connect the block so that it receives an observation and a
computed reward. For instance, consider the following block diagram of the
rlSimplePendulumModel model.
The observation input port of the RL Agent block receives a signal that is derived from the
instantaneous angle and angular velocity of the pendulum. The reward port receives a reward
calculated from the same two values and the applied action. You configure the observations and
reward computations that are appropriate to your system.
The block uses the agent to generate an action based on the observation and reward you provide.
Connect the action output port to the appropriate input for your system. For instance, in the
rlSimplePendulumModel, the action output port is a torque applied to the pendulum system. For
more information about this model, see “Train DQN Agent to Swing Up and Balance Pendulum”.
To train a reinforcement learning agent in Simulink, you generate an environment from the Simulink
model. You then create and configure the agent for training against that environment. For more
information, see “Create Simulink Reinforcement Learning Environments”. When you call train
using the environment, train simulates the model and updates the agent associated with the block.
Ports
Input
This port receives observation signals from the environment. Observation signals represent
measurements or other instantaneous system data. If you have multiple observations, you can use a
Mux block to combine them into a vector signal. To use a nonvirtual bus signal, use bus2RLSpec.
This port receives the reward signal, which you compute based on the observation data. The reward
signal is used during agent training to maximize the expectation of the long-term reward.
Use this signal to specify conditions under which to terminate a training episode. You must configure
logic appropriate to your system to determine the conditions for episode termination. One application
is to terminate an episode that is clearly going well or going poorly. For instance, you can terminate
an episode if the agent reaches its goal or goes irrecoverably far from its goal.
Use this signal to provide an external action to the block. This signal can be a control action from a human expert, which can be used for safe or imitation learning applications. When the value of the use external action signal is 1, the block passes the external action signal to the environment through the action block output. The block also uses the external action to update the agent policy based on the resulting observations and rewards.
Dependencies
For some applications, the action applied to the environment can differ from the action output by the
RL Agent block. For example, the Simulink model can contain a saturation block on the action output
signal.
In such cases, to improve learning results, you can enable this input port and connect the actual
action signal that is applied to the environment.
Note The last action port should be used only with off-policy agents; otherwise, training can produce unexpected results.
Dependencies
Use this signal to pass the external action signal to the environment.
When the value of the use external action signal is 1, the block passes the external action signal to the environment. The block also uses the external action to update the agent policy.
When the value of the use external action signal is 0, the block does not pass the external action signal to the environment and does not update the policy using the external action. Instead, the block outputs the action from the agent policy.
Dependencies
Output
Action computed by the agent based on the observation and reward inputs. Connect this port to the
inputs of your system. To use a nonvirtual bus signal, use bus2RLSpec.
Note Continuous action-space agents such as rlACAgent, rlPGAgent, or rlPPOAgent (the ones
using an rlContinuousGaussianActor object), do not enforce constraints set by the action
specification. In these cases, you must enforce action space constraints within the environment.
Cumulative sum of the reward signal during simulation. Observe or log this signal to track how the
cumulative reward evolves over time.
Dependencies
Parameters
Agent object — Agent to train
agentObj (default) | agent object
Enter the name of an agent object stored in the MATLAB workspace or a data dictionary, such as an
rlACAgent or rlDDPGAgent object. For information about agent objects, see “Reinforcement
Learning Agents”.
If the RL Agent block is within a conditionally executed subsystem, such as a Triggered Subsystem or
a Function-Call Subsystem, you must specify the sample time of the agent object as -1 so that the
block can inherit the sample time of its parent subsystem.
Programmatic Use
Block Parameter: Agent
Type: string, character vector
Default: "agentObj"
Generate a Policy block that implements a greedy policy for the agent specified in Agent object by calling the generatePolicyBlock function. To generate a greedy policy, the block sets the UseExplorationPolicy property of the agent to false before generating the policy block.
The generated block is added to a new Simulink model and the policy data is saved in a MAT-file in
the current working folder.
Enable the external action and use external action block input ports by selecting this parameter.
Programmatic Use
Block Parameter: ExternalActionAsInput
Type: string, character vector
Values: "off" | "on"
Default: "off"
Last action input — Add input ports for last action applied to environment
off (default) | on
Enable the last action block input port by selecting this parameter.
Programmatic Use
Block Parameter: ProvideLastAction
Type: string, character vector
Values: "off" | "on"
Default: "off"
Use strict observation data types — Enforce strict data types for observations
off (default) | on
Select this parameter to enforce the observation data types. In this case, if the data type of the signal
connected to the observation input port does not match the data type in the ObservationInfo
property of the agent, the block attempts to cast the signal to the correct data type. If casting the
data type is not possible, the block generates an error.
Enforcing strict observation data types:
• Lets you validate that the block is getting the correct data types.
• Allows other blocks to inherit their data type from the observation port.
Programmatic Use
Block Parameter: UseStrictObservationDataTypes
Type: string, character vector
Values: "off" | "on"
Default: "off"
Version History
Introduced in R2019a
See Also
Functions
bus2RLSpec | createIntegratedEnv
Blocks
Policy
Topics
“Create Simulink Reinforcement Learning Environments”
“Create Simulink Environment and Train Agent”
4-6
Policy
Reinforcement learning policy
Library: Reinforcement Learning Toolbox
Description
Use the Policy block to simulate a reinforcement learning policy in Simulink and to generate code
(using Simulink Coder) for deployment purposes. This block takes an observation as input and
outputs an action. You associate the block with a MAT-file that contains the information needed to
fully characterize the policy, and which can be generated by generatePolicyFunction or
generatePolicyBlock.
Ports
Input
This port receives observation signals from the environment. Observation signals represent
measurements or other instantaneous system data. If you have multiple observations, you can use a
Mux block to combine them into a vector signal. To use a nonvirtual bus signal, use bus2RLSpec.
Output
Action computed by the policy based on the observation input. Connect this port to the inputs of your
system. To use a nonvirtual bus signal, use bus2RLSpec.
Parameters
Policy block data MAT file — Policy block data MAT file
blockAgentData.mat (default) | file name
Enter the name of the MAT-file containing the information needed to fully characterize the policy. This
file is generated by generatePolicyFunction or generatePolicyBlock. When you generate the
block using generatePolicyBlock and specify a non-default dataFileName argument, then the
generated block has this parameter set to the specified file name, so that the block is associated with
the generated data file.
To use a Policy block within a conditionally executed subsystem, such as a Triggered Subsystem or a
Function-Call Subsystem, you must generate its data file from an agent or policy object which has its
SampleTime property set to -1. Doing so allows the block to inherit the sample time of its parent
subsystem.
Programmatic Use
Block Parameter: MATFile
Type: string, character vector
Default: "blockAgentData.mat"
Version History
Introduced in R2022b
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using Simulink® Coder™.
See Also
Functions
bus2RLSpec | createIntegratedEnv | generatePolicyFunction | generatePolicyBlock
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Create Simulink Reinforcement Learning Environments”
“Create Simulink Environment and Train Agent”
4-8