
Reinforcement Learning Toolbox™

Reference

R2022b
How to Contact MathWorks

Latest news: www.mathworks.com

Sales and services: www.mathworks.com/sales_and_services

User community: www.mathworks.com/matlabcentral

Technical support: www.mathworks.com/support/contact_us

Phone: 508-647-7000

The MathWorks, Inc.


1 Apple Hill Drive
Natick, MA 01760-2098
Reinforcement Learning Toolbox™ Reference
© COPYRIGHT 2019–2022 by The MathWorks, Inc.
The software described in this document is furnished under a license agreement. The software may be used or copied
only under the terms of the license agreement. No part of this manual may be photocopied or reproduced in any form
without prior written consent from The MathWorks, Inc.
FEDERAL ACQUISITION: This provision applies to all acquisitions of the Program and Documentation by, for, or through
the federal government of the United States. By accepting delivery of the Program or Documentation, the government
hereby agrees that this software or documentation qualifies as commercial computer software or commercial computer
software documentation as such terms are used or defined in FAR 12.212, DFARS Part 227.72, and DFARS 252.227-7014.
Accordingly, the terms and conditions of this Agreement and only those rights specified in this Agreement, shall pertain
to and govern the use, modification, reproduction, release, performance, display, and disclosure of the Program and
Documentation by the federal government (or other entity acquiring for or through the federal government) and shall
supersede any conflicting contractual terms or conditions. If this License fails to meet the government's needs or is
inconsistent in any respect with federal procurement law, the government agrees to return the Program and
Documentation, unused, to The MathWorks, Inc.
Trademarks
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be
trademarks or registered trademarks of their respective holders.
Patents
MathWorks products are protected by one or more U.S. patents. Please see www.mathworks.com/patents for
more information.
Revision History
March 2019 Online only New for Version 1.0 (Release 2019a)
September 2019 Online only Revised for Version 1.1 (Release 2019b)
March 2020 Online only Revised for Version 1.2 (Release 2020a)
September 2020 Online only Revised for Version 1.3 (Release 2020b)
March 2021 Online only Revised for Version 2.0 (Release 2021a)
September 2021 Online only Revised for Version 2.1 (Release 2021b)
March 2022 Online only Revised for Version 2.2 (Release 2022a)
September 2022 Online only Revised for Version 2.3 (Release 2022b)
Contents

Apps (Chapter 1)
Functions (Chapter 2)
Objects (Chapter 3)
Blocks (Chapter 4)
1 Apps

Reinforcement Learning Designer


Design, train, and simulate reinforcement learning agents

Description
The Reinforcement Learning Designer app lets you design, train, and simulate agents for existing
environments.

Using this app, you can:

• Import an existing environment from the MATLAB® workspace or create a predefined environment.
• Automatically create or import an agent for your environment (DQN, DDPG, TD3, SAC, and PPO
agents are supported).
• Train and simulate the agent against the environment.
• Analyze simulation results and refine your agent parameters.
• Export the final agent to the MATLAB workspace for further use and deployment.

Limitations

The following features are not supported in the Reinforcement Learning Designer app.

• Multi-agent systems
• Q, SARSA, PG, AC, and SAC agents
• Custom agents
• Agents relying on table or custom basis function representations

If your application requires any of these features, then design, train, and simulate your agent at the command line.


Open the Reinforcement Learning Designer App


• MATLAB Toolstrip: On the Apps tab, under Machine Learning and Deep Learning, click the
app icon.
• MATLAB command prompt: Enter reinforcementLearningDesigner.

Examples
• “Create MATLAB Environments for Reinforcement Learning Designer”
• “Create Simulink Environments for Reinforcement Learning Designer”
• “Create Agents Using Reinforcement Learning Designer”
• “Design and Train Agent Using Reinforcement Learning Designer”

Programmatic Use
reinforcementLearningDesigner opens the Reinforcement Learning Designer app. You can
then import an environment and start the design process, or open a saved design session.

Version History
Introduced in R2021a


See Also
Apps
Deep Network Designer | Simulation Data Inspector

Functions
rlDQNAgent | rlDDPGAgent | rlTD3Agent | rlPPOAgent | analyzeNetwork

Topics
“Create MATLAB Environments for Reinforcement Learning Designer”
“Create Simulink Environments for Reinforcement Learning Designer”
“Create Agents Using Reinforcement Learning Designer”
“Design and Train Agent Using Reinforcement Learning Designer”
“What Is Reinforcement Learning?”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”

2 Functions

accelerate
Package: rl.function

Option to accelerate computation of gradient for approximator object based on neural network

Syntax
newAppx = accelerate(oldAppx,useAcceleration)

Description
newAppx = accelerate(oldAppx,useAcceleration) returns the new neural-network-based
function approximator object newAppx, which has the same configuration as the original object,
oldAppx, and the option to accelerate the gradient computation set to the logical value
useAcceleration.

Examples

Accelerate Gradient Computation for a Q-Value Function

Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space with two channels. The first channel carries an observation from a continuous
four-dimensional space. The second carries a discrete scalar observation that can be either zero or
one. Finally, the action space is a three-dimensional vector in a continuous action space.

obsInfo = [rlNumericSpec([4 1])
           rlFiniteSetSpec([0 1])];

actInfo = rlNumericSpec([3 1]);

To approximate the Q-value function within the critic, create a recurrent deep neural network. The
output layer must be a scalar expressing the value of executing the action given the observation.

Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel. Since the network
is recurrent, use sequenceInputLayer as the input layer and include an lstmLayer as one of the
other network layers.

% Define paths
inPath1 = [ sequenceInputLayer( ...
                prod(obsInfo(1).Dimension), ...
                Name="netObsIn1")
            fullyConnectedLayer(5,Name="infc1") ];

inPath2 = [ sequenceInputLayer( ...
                prod(obsInfo(2).Dimension), ...
                Name="netObsIn2")
            fullyConnectedLayer(5,Name="infc2") ];

inPath3 = [ sequenceInputLayer( ...
                prod(actInfo(1).Dimension), ...
                Name="netActIn")
            fullyConnectedLayer(5,Name="infc3") ];

% Concatenate 3 previous layer outputs along dim 1
jointPath = [ concatenationLayer(1,3,Name="cct")
              tanhLayer
              lstmLayer(8,"OutputMode","sequence")
              fullyConnectedLayer(1,Name="jntfc") ];

% Add layers to network object
net = layerGraph;
net = addLayers(net,inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,inPath3);
net = addLayers(net,jointPath);

% Connect layers
net = connectLayers(net,"infc1","cct/in1");
net = connectLayers(net,"infc2","cct/in2");
net = connectLayers(net,"infc3","cct/in3");

% Plot network
plot(net)

% Convert to dlnetwork and display number of weights
net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 832


Inputs:
1 'netObsIn1' Sequence input with 4 dimensions
2 'netObsIn2' Sequence input with 1 dimensions
3 'netActIn' Sequence input with 3 dimensions

Create the critic with rlQValueFunction, using the network, and the observation and action
specification objects.
critic = rlQValueFunction(net, ...
obsInfo, ...
actInfo, ...
ObservationInputNames=["netObsIn1","netObsIn2"], ...
ActionInputNames="netActIn");

To return the value of the actions as a function of the current observation, use getValue or
evaluate.
val = evaluate(critic, ...
{ rand(obsInfo(1).Dimension), ...
rand(obsInfo(2).Dimension), ...
rand(actInfo(1).Dimension) })

val = 1x1 cell array


{[0.1360]}

When you use evaluate, the result is a single-element cell array containing the value of the action in
the input, given the observation.
val{1}

ans = single
0.1360

Calculate the gradients of the sum of the outputs with respect to the inputs, given random observations and a random action.
gro = gradient(critic,"output-input", ...
{ rand(obsInfo(1).Dimension) , ...
rand(obsInfo(2).Dimension) , ...
rand(actInfo(1).Dimension) } )

gro=3×1 cell array


{4x1 single}
{[ 0.0243]}
{3x1 single}

The result is a cell array with as many elements as the number of input channels. Each element
contains the derivatives of the sum of the outputs with respect to each component of the input
channel. Display the gradient with respect to the element of the second channel.
gro{2}

ans = single
0.0243

Obtain the gradients for a batch of five independent sequences, each made of nine sequential observations.

gro_batch = gradient(critic,"output-input", ...
    { rand([obsInfo(1).Dimension 5 9]) , ...
      rand([obsInfo(2).Dimension 5 9]) , ...
      rand([actInfo(1).Dimension 5 9]) } )

gro_batch=3×1 cell array


{4x5x9 single}
{1x5x9 single}
{3x5x9 single}

Display the derivative of the sum of the outputs with respect to the third observation element of the
first input channel, after the seventh sequential observation in the fourth independent batch.

gro_batch{1}(3,4,7)

ans = single
0.0108

Set the option to accelerate the gradient computations.

critic = accelerate(critic,true);

Calculate the gradients of the sum of the outputs with respect to the parameters, given a random
observation.

grp = gradient(critic,"output-parameters", ...
    { rand(obsInfo(1).Dimension) , ...
      rand(obsInfo(2).Dimension) , ...
      rand(actInfo(1).Dimension) } )

grp=11×1 cell array


{ 5x4 single }
{ 5x1 single }
{ 5x1 single }
{ 5x1 single }
{ 5x3 single }
{ 5x1 single }
{32x15 single }
{32x8 single }
{32x1 single }
{[0.0444 0.1280 -0.1560 0.0193 0.0262 0.0453 -0.0186 -0.0651]}
{[ 1]}

Each array within a cell contains the gradient of the sum of the outputs with respect to a group of
parameters.

Compute the same gradients for a batch of five independent sequences, each made of nine sequential observations.

grp_batch = gradient(critic,"output-parameters", ...
    { rand([obsInfo(1).Dimension 5 9]) , ...
      rand([obsInfo(2).Dimension 5 9]) , ...
      rand([actInfo(1).Dimension 5 9]) } )

grp_batch=11×1 cell array


{ 5x4 single }
{ 5x1 single }
{ 5x1 single }
{ 5x1 single }
{ 5x3 single }


{ 5x1 single }
{32x15 single }
{32x8 single }
{32x1 single }
{[2.6325 10.1821 -14.0886 0.4162 2.0677 5.3991 0.3904 -8.9048]}
{[ 45]}

If you use a batch of inputs, gradient uses the whole input sequence (in this case nine steps), and
all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.
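
As a quick check (a sketch based on the statement above, not part of the original example), you can compare the number of gradient cells with the number of learnable parameter arrays returned by getLearnableParameters.

params = getLearnableParameters(critic);
% Both cell arrays should contain the same number of arrays (11 in this example).
numel(params) == numel(grp_batch)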

Accelerate Gradient Computation for a Discrete Categorical Actor

Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space with two channels. The first channel carries an observation from a continuous
four-dimensional space. The second carries a discrete scalar observation that can be either zero or
one. Finally, the action space consists of a scalar that can be -1, 0, or 1.

obsInfo = [rlNumericSpec([4 1])
           rlFiniteSetSpec([0 1])];

actInfo = rlFiniteSetSpec([-1 0 1]);

Create a deep neural network to be used as an approximation model within the actor. The output layer must have three elements, each one expressing the value of executing the corresponding action, given the observation. To create a recurrent neural network, use sequenceInputLayer as the input layer and include an lstmLayer as one of the other network layers.

% Define paths
inPath1 = [ sequenceInputLayer(prod(obsInfo(1).Dimension))
            fullyConnectedLayer(prod(actInfo.Dimension),Name="fc1") ];

inPath2 = [ sequenceInputLayer(prod(obsInfo(2).Dimension))
            fullyConnectedLayer(prod(actInfo.Dimension),Name="fc2") ];

% Concatenate previous paths outputs along first dimension
jointPath = [ concatenationLayer(1,2,Name="cct")
              tanhLayer
              lstmLayer(8,OutputMode="sequence")
              fullyConnectedLayer( ...
                  prod(numel(actInfo.Elements)), ...
                  Name="jntfc") ];

% Add layers to network object
net = layerGraph;
net = addLayers(net,inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,jointPath);

% Connect layers
net = connectLayers(net,"fc1","cct/in1");
net = connectLayers(net,"fc2","cct/in2");

% Plot network
plot(net)

% Convert to dlnetwork and display the number of weights
net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 386

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
2 'sequenceinput_1' Sequence input with 1 dimensions

Since each element of the output layer must represent the probability of executing one of the possible actions, the software automatically adds a softmaxLayer as a final output layer if you do not specify it explicitly.

Create the actor with rlDiscreteCategoricalActor, using the network and the observation and action specification objects. When the network has multiple input layers, they are automatically associated with the environment observation channels according to the dimension specifications in obsInfo.

actor = rlDiscreteCategoricalActor(net, obsInfo, actInfo);


To return a vector of probabilities for each possible action, use evaluate.


[prob,state] = evaluate(actor, ...
{ rand(obsInfo(1).Dimension) , ...
rand(obsInfo(2).Dimension) });
prob{1}

ans = 3x1 single column vector

0.3403
0.3114
0.3483

To return an action sampled from the distribution, use getAction.


act = getAction(actor, ...
{ rand(obsInfo(1).Dimension) , ...
rand(obsInfo(2).Dimension) });
act{1}

ans = 1

Set the option to accelerate the gradient computations.


actor = accelerate(actor,true);

Calculate the gradients of the sum of the outputs with respect to the parameters, given a batch of inputs. Each array within a cell of the result contains the gradient of the sum of the outputs with respect to a group of parameters.

grp_batch = gradient(actor,"output-parameters", ...
    { rand([obsInfo(1).Dimension 5 9]) , ...
      rand([obsInfo(2).Dimension 5 9]) } )

grp_batch=9×1 cell array


{[-3.6041e-09 -3.5829e-09 -2.8805e-09 -3.2158e-09]}
{[ -9.0017e-09]}
{[ -1.5321e-08]}
{[ -3.0182e-08]}
{32x2 single }
{32x8 single }
{32x1 single }
{ 3x8 single }
{ 3x1 single }

If you use a batch of inputs, the gradient uses the whole input sequence (in this case nine steps),
and all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.

Input Arguments
oldAppx — Function approximator object
function approximator object

Function approximator object, specified as:


• rlValueFunction,
• rlQValueFunction,
• rlVectorQValueFunction,
• rlDiscreteCategoricalActor,
• rlContinuousDeterministicActor,
• rlContinuousGaussianActor,
• rlContinuousDeterministicTransitionFunction,
• rlContinuousGaussianTransitionFunction,
• rlContinuousDeterministicRewardFunction,
• rlContinuousGaussianRewardFunction,
• rlIsDoneFunction.

useAcceleration — Option to use acceleration for gradient computations


false (default) | true

Option to use acceleration for gradient computations, specified as a logical value. When
useAcceleration is true, the gradient computations are accelerated by optimizing and caching
some inputs needed by the automatic-differentiation computation graph. For more information, see
“Deep Learning Function Acceleration for Custom Training Loops”.
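
For instance, acceleration typically pays off when gradient is called repeatedly with inputs of the same size inside a custom training loop. The following sketch (not from the original page) assumes the critic, obsInfo, and actInfo created in the first example above; the parameter update step is omitted.

critic = accelerate(critic,true);
for epoch = 1:10
    % Random batch of five sequences of nine steps each (placeholder data)
    batch = { rand([obsInfo(1).Dimension 5 9]), ...
              rand([obsInfo(2).Dimension 5 9]), ...
              rand([actInfo(1).Dimension 5 9]) };
    g = gradient(critic,"output-parameters",batch);
    % ... use g to update the critic parameters with a custom optimizer
end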

Output Arguments
newAppx — Actor or critic
approximator object

New actor or critic, returned as an approximator object with the same type as oldAppx but with the
gradient acceleration option set to useAcceleration.

Version History
Introduced in R2022a

See Also
evaluate | gradient | getLearnableParameters | rlValueFunction | rlQValueFunction |
rlVectorQValueFunction | rlContinuousDeterministicActor |
rlDiscreteCategoricalActor | rlContinuousGaussianActor |
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction

Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”


allExperiences
Package: rl.replay

Return all experiences in replay memory buffer

Syntax
experiences = allExperiences(buffer)
experience = allExperiences(buffer,ConcatenateMode=mode)

Description
experiences = allExperiences(buffer) returns all experiences stored in experience buffer
buffer as individual experiences, each with a batch size of 1 and a sequence length of 1.

experience = allExperiences(buffer,ConcatenateMode=mode) returns experiences concatenated along the dimension specified by mode. You can concatenate experiences along the batch dimension or the sequence dimension.

Examples

Extract All Experiences from Replay Memory Buffer

Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.
obsContinuous = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[1;5]);
obsDiscrete = rlFiniteSetSpec([1 2 3]);
obsInfo = [obsContinuous obsDiscrete];

Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.
actInfo = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[5;10]);

Create an experience buffer with a maximum length of 5,000.


buffer = rlReplayMemory(obsInfo,actInfo,5000);

Append a sequence of 10 random experiences to the buffer.


for i = 1:10
    experience(i).Observation = ...
        {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
    experience(i).Action = {actInfo.UpperLimit.*rand(2,1)};
    experience(i).NextObservation = ...
        {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
    experience(i).Reward = 10*rand(1);
    experience(i).IsDone = 0;
end

append(buffer,experience);

After appending experiences to the buffer, you can extract them from it. First, extract all of the experiences as individual experiences, each with a batch size of 1 and a sequence length of 1.
experience = allExperiences(buffer)

experience=10×1 struct array with fields:


Observation
Action
NextObservation
Reward
IsDone

Alternatively, you can extract all of the experiences as a single experience batch.
expBatch = allExperiences(buffer,ConcatenateMode="batch")

expBatch = struct with fields:


Observation: {[2x1x10 double] [1x1x10 double]}
Action: {[2x1x10 double]}
Reward: [9.5751 9.1574 7.4313 8.2346 1.8687 1.6261 5.0596 ... ]
NextObservation: {[2x1x10 double] [1x1x10 double]}
IsDone: [0 0 0 0 0 0 0 0 0 0]
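
You can also extract all of the experiences concatenated along the sequence dimension (a brief sketch; per the output description below, the fields then have a batch size of 1 and a sequence length equal to the number of stored experiences).

expSeq = allExperiences(buffer,ConcatenateMode="sequence");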

Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object

Experience buffer, specified as an rlReplayMemory or rlPrioritizedReplayMemory object.

mode — Concatenation mode


"none" (default) | "batch" | "sequence"

Concatenation mode specified as a one of the following values.

• "none" — Return experience as N individual experiences, each with a batch size of 1 and a
sequence length of 1.
• "batch" — Return experience as a single batch with a sequence length of 1.
• "sequence" — Return experience as a single sequence with a batch size of 1.

Output Arguments
experience — All buffered experiences
structure array | structure

All N buffered experiences, returned as a structure array or structure. When mode is:


• "none", experience is returned as a structure array of length N, where each element contains
one buffered experience (batchSize = 1 and SequenceLength = 1).
• "batch", experience is returned as a structure. Each field of experience contains all buffered
experiences concatenated along the batch dimension (batchSize = N and SequenceLength =
1).
• "sequence", experience is returned as a structure. Each field of experience contains all
buffered experiences concatenated along the batch dimension (batchSize = 1 and
SequenceLength = N).

experience contains the following fields.

Observation — Observation
cell array

Observation, returned as a cell array with length equal to the number of observation specifications
specified when creating the buffer. Each element of Observation contains a DO-by-batchSize-by-
SequenceLength array, where DO is the dimension of the corresponding observation specification.

Action — Agent action


cell array

Agent action, returned as a cell array with length equal to the number of action specifications
specified when creating the buffer. Each element of Action contains a DA-by-batchSize-by-
SequenceLength array, where DA is the dimension of the corresponding action specification.

Reward — Reward value


scalar | array

Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-
SequenceLength array.

NextObservation — Next observation


cell array

Next observation reached by taking the specified action from the observation, returned as a cell array
with the same format as Observation.

IsDone — Termination signal


integer | array

Termination signal, returned as a 1-by-1-by-SequenceLength array of integers. Each element of


IsDone has one of the following values.

• 0 — This experience is not the end of an episode.


• 1 — The episode terminated because the environment generated a termination signal.
• 2 — The episode terminated by reaching the maximum episode length.

Version History
Introduced in R2022b


See Also
rlReplayMemory | rlPrioritizedReplayMemory


append
Package: rl.replay

Append experiences to replay memory buffer

Syntax
append(buffer,experience)
append(buffer,experience,dataSourceID)

Description
append(buffer,experience) appends the experiences in experience to the replay memory
buffer.

append(buffer,experience,dataSourceID) appends experiences for the specified data source to the replay memory buffer.

Examples

Create Experience Buffer

Define observation specifications for the environment. For this example, assume that the environment
has a single observation channel with three continuous signals in specified ranges.
obsInfo = rlNumericSpec([3 1],...
LowerLimit=0,...
UpperLimit=[1;5;10]);

Define action specifications for the environment. For this example, assume that the environment has
a single action channel with two continuous signals in specified ranges.
actInfo = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[5;10]);

Create an experience buffer with a maximum length of 20,000.


buffer = rlReplayMemory(obsInfo,actInfo,20000);

Append a single experience to the buffer using a structure. Each experience contains the following
elements: current observation, action, next observation, reward, and is-done.

For this example, create an experience with random observation, action, and reward values. Indicate
that this experience is not a terminal condition by setting the IsDone value to 0.
exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;


Append the experience to the buffer.

append(buffer,exp);

You can also append a batch of experiences to the experience buffer using a structure array. For this
example, append a sequence of 100 random experiences, with the final experience representing a
terminal condition.

for i = 1:100
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;

append(buffer,expBatch);

After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.

miniBatch = sample(buffer,50);

You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive
experiences with a discount factor of 0.95.

horizonSample = sample(buffer,1,...
NStepHorizon=10,...
DiscountFactor=0.95);

The returned sample includes the following information.

• Observation and Action are the observation and action from the first experience in the
horizon.
• NextObservation and IsDone are the next observation and termination signal from the final
experience in the horizon.
• Reward is the cumulative reward across the horizon using the specified discount factor.
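
The cumulative reward mentioned in the last item is, in effect, a discounted sum over the horizon. A minimal sketch of such a sum follows (placeholder rewards; this illustrates the formula rather than the internal implementation).

% Discounted cumulative reward over a 10-step horizon
gamma = 0.95;              % discount factor
r = 10*rand(1,10);         % placeholder per-step rewards
R = sum(gamma.^(0:9).*r);  % R = r(1) + gamma*r(2) + ... + gamma^9*r(10)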

You can also sample a sequence of consecutive experiences. In this case, the structure fields contain
arrays with values for all sampled experiences.

sequenceSample = sample(buffer,1,...
SequenceLength=20);

Create Experience Buffer with Multiple Observation Channels

Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.

obsContinuous = rlNumericSpec([2 1],...
    LowerLimit=0,...
    UpperLimit=[1;5]);
obsDiscrete = rlFiniteSetSpec([1 2 3]);

obsInfo = [obsContinuous obsDiscrete];

Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.

actInfo = rlNumericSpec([2 1],...
    LowerLimit=0,...
    UpperLimit=[5;10]);

Create an experience buffer with a maximum length of 5,000.

buffer = rlReplayMemory(obsInfo,actInfo,5000);

Append a sequence of 50 random experiences to the buffer.

for i = 1:50
exp(i).Observation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
exp(i).NextObservation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Reward = 10*rand(1);
exp(i).IsDone = 0;
end

append(buffer,exp);

After appending experiences to the buffer, you can sample mini-batches of experiences for training your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.

miniBatch = sample(buffer,10);

Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object

Experience buffer, specified as an rlReplayMemory or rlPrioritizedReplayMemory object.

experience — Experience to append to the buffer


structure | structure array

Experience to append to the buffer, specified as a structure or structure array with the following
fields.

Observation — Observation
cell array

Observation, specified as a cell array with length equal to the number of observation specifications
specified when creating the buffer. The dimensions of each element in Observation must match the
dimensions in the corresponding observation specification.

Action — Agent action


cell array


Action taken by the agent, specified as a cell array with length equal to the number of action
specifications specified when creating the buffer. The dimensions of each element in Action must
match the dimensions in the corresponding action specification.

Reward — Reward value


scalar

Reward value obtained by taking the specified action from the starting observation, specified as a
scalar.

NextObservation — Next observation


cell array

Next observation reached by taking the specified action from the starting observation, specified as a
cell array with the same format as Observation.

IsDone — Termination signal


0|1|2

Termination signal, specified as one of the following values.

• 0 — This experience is not the end of an episode.


• 1 — The episode terminated because the environment generated a termination signal.
• 2 — The episode terminated by reaching the maximum episode length.

dataSourceID — Data source index


0 (default) | nonnegative integer | array of nonnegative integers

Data source index, specified as a nonnegative integer or array of nonnegative integers.

If experience is a scalar structure, specify dataSourceID as a scalar integer.

If experience is a structure array, specify dataSourceID as an array with length equal to the
length of experience. You can specify different data source indices for each element of
experience. If all elements in experience come from the same data source, you can specify
dataSourceID as a scalar integer.
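
For illustration, a hedged sketch (reusing the buffer, obsInfo, and actInfo from the first example above; the source indices are arbitrary) that tags experiences from two different data sources, such as two parallel simulation workers:

% Build a small batch of random experiences (same format as above)
for i = 1:5
    expA(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
    expA(i).Action = {actInfo.UpperLimit.*rand(2,1)};
    expA(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
    expA(i).Reward = 10*rand(1);
    expA(i).IsDone = 0;
end
expB = expA;   % second batch, for illustration only

% Append each batch with a different data source index
append(buffer,expA,1);
append(buffer,expB,2);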

Version History
Introduced in R2022a

See Also
rlReplayMemory | sample


barrierPenalty
Logarithmic barrier penalty value for a point with respect to a bounded region

Syntax
p = barrierPenalty(x,xmin,xmax)
p = barrierPenalty( ___ ,maxValue,curvature)

Description
p = barrierPenalty(x,xmin,xmax) calculates the nonnegative (logarithmic barrier) penalty
vector p for the point x with respect to the region bounded by xmin and xmax. p has the same
dimension as x. This syntax uses the default values of 1 and 0.1 for the maxValue and curvature
parameters of the barrier function, respectively.

p = barrierPenalty( ___ ,maxValue,curvature) specifies both the maxValue and curvature parameters of the barrier function. If maxValue is an empty matrix, its default value is used. Likewise, if curvature is an empty matrix or is omitted, its default value is used.

Examples

Calculate Logarithmic Barrier Penalty for a Point

This example shows how to use the logarithmic barrierPenalty function to calculate the barrier
penalty for a given point, with respect to a bounded region.

Calculate the penalty value for the point 0.1 within the interval [-2,2] using default values for the
maximum value and curvature parameters.

barrierPenalty(0.1,-2,2)

ans = 2.5031e-04

Calculate the penalty value for the point 4 outside the interval [-2,2].

barrierPenalty(4,-2,2)

ans = 1

Calculate the penalty value for the point 4 outside the interval [-2,2], using a maximum value
parameter of 5.

barrierPenalty(4,-2,2,5)

ans = 5

Calculate the penalty value for the point 0.1 inside the interval [-2,2], using a maximum value parameter of 5 and a curvature parameter of 0.5.

barrierPenalty(0.1,-2,2,5,0.5)


ans = 0.0013

Calculate the penalty value for the point [-2,0,4] with respect to the box defined by [0,1], [-1,1], and
[-2,2] along the x, y, and z dimensions, respectively, using the default value for maximum value and a
curvature parameter of 0.

barrierPenalty([-2 0 4],[0 -1 -2],[1 1 2],1,0)

ans = 3×1

1
0
1

Visualize Penalty Values for an Interval

Create a vector of 1001 equidistant points distributed between -5 and 5.

x = -5:0.01:5;

Calculate penalties for all the points in the vector, using the default value for the maximum value
parameter and a value of 0.01 for the curvature parameter.

p = barrierPenalty(x,-2,2,1,0.01);

Plot the points and add grid, labels and title.

plot(x,p)
grid
xlabel("point position");
ylabel("penalty value");
title("Penalty values over an interval");


Input Arguments
x — Point for which the penalty is calculated
scalar | vector | matrix

Point for which the penalty is calculated, specified as a numeric scalar, vector or matrix.
Example: [0.5; 1.6]

xmin — Lower bounds


scalar | vector | matrix

Lower bounds for x, specified as a numeric scalar, vector or matrix. To use the same minimum value
for all elements in x, specify xmin as a scalar.
Example: -1

xmax — Upper bounds


scalar | vector | matrix

Upper bounds for x, specified as a numeric scalar, vector or matrix. To use the same maximum value
for all elements in x, specify xmax as a scalar.
Example: 2

maxValue — Maximum value parameter of the barrier function


1 (default) | nonnegative scalar


Maximum value parameter of the barrier function, specified as a scalar.


Example: 2

curvature — Curvature parameter of the barrier function


0.1 (default) | nonnegative scalar

Curvature parameter of the barrier function, specified as a scalar.


Example: 0.2

Output Arguments
p — Penalty value
nonnegative vector

Penalty value, returned as a vector of nonnegative elements. Each element p(i) depends on the position of x(i) with respect to the interval specified by xmin(i) and xmax(i). The barrier penalty function returns the value

p(x) = min( pmax, C*( log(0.25*(xmax − xmin)^2) − log((x − xmin)*(xmax − x)) ) )

when xmin < x < xmax, and maxValue otherwise. Here, C is the curvature argument and pmax is the maxValue argument. Note that for positive values of C the returned penalty value is always positive. If C is zero, then the returned penalty is zero inside the interval defined by the bounds, and pmax outside this interval. If x is multidimensional, then the calculation is applied independently on each dimension. Penalty functions are typically used to generate negative rewards when constraints are violated, such as in generateRewardFunction.
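
The following minimal sketch (scalar inputs only, for illustration; not the toolbox implementation) reproduces the computation implied by this definition:

x = 0.1; xmin = -2; xmax = 2; pmax = 1; C = 0.1;
if (x > xmin) && (x < xmax)
    p = min(pmax, C*(log(0.25*(xmax - xmin)^2) - log((x - xmin)*(xmax - x))));
else
    p = pmax;
end
p   % approximately 2.5031e-04, matching the first example above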

Version History
Introduced in R2021b

Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

See Also
Functions
generateRewardFunction | exteriorPenalty | hyperbolicPenalty

Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Generate Reward Function from a Model Verification Block for a Water Tank System”
“Define Reward Signals”


bus2RLSpec
Create reinforcement learning data specifications for elements of a Simulink bus

Syntax
specs = bus2RLSpec(busName)
specs = bus2RLSpec(busName,Name,Value)

Description
specs = bus2RLSpec(busName) creates a set of reinforcement learning data specifications from
the Simulink® bus object specified by busName. One specification element is created for each leaf
element in the corresponding Simulink bus. Use these specifications to define actions and
observations for a Simulink reinforcement learning environment.

specs = bus2RLSpec(busName,Name,Value) specifies options for creating specifications using one or more Name,Value pair arguments.

Examples

Create an observation specification object from a bus object

This example shows how to use the function bus2RLSpec to create an observation specification
object from a Simulink® bus object.

Create a bus object.

obsBus = Simulink.Bus();

Create three elements in the bus and specify their names.

obsBus.Elements(1) = Simulink.BusElement;
obsBus.Elements(1).Name = 'sin_theta';
obsBus.Elements(2) = Simulink.BusElement;
obsBus.Elements(2).Name = 'cos_theta';
obsBus.Elements(3) = Simulink.BusElement;
obsBus.Elements(3).Name = 'dtheta';

Create the observation specification objects using the Simulink bus object.

obsInfo = bus2RLSpec('obsBus');

You can then use obsInfo, together with the corresponding Simulink model, to create a
reinforcement learning environment. For an example, see “Train DDPG Agent to Swing Up and
Balance Pendulum with Bus Signal”.


Create an action specification object from a bus object

This example shows how to call the function bus2RLSpec using name and value pairs to create an
action specification object from a Simulink® bus object.

Create a bus object.


actBus = Simulink.Bus();

Create one element in the bus and specify the name.


actBus.Elements(1) = Simulink.BusElement;
actBus.Elements(1).Name = 'actuator';

Create the action specification object using the Simulink bus object.
actInfo = bus2RLSpec('actBus','DiscreteElements',{'actuator',[-1 1]});

This specifies that the 'actuator' bus element can carry two possible values, -1 and 1.

You can then use actInfo, together with the corresponding Simulink model, to create a reinforcement learning environment. Specifically, the function that creates the environment uses actInfo to determine the correct bus output of the agent block.

For an example, see “Train DDPG Agent to Swing Up and Balance Pendulum with Bus Signal”.

Input Arguments
busName — Name of Simulink bus object
string | character vector

Name of Simulink bus object, specified as a string or character vector.

Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the


argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'DiscreteElements',{'force',[-5 0 5]} sets the 'force' bus element to be a
discrete data specification with three possible values, -5, 0, and 5

Model — Name of Simulink model


string | character vector

Name of the Simulink model, specified as the comma-separated pair consisting of 'Model' and a
string or character vector. Specify the model name when the bus object is defined in the model global
workspace (for example, in a data dictionary) instead of the MATLAB workspace.
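
For example (a sketch; the model name 'myModel' is hypothetical), if obsBus is defined in the data dictionary of a model rather than in the MATLAB workspace:

specs = bus2RLSpec('obsBus','Model','myModel');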

BusElementNames — Names of bus leaf elements


string array

Names of bus leaf elements for which to create specifications, specified as the comma-separated pair consisting of 'BusElementNames' and a string array. To create observation specifications for a subset of the elements in a Simulink bus object, specify BusElementNames. If you do not specify BusElementNames, a data specification is created for each leaf element in the bus.

Note Do not specify BusElementNames when creating specifications for action signals. The RL
Agent block must output the full bus signal.

DiscreteElements — Finite values for discrete bus elements


cell array of name-value pairs

Finite values for discrete bus elements, specified as the comma-separated pair consisting of
'DiscreteElements' and a cell array of name-value pairs. Each name-value pair consists of a bus
leaf element name and an array of discrete values. The specified discrete values must be castable to
the data type of the specified action signal.

If you do not specify discrete values for an element specification, the element is continuous.
Example: 'DiscreteElements',{'force',[-10 0 10],'torque',[-5 0 5]} specifies discrete values for the 'force' and 'torque' leaf elements of a bus action signal.

Output Arguments
specs — Data specifications
rlNumericSpec object | rlFiniteSetSpec object | array of data specification objects

Data specifications for reinforcement learning actions or observations, returned as one of the
following:

• rlNumericSpec object for a single continuous bus element


• rlFiniteSetSpec object for a single discrete bus element
• Array of data specification objects for multiple bus elements

By default, all data specifications for bus elements are rlNumericSpec objects. To create a discrete
specification for one or more bus elements, specify the element names using the DiscreteElements
name-value pair.

Version History
Introduced in R2019a

See Also
Blocks
RL Agent

Functions
rlSimulinkEnv | createIntegratedEnv | rlNumericSpec | rlFiniteSetSpec

Topics
“Create Simulink Reinforcement Learning Environments”


cleanup
Package: rl.env

Clean up reinforcement learning environment or data logger object

Syntax
cleanup(env)

cleanup(lgr)

Description
When you define a custom training loop for reinforcement learning, you can simulate an agent or
policy against an environment using the runEpisode function. Use the cleanup function to clean up
the environment after running simulations using multiple calls to runEpisode. To clean up the
environment after each simulation, you can configure runEpisode to automatically call the cleanup
function at the end of each episode.

Also use cleanup to perform clean up tasks for a FileLogger or MonitorLogger object after
logging data within a custom training loop.
Environment Objects

cleanup(env) cleans up the specified reinforcement learning environment after running multiple
simulations using runEpisode.
Data Logger Objects

cleanup(lgr) cleans up the specified data logger object after logging data within a custom training loop. This task might involve, for example, transferring any remaining data from the lgr internal memory to a logging target (either a MAT file or a trainingProgressMonitor object). For an example, see “Log Data to Disk in a Custom Training Loop” on page 2-27.

Examples

Simulate Environment and Agent

Create a reinforcement learning environment and extract its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the policy within the actor, use a neural network. Create the network as an array of layer objects.
net = [...
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(24)
    reluLayer
    fullyConnectedLayer(24)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer];

Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 770

Inputs:
1 'input' 4 features

Create a discrete categorical actor using the network.


actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

Check your actor with a random observation.


act = getAction(actor,{rand(obsInfo.Dimension)})

act = 1x1 cell array


{[-10]}

Create a policy object from the actor.


policy = rlStochasticActorPolicy(actor);

Create an experience buffer.


buffer = rlReplayMemory(obsInfo,actInfo);

Set up the environment for running multiple simulations. For this example, configure the training to
log any errors rather than send them to the command window.
setup(env,StopOnError="off")

Simulate multiple episodes using the environment and policy. After each episode, append the
experiences to the buffer. For this example, run 100 episodes.
for i = 1:100
output = runEpisode(env,policy,MaxSteps=300);
append(buffer,output.AgentData.Experiences)
end

Clean up the environment.


cleanup(env)

Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.
batch = sample(buffer,10);

You can then learn from the sampled experiences and update the policy and actor.


Log Data to Disk in a Custom Training Loop

This example shows how to log data to disk when training an agent using a custom training loop.

Create a FileLogger object using rlDataLogger.

flgr = rlDataLogger();

Set up the logger object. This operation initializes the object, performing setup tasks such as creating the directory in which to save the data files.

setup(flgr);

Within a custom training loop, you can now store data to the logger object memory and write data to
file.

For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.

for iter = 1:10

    % Store three random numbers in memory
    % as elements of the variable "Context1"
    for ct = 1:3
        store(flgr, "Context1", rand, iter);
    end

    % Store a random number in memory
    % as the variable "Context2"
    store(flgr, "Context2", rand, iter);

    % Write data to file every 4 iterations
    if mod(iter,4)==0
        write(flgr);
    end

end

Clean up the logger object. This operation performs cleanup tasks such as writing to file any data still in memory.

cleanup(flgr);

Input Arguments
env — Reinforcement learning environment
environment object | ...

Reinforcement learning environment, specified as one of the following objects.


• rlFunctionEnv — Environment defined using custom functions.


• SimulinkEnvWithAgent — Simulink environment created using rlSimulinkEnv or
createIntegratedEnv
• rlMDPEnv — Markov decision process environment
• rlNeuralNetworkEnvironment — Environment with deep neural network transition models
• Predefined environment created using rlPredefinedEnv
• Custom environment created from a template (rlCreateEnvTemplate)

If env is a SimulinkEnvWithAgent object and the associated Simulink model is configured to use
fast restart, then cleanup terminates the model compilation.

lgr — Data logger object


FileLogger object | MonitorLogger object | ...

Data logger object, specified as either a FileLogger or a MonitorLogger object.

Version History
Introduced in R2022a

See Also
Functions
runEpisode | setup | reset | store | write

Objects
rlFunctionEnv | rlMDPEnv | SimulinkEnvWithAgent | rlNeuralNetworkEnvironment |
FileLogger | MonitorLogger

Topics
“Custom Training Loop with Simulink Action Noise”


createGridWorld
Create a two-dimensional grid world for reinforcement learning

Syntax
GW = createGridWorld(m,n)
GW = createGridWorld(m,n,moves)

Description
GW = createGridWorld(m,n) creates a grid world GW of size m-by-n with default actions of
['N';'S';'E';'W'].

GW = createGridWorld(m,n,moves) creates a grid world GW of size m-by-n with actions specified by moves.

Examples

Create Grid World Environment

For this example, consider a 5-by-5 grid world with the following rules:

1 A 5-by-5 grid world bounded by borders, with 4 possible actions (North = 1, South = 2, East = 3,
West = 4).
2 The agent begins from cell [2,1] (second row, first column).
3 The agent receives reward +10 if it reaches the terminal state at cell [5,5] (blue).
4 The environment contains a special jump from cell [2,4] to cell [4,4] with +5 reward.
5 The agent is blocked by obstacles in cells [3,3], [3,4], [3,5] and [4,3] (black cells).
6 All other actions result in -1 reward.


First, create a GridWorld object using the createGridWorld function.


GW = createGridWorld(5,5)

GW =
GridWorld with properties:

GridSize: [5 5]
CurrentState: "[1,1]"
States: [25x1 string]
Actions: [4x1 string]
T: [25x25x4 double]
R: [25x25x4 double]
ObstacleStates: [0x1 string]
TerminalStates: [0x1 string]
ProbabilityTolerance: 8.8818e-16

Now, set the initial, terminal and obstacle states.


GW.CurrentState = '[2,1]';
GW.TerminalStates = '[5,5]';
GW.ObstacleStates = ["[3,3]";"[3,4]";"[3,5]";"[4,3]"];

Update the state transition matrix for the obstacle states and set the jump rule over the obstacle
states.
updateStateTranstionForObstacles(GW)
GW.T(state2idx(GW,"[2,4]"),:,:) = 0;
GW.T(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 1;

Next, define the rewards in the reward transition matrix.


nS = numel(GW.States);
nA = numel(GW.Actions);


GW.R = -1*ones(nS,nS,nA);
GW.R(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 5;
GW.R(:,state2idx(GW,GW.TerminalStates),:) = 10;

Now, use rlMDPEnv to create a grid world environment using the GridWorld object GW.

env = rlMDPEnv(GW)

env =
rlMDPEnv with properties:

Model: [1x1 rl.env.GridWorld]


ResetFcn: []

You can visualize the grid world environment using the plot function.

plot(env)
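
Optionally, as a quick check (not part of the original example), you can verify that the special jump rule was set as intended by inspecting the transition and reward matrices for cell [2,4]:

jumpIdx = state2idx(GW,"[2,4]");
targetIdx = state2idx(GW,"[4,4]");
squeeze(GW.T(jumpIdx,targetIdx,:))   % expected: 1 for every action
squeeze(GW.R(jumpIdx,targetIdx,:))   % expected: 5 for every action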

Input Arguments
m — Number of rows of the grid world
scalar

Number of rows of the grid world, specified as a scalar.


n — Number of columns of the grid world


scalar

Number of columns of the grid world, specified as a scalar.

moves — Action names


'Standard' (default) | 'Kings'

Action names, specified as either 'Standard' or 'Kings'. When moves is set to

• 'Standard', the actions are ['N';'S';'E';'W'].


• 'Kings', the actions are ['N';'S';'E';'W';'NE';'NW';'SE';'SW'].
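
As a quick illustration of the 'Kings' move set described above (a sketch, not from the original page):

GW8 = createGridWorld(3,3,'Kings');
GW8.Actions    % eight actions: N, S, E, W, NE, NW, SE, SW
size(GW8.T)    % 9-by-9-by-8, since K = m*n = 9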

Output Arguments
GW — Two-dimensional grid world
GridWorld object

Two-dimensional grid world, returned as a GridWorld object with properties listed below. For more
information, see “Create Custom Grid World Environments”.

GridSize — Size of the grid world


[m,n] vector

Size of the grid world, specified as a [m,n] vector.

CurrentState — Name of the current state


string

Name of the current state, specified as a string.

States — State names


string vector

State names, specified as a string vector of length m*n.

Actions — Action names


string vector

Action names, specified as a string vector. The length of the Actions vector is determined by the
moves argument.

Actions is a string vector of length:

• Four, if moves is specified as 'Standard'.
• Eight, if moves is specified as 'Kings'.

T — State transition matrix


3D array

State transition matrix, specified as a 3-D array, which determines the possible movements of the
agent in an environment. State transition matrix T is a probability matrix that indicates how likely the
agent will move from the current state s to any possible next state s' by performing action a. T is
given by,


T(s,s′,a) = probability(s′ | s,a).

T is:

• A K-by-K-by-4 array, if moves is specified as 'Standard'. Here, K = m*n.


• A K-by-K-by-8 array, if moves is specified as 'Kings'.

R — Reward transition matrix


3D array

Reward transition matrix, specified as a 3-D array, determines how much reward the agent receives
after performing an action in the environment. R has the same shape and size as state transition
matrix T. Reward transition matrix R is given by,

r = R(s,s′,a).

R is:

• A K-by-K-by-4 array, if moves is specified as 'Standard'. Here, K = m*n.


• A K-by-K-by-8 array, if moves is specified as 'Kings'.

ObstacleStates — State names that cannot be reached in the grid world


string vector

State names that cannot be reached in the grid world, specified as a string vector.

TerminalStates — Terminal state names in the grid world


string vector

Terminal state names in the grid world, specified as a string vector.

Version History
Introduced in R2019a

See Also
rlMDPEnv | rlPredefinedEnv

Topics
“Create Custom Grid World Environments”
“Train Reinforcement Learning Agent in Basic Grid World”


createIntegratedEnv
Create Simulink model for reinforcement learning, using reference model as environment

Syntax
env = createIntegratedEnv(refModel,newModel)
[env,agentBlock,obsInfo,actInfo] = createIntegratedEnv( ___ )

[ ___ ] = createIntegratedEnv( ___ ,Name,Value)

Description
env = createIntegratedEnv(refModel,newModel) creates a Simulink model with the name
specified by newModel and returns a reinforcement learning environment object, env, for this model.
The new model contains an RL Agent block and uses the reference model refModel as a
reinforcement learning environment for training the agent specified by this block.

[env,agentBlock,obsInfo,actInfo] = createIntegratedEnv( ___ ) returns the block path to the RL Agent block in the new model and the observation and action data specifications for the reference model, obsInfo and actInfo, respectively.

[ ___ ] = createIntegratedEnv( ___ ,Name,Value) creates a model and environment interface using port, observation, and action information specified using one or more Name,Value pair arguments.

Examples

Create Environment from Simulink Model

This example shows how to use createIntegratedEnv to create an environment object starting from a Simulink model that implements the system with which the agent interacts. Such a system is often referred to as the plant, open-loop system, or reference system, while the whole (integrated) system including the agent is often referred to as the closed-loop system.

For this example, use the flying robot model described in “Train DDPG Agent to Control Flying
Robot” as the reference (open-loop) system.

Open the flying robot model.


open_system('rlFlyingRobotEnv')

Initialize the state variables and sample time.


% initial model state variables
theta0 = 0;
x0 = -15;
y0 = 0;

% sample time
Ts = 0.4;


Create the Simulink model myIntegratedEnv containing the flying robot model connected in a
closed loop to the agent block. The function also returns the reinforcement learning environment
object env to be used for training.

env = createIntegratedEnv('rlFlyingRobotEnv','myIntegratedEnv')

env =
SimulinkEnvWithAgent with properties:

Model : myIntegratedEnv
AgentBlock : myIntegratedEnv/RL Agent
ResetFcn : []
UseFastRestart : on

The function can also return the block path to the RL Agent block in the new integrated model, as
well as the observation and action specifications for the reference model.

[~,agentBlk,observationInfo,actionInfo] = createIntegratedEnv( ...
    'rlFlyingRobotEnv','myIntegratedEnv')

agentBlk =
'myIntegratedEnv/RL Agent'

observationInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "observation"
Description: [0x0 string]
Dimension: [7 1]
DataType: "double"

actionInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "action"
Description: [0x0 string]
Dimension: [2 1]
DataType: "double"

Returning the block path and specifications is useful in cases in which you need to modify descriptions, limits, or names in observationInfo and actionInfo. After modifying the specifications, you can then create an environment from the integrated model myIntegratedEnv using the rlSimulinkEnv function.
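
For instance, a hedged sketch of that follow-up step (the modified limits shown here are arbitrary):

% Adjust the action limits, then recreate the environment from the
% integrated model using rlSimulinkEnv.
actionInfo.LowerLimit = -1;
actionInfo.UpperLimit = 1;
env = rlSimulinkEnv('myIntegratedEnv','myIntegratedEnv/RL Agent', ...
    observationInfo,actionInfo);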

Create Integrated Environment with Specified Port Names

This example shows how to call createIntegratedEnv using name-value pairs to specify port
names.


The first argument of createIntegratedEnv is the name of the reference Simulink model that
contains the system with which the agent must interact. Such a system is often referred to as plant,
or open-loop system. For this example, the reference system is the model of a water tank.

Open the open-loop water tank model.

open_system('rlWatertankOpenloop')

Set the sample time of the discrete integrator block used to generate the observation, so the
simulation can run.

Ts = 1;

The input port is called u (instead of action), and the first and third output ports are called y and
stop (instead of observation and isdone). Specify the port names using name-value pairs.

env = createIntegratedEnv('rlWatertankOpenloop','IntegratedWatertank',...
'ActionPortName','u','ObservationPortName','y','IsDonePortName','stop')

env =
SimulinkEnvWithAgent with properties:

Model : IntegratedWatertank
AgentBlock : IntegratedWatertank/RL Agent
ResetFcn : []
UseFastRestart : on

The new model IntegratedWatertank contains the reference model connected in a closed-loop
with the agent block. The function also returns the reinforcement learning environment object to be
used for training.

Input Arguments
refModel — Reference model name
string | character vector

Reference model name, specified as a string or character vector. This is the Simulink model
implementing the system that the agent needs to interact with. Such a system is often referred to as
plant, open loop system or reference system, while the whole (integrated) system including the agent
is often referred to as the closed loop system. The new Simulink model uses this reference model as
the dynamic model of the environment for reinforcement learning.

newModel — New model name


string | character vector

New model name, specified as a string or character vector. createIntegratedEnv creates a


Simulink model with this name, but does not save the model.

Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the


argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.


Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'IsDonePortName',"stopSim" sets the stopSim port of the reference model as the
source of the isdone signal.

ObservationPortName — Reference model observation output port name


"observation" (default) | string | character vector

Reference model observation output port name, specified as the comma-separated pair consisting of
'ObservationPortName' and a string or character vector. Specify ObservationPortName when
the name of the observation output port of the reference model is not "observation".

ActionPortName — Reference model action input port name


"action" (default) | string | character vector

Reference model action input port name, specified as the comma-separated pair consisting of
'ActionPortName' and a string or character vector. Specify ActionPortName when the name of
the action input port of the reference model is not "action".

RewardPortName — Reference model reward output port name


"reward" (default) | string | character vector

Reference model reward output port name, specified as the comma-separated pair consisting of
'RewardPortName' and a string or character vector. Specify RewardPortName when the name of
the reward output port of the reference model is not "reward".

IsDonePortName — Reference model done flag output port name


"isdone" (default) | string | character vector

Reference model done flag output port name, specified as the comma-separated pair consisting of
'IsDonePortName' and a string or character vector. Specify IsDonePortName when the name of
the done flag output port of the reference model is not "isdone".

ObservationBusElementNames — Names of observation bus leaf elements


string array

Names of observation bus leaf elements for which to create specifications, specified as a string array.
To create observation specifications for a subset of the elements in a Simulink bus object, specify
ObservationBusElementNames. If you do not specify ObservationBusElementNames, a data
specification is created for each leaf element in the bus.

ObservationBusElementNames is applicable only when the observation output port is a bus signal.
Example: 'ObservationBusElementNames',["sin" "cos"] creates specifications for the
observation bus elements with the names "sin" and "cos".

ObservationDiscreteElements — Finite values for observation specifications


cell array of name-value pairs

Finite values for discrete observation specification elements, specified as the comma-separated pair
consisting of 'ObservationDiscreteElements' and a cell array of name-value pairs. Each name-
value pair consists of an element name and an array of discrete values.

If the observation output port of the reference model is:


• A bus signal, specify the name of one of the leaf elements of the bus specified by
ObservationBusElementNames
• A nonbus signal, specify the name of the observation port, as specified by ObservationPortName

The specified discrete values must be castable to the data type of the specified observation signal.

If you do not specify discrete values for an observation specification element, the element is
continuous.
Example: 'ObservationDiscreteElements',{'observation',[-1 0 1]} specifies discrete
values for a nonbus observation signal with default port name observation.
Example: 'ObservationDiscreteElements',{'gear',[-1 0 1 2],'direction',[1 2 3 4]}
specifies discrete values for the 'gear' and 'direction' leaf elements of a bus observation signal.

ActionDiscreteElements — Finite values for action specifications


cell array of name-value pairs

Finite values for discrete action specification elements, specified as the comma-separated pair
consisting of 'ActionDiscreteElements' and a cell array of name-value pairs. Each name-value
pair consists of an element name and an array of discrete values.

If the action input port of the reference model is:

• A bus signal, specify the name of a leaf element of the bus
• A nonbus signal, specify the name of the action port, as specified by ActionPortName

The specified discrete values must be castable to the data type of the specified action signal.

If you do not specify discrete values for an action specification element, the element is continuous.
Example: 'ActionDiscreteElements',{'action',[-1 0 1]} specifies discrete values for a
nonbus action signal with default port name 'action'.
Example: 'ActionDiscreteElements',{'force',[-10 0 10],'torque',[-5 0 5]} specifies
discrete values for the 'force' and 'torque' leaf elements of a bus action signal.

Output Arguments
env — Reinforcement learning environment
SimulinkEnvWithAgent object

Reinforcement learning environment interface, returned as a SimulinkEnvWithAgent object.

agentBlock — Block path to the agent block


character vector

Block path to the agent block in the new model, returned as a character vector. To train an agent in
the new Simulink model, you must create an agent and specify the agent name in the RL Agent block
indicated by agentBlock.

For more information on creating agents, see “Reinforcement Learning Agents”.

obsInfo — Observation data specifications


rlNumericSpec object | rlFiniteSetSpec object | array of data specification objects


Observation data specifications, returned as one of the following:

• rlNumericSpec object for a single continuous observation specification


• rlFiniteSetSpec object for a single discrete observation specification
• Array of data specification objects for multiple specifications

actInfo — Action data specifications


rlNumericSpec object | rlFiniteSetSpec object | array of data specification objects

Action data specifications, returned as one of the following:

• rlNumericSpec object for a single continuous action specification


• rlFiniteSetSpec object for a single discrete action specification
• Array of data specification objects for multiple action specifications

Version History
Introduced in R2019a

See Also
Blocks
RL Agent

Functions
rlSimulinkEnv | bus2RLSpec | rlNumericSpec | rlFiniteSetSpec

Topics
“Create Simulink Reinforcement Learning Environments”


createMDP
Create Markov decision process model

Syntax
MDP = createMDP(states,actions)

Description
MDP = createMDP(states,actions) creates a Markov decision process model with the specified
states and actions.

Examples

Create MDP Model

Create an MDP model with eight states and two possible actions.

MDP = createMDP(8,["up";"down"]);

Specify the state transitions and their associated rewards.

% State 1 Transition and Reward


MDP.T(1,2,1) = 1;
MDP.R(1,2,1) = 3;
MDP.T(1,3,2) = 1;
MDP.R(1,3,2) = 1;

% State 2 Transition and Reward


MDP.T(2,4,1) = 1;
MDP.R(2,4,1) = 2;
MDP.T(2,5,2) = 1;
MDP.R(2,5,2) = 1;

% State 3 Transition and Reward


MDP.T(3,5,1) = 1;
MDP.R(3,5,1) = 2;
MDP.T(3,6,2) = 1;
MDP.R(3,6,2) = 4;

% State 4 Transition and Reward


MDP.T(4,7,1) = 1;
MDP.R(4,7,1) = 3;
MDP.T(4,8,2) = 1;
MDP.R(4,8,2) = 2;

% State 5 Transition and Reward


MDP.T(5,7,1) = 1;
MDP.R(5,7,1) = 1;
MDP.T(5,8,2) = 1;
MDP.R(5,8,2) = 9;


% State 6 Transition and Reward


MDP.T(6,7,1) = 1;
MDP.R(6,7,1) = 5;
MDP.T(6,8,2) = 1;
MDP.R(6,8,2) = 1;

% State 7 Transition and Reward


MDP.T(7,7,1) = 1;
MDP.R(7,7,1) = 0;
MDP.T(7,7,2) = 1;
MDP.R(7,7,2) = 0;

% State 8 Transition and Reward


MDP.T(8,8,1) = 1;
MDP.R(8,8,1) = 0;
MDP.T(8,8,2) = 1;
MDP.R(8,8,2) = 0;

Specify the terminal states of the model.


MDP.TerminalStates = ["s7";"s8"];

Input Arguments
states — Model states
positive integer | string vector

Model states, specified as one of the following:

• Positive integer — Specify the number of model states. In this case, each state has a default name,
such as "s1" for the first state.
• String vector — Specify the state names. In this case, the total number of states is equal to the
length of the vector.

actions — Model actions


positive integer | string vector

Model actions, specified as one of the following:

• Positive integer — Specify the number of model actions. In this case, each action has a default
name, such as "a1" for the first action.
• String vector — Specify the action names. In this case, the total number of actions is equal to the
length of the vector.
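For example, the following sketch creates a model with three named states and two named actions (the names shown here are illustrative):

MDP = createMDP(["start";"middle";"goal"],["left";"right"]);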

Output Arguments
MDP — MDP model
GenericMDP object

MDP model, returned as a GenericMDP object with the following properties.

CurrentState — Name of the current state


string


Name of the current state, specified as a string.

States — State names


string vector

State names, specified as a string vector with length equal to the number of states.

Actions — Action names


string vector

Action names, specified as a string vector with length equal to the number of actions.

T — State transition matrix


3D array

State transition matrix, specified as a 3-D array, which determines the possible movements of the
agent in an environment. State transition matrix T is a probability matrix that indicates how likely the
agent will move from the current state s to any possible next state s' by performing action a. T is an
S-by-S-by-A array, where S is the number of states and A is the number of actions. It is given by:

T(s,s',a) = probability(s' | s,a).

The transition probabilities out of a nonterminal state s under a given action must sum to one.
Therefore, you must specify all stochastic transitions out of a given state at the same time.

For example, to indicate that in state 1 following action 4 there is an equal probability of moving to
states 2 or 3, use the following:

MDP.T(1,[2 3],4) = [0.5 0.5];

You can also specify that, following an action, there is some probability of remaining in the same
state. For example:

MDP.T(1,[1 2 3 4],1) = [0.25 0.25 0.25 0.25];

R — Reward transition matrix


3D array

Reward transition matrix, specified as a 3-D array, which determines how much reward the agent
receives after performing an action in the environment. R has the same shape and size as state
transition matrix T. The reward for moving from state s to state s' by performing action a is given
by:

r = R(s,s',a).
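For example, consistent with the transition example above, the following sketch assigns a reward of 10 for moving from state 1 to state 2 and a reward of -1 for moving from state 1 to state 3, both under action 4 (the reward values are illustrative):

MDP.R(1,[2 3],4) = [10 -1];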

TerminalStates — Terminal state names
string vector

Terminal state names, specified as a string vector of state names.

Version History
Introduced in R2019a


See Also
rlMDPEnv | createGridWorld

Topics
“Train Reinforcement Learning Agent in MDP Environment”


evaluate
Package: rl.function

Evaluate function approximator object given observation (or observation-action) input data

Syntax
outData = evaluate(fcnAppx,inData)
[outData,state] = evaluate(fcnAppx,inData)

Description
outData = evaluate(fcnAppx,inData) evaluates the function approximator object (that is, the
actor or critic) fcnAppx given the input value inData. It returns the output value outData.

[outData,state] = evaluate(fcnAppx,inData) also returns the updated state of fcnAppx
when it contains a recurrent neural network.

Examples

Evaluate a Function Approximator Object

This example shows you how to evaluate a function approximator object (that is, an actor or a critic).
For this example, the function approximator object is a discrete categorical actor and you evaluate it
given some observation data, obtaining in return the action probability distribution and the updated
network state.

Load the same environment used in “Train PG Agent to Balance Cart-Pole System”, and obtain the
observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env)

obsInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "CartPole States"
Description: "x, dx, theta, dtheta"
Dimension: [4 1]
DataType: "double"

actInfo = getActionInfo(env)

actInfo =
rlFiniteSetSpec with properties:

Elements: [-10 10]


Name: "CartPole Action"


Description: [0x0 string]


Dimension: [1 1]
DataType: "double"

To approximate the policy within the actor, use a recurrent deep neural network. Define the network
as an array of layer objects. Get the dimensions of the observation space and the number of possible
actions directly from the environment specification objects.
net = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(8)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(numel(actInfo.Elements)) ];

Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 602

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions

Create a stochastic actor representation for the network.


actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

Use evaluate to return the probability of each of the two possible actions. Note that the type of the
returned numbers is single, not double.
[prob,state] = evaluate(actor,{rand(obsInfo.Dimension)});
prob{1}

ans = 2x1 single column vector

0.4847
0.5153

Since a recurrent neural network is used for the actor, the second output argument, representing the
updated state of the neural network, is not empty. In this case, it contains the updated (cell and
hidden) states for the eight units of the lstm layer used in the network.
state{:}

ans = 8x1 single column vector

-0.0833
0.0619
-0.0066
-0.0651
0.0714
-0.0957
0.0614


-0.0326

ans = 8x1 single column vector

-0.1367
0.1142
-0.0158
-0.1820
0.1305
-0.1779
0.0947
-0.0833

You can use getState and setState to extract and set the current state of the actor.

getState(actor)

ans=2×1 cell array


{8x1 single}
{8x1 single}

actor = setState(actor, ...
    {-0.01*single(rand(8,1)), ...
    0.01*single(rand(8,1))});

You can obtain action probabilities and updated states for a batch of observations. For example, use a
batch of five independent observations.

obsBatch = reshape(1:20,4,1,5,1);
[prob,state] = evaluate(actor,{obsBatch})

prob = 1x1 cell array


{2x5 single}

state=2×1 cell array


{8x5 single}
{8x5 single}

The output arguments contain action probabilities and updated states for each observation in the
batch.

Note that the actor treats observation data along the batch length dimension independently, not
sequentially.

prob{1}

ans = 2x5 single matrix

0.5187 0.5869 0.6048 0.6124 0.6155
0.4813 0.4131 0.3952 0.3876 0.3845

prob = evaluate(actor,{obsBatch(:,:,[5 4 3 1 2])});


prob{1}


ans = 2x5 single matrix

0.6155 0.6124 0.6048 0.5187 0.5869
0.3845 0.3876 0.3952 0.4813 0.4131

To evaluate the actor using sequential observations, use the sequence length (time) dimension. For
example, obtain action probabilities for five independent sequences, each one made of nine
sequential observations.

[prob,state] = evaluate(actor, ...
    {rand([obsInfo.Dimension 5 9])})

prob = 1x1 cell array


{2x5x9 single}

state=2×1 cell array


{8x5 single}
{8x5 single}

The first output argument contains a vector of two probabilities (first dimension) for each element of
the observation batch (second dimension) and for each time element of the sequence length (third
dimension).

The second output argument contains two vectors of final states for each observation batch (that is,
the network maintains a separate state history for each observation batch).

Display the probability of the second action, after the seventh sequential observation in the fourth
independent batch.

prob{1}(2,4,7)

ans = single
0.5675

For more information on input and output format for recurrent neural networks, see the Algorithms
section of lstmLayer.

Input Arguments
fcnAppx — Function approximator object
function approximator object

Function approximator object, specified as:

• rlValueFunction,
• rlQValueFunction,
• rlVectorQValueFunction,
• rlDiscreteCategoricalActor,
• rlContinuousDeterministicActor,
• rlContinuousGaussianActor,


• rlContinuousDeterministicTransitionFunction,
• rlContinuousGaussianTransitionFunction,
• rlContinuousDeterministicRewardFunction,
• rlContinuousGaussianRewardFunction,
• rlIsDoneFunction object.

inData — Input data for function approximator


cell array

Input data for the function approximator, specified as a cell array with as many elements as the
number of input channels of fcnAppx. In the following section, the number of observation channels is
indicated by NO.

• If fcnAppx is an rlQValueFunction, an
rlContinuousDeterministicTransitionFunction or an
rlContinuousGaussianTransitionFunction object, then each of the first NO elements of
inData must be a matrix representing the current observation from the corresponding
observation channel. They must be followed by a final matrix representing the action.
• If fcnAppx is a function approximator object representing an actor or critic (but not an
rlQValueFunction object), inData must contain NO elements, each one a matrix representing
the current observation from the corresponding observation channel.
• If fcnAppx is an rlContinuousDeterministicRewardFunction, an
rlContinuousGaussianRewardFunction, or an rlIsDoneFunction object, then each of the
first NO elements of inData must be a matrix representing the current observation from the
corresponding observation channel. They must be followed by a matrix representing the action,
and finally by NO elements, each one being a matrix representing the next observation from the
corresponding observation channel.

Each element of inData must be a matrix of dimension MC-by-LB-by-LS, where:

• MC corresponds to the dimensions of the associated input channel.


• LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of (independent)
inputs, specify LB > 1. If inData has multiple elements, then LB must be the same for all elements
of inData.
• LS specifies the sequence length (length of the sequence of inputs along the time dimension) for
recurrent neural network. If fcnAppx does not use a recurrent neural network (which is the case
for environment function approximators, as they do not support recurrent neural networks), then
LS = 1. If inData has multiple elements, then LS must be the same for all elements of inData.

For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Example: {rand(8,3,64,1),rand(4,1,64,1),rand(2,1,64,1)}
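As a further illustration, the following sketch (the network architecture and layer names are illustrative) builds a Q-value critic with one observation channel and one action channel, and then evaluates it. For this kind of critic, inData lists the observation matrix first, followed by the action matrix.

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);
obsPath = featureInputLayer(4,Name="obsIn");
actPath = featureInputLayer(2,Name="actIn");
common = [
    concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(8)
    reluLayer
    fullyConnectedLayer(1)
    ];
net = layerGraph(obsPath);
net = addLayers(net,actPath);
net = addLayers(net,common);
net = connectLayers(net,"obsIn","concat/in1");
net = connectLayers(net,"actIn","concat/in2");
critic = rlQValueFunction(dlnetwork(net),obsInfo,actInfo, ...
    ObservationInputNames="obsIn",ActionInputNames="actIn");

% Evaluate the critic for one observation-action pair (batch size 1).
qValue = evaluate(critic,{rand(4,1),rand(2,1)});
qValue{1}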

Output Arguments
outData — Output data from evaluation of function approximator object
cell array


Output data from the evaluation of the function approximator object, returned as a cell array. The size
and contents of outData depend on the type of object you use for fcnAppx, and are shown in the
following list. Here, NO is the number of observation channels.

• rlContinuousDeterministicTransitionFunction - NO matrices, each one representing the
predicted observation from the corresponding observation channel.
• rlContinuousGaussianTransitionFunction - NO matrices representing the mean value of
the predicted observation for the corresponding observation channel, followed by NO matrices
representing the standard deviation of the predicted observation for the corresponding
observation channel.
• rlContinuousGaussianActor - Two matrices representing the mean value and standard
deviation of the action, respectively.
• rlDiscreteCategoricalActor - A matrix with the probabilities of each action.
• rlContinuousDeterministicActor - A matrix with the action.
• rlVectorQValueFunction - A matrix with the values of each possible action.
• rlQValueFunction - A matrix with the value of the action.
• rlValueFunction - A matrix with the value of the current observation.
• rlContinuousDeterministicRewardFunction - A matrix with the predicted reward as a
function of current observation, action, and next observation following the action.
• rlContinuousGaussianRewardFunction - Two matrices representing the mean value and
standard deviation, respectively, of the predicted reward as a function of current observation,
action, and next observation following the action.
• rlIsDoneFunction - A vector with the probabilities of the predicted termination status.
Termination probabilities range from 0 (no termination predicted) to 1 (termination predicted),
and depend (in the most general case) on the values of observation, action, and next observation
following the action.

Each element of outData is a matrix of dimensions D-by-LB-by-LS, where:

• D is the vector of dimensions of the corresponding output channel of fcnAppx. Depending on the
type of approximator function, this channel can carry a predicted observation (or its mean value or
standard deviation), an action (or its mean value or standard deviation), the value (or values) of an
observation (or observation-action couple), a predicted reward, or a predicted termination status.
• LB is the batch size (length of a batch of independent inputs).
• LS is the sequence length (length of the sequence of inputs along the time dimension) for a
recurrent neural network. If fcnAppx does not use a recurrent neural network (which is the case
for environment function approximators, as they do not support recurrent neural networks), then
LS = 1.

Note If fcnAppx is a critic, then evaluate behaves identically to getValue except that it returns
results inside a single-cell array. If fcnAppx is an rlContinuousDeterministicActor actor, then
evaluate behaves identically to getAction. If fcnAppx is a stochastic actor such as an
rlDiscreteCategoricalActor or rlContinuousGaussianActor, then evaluate returns the
action probability distribution, while getAction returns a sample action. Specifically, for an
rlDiscreteCategoricalActor actor object, evaluate returns the probability of each possible
action. For an rlContinuousGaussianActor actor object, evaluate returns the mean and
standard deviation of the Gaussian distribution. For these kinds of actors, see also the note in
getAction regarding the enforcement of constraints set by the action specification.


Note If fcnAppx is an rlContinuousDeterministicRewardFunction object, then evaluate
behaves identically to predict except that it returns results inside a single-cell array. If fcnAppx is
an rlContinuousDeterministicTransitionFunction object, then evaluate behaves
identically to predict. If fcnAppx is an rlContinuousGaussianTransitionFunction object,
then evaluate returns the mean value and standard deviation of the observation probability
distribution, while predict returns an observation sampled from this distribution. Similarly, for an
rlContinuousGaussianRewardFunction object, evaluate returns the mean value and standard
deviation of the reward probability distribution, while predict returns a reward sampled from this
distribution. Finally, if fcnAppx is an rlIsDoneFunction object, then evaluate returns the
probabilities of the termination status being false or true, respectively, while predict returns a
predicted termination status sampled with these probabilities.

state — Updated state of function approximator object


cell array

Updated state of the function approximator object, returned as a cell array. If fcnAppx does not use a
recurrent neural network (which is the case for environment function approximators), then state is
an empty cell array.

You can set the state of the representation to state using the setState function. For example:

critic = setState(critic,state);

Version History
Introduced in R2022a

See Also
getValue | getAction | getMaxQValue | rlValueFunction | rlQValueFunction |
rlVectorQValueFunction | rlContinuousDeterministicActor |
rlDiscreteCategoricalActor | rlContinuousGaussianActor |
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | accelerate | gradient | predict

Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”


exteriorPenalty
Exterior penalty value for a point with respect to a bounded region

Syntax
p = exteriorPenalty(x,xmin,xmax,method)

Description
p = exteriorPenalty(x,xmin,xmax,method) uses the specified method to calculate the
nonnegative (exterior) penalty vector p for the point x with respect to the region bounded by xmin
and xmax. p has the same dimension as x.

Examples

Calculate Exterior Penalty for Point

This example shows how to use the exteriorPenalty function to calculate the exterior penalty for
a given point, with respect to a bounded region.

Calculate the penalty value for the point 0.1 within the interval [-2,2], using the step method.

exteriorPenalty(0.1,-2,2,'step')

ans = 0

Calculate the penalty value for the point 4 outside the interval [-2,2], using the step method.

exteriorPenalty(4,-2,2,'step')

ans = 1

Calculate the penalty value for the point 4 outside the interval [-2,2], using the quadratic method.

exteriorPenalty(4,-2,2,'quadratic')

ans = 4

Calculate the penalty value for the point [-2,0,4] with respect to the box defined by the intervals
[0,1], [-1,1], and [-2,2] along the x, y, and z dimensions, respectively, using the quadratic method.

exteriorPenalty([-2 0 4],[0 -1 -2],[1 1 2],'quadratic')

ans = 3×1

4
0
4


Visualize Penalty Values for an Interval

Create a vector of 1001 equidistant points distributed between -5 and 5.

x = -5:0.01:5;

Calculate penalties for all the points in the vector, using the quadratic method.

p = exteriorPenalty(x,-2,2,'quadratic');

Plot the points and add grid, labels, and title.

plot(x,p)
grid
xlabel("point position");
ylabel("penalty value");
title("Penalty values over an interval");

Input Arguments
x — Point for which penalty is calculated
scalar | vector | matrix

Point for which the exterior penalty is calculated, specified as a numeric scalar, vector, or matrix.
Example: [-0.1, 1.3]


xmin — Lower bounds


scalar | vector | matrix

Lower bounds for x, specified as a numeric scalar, vector, or matrix. To use the same minimum value
for all elements in x, specify xmin as a scalar.
Example: -2

xmax — Upper bounds


scalar | vector | matrix

Upper bounds for x, specified as a numeric scalar, vector, or matrix. To use the same maximum value
for all elements in x, specify xmax as a scalar.
Example: [5 10]

method — Function used to calculate the penalty


'step' | 'quadratic'

Function used to calculate the penalty, specified either as 'step' or 'quadratic'. You can also use
strings instead of character vectors.
Example: "quadratic"

Output Arguments
p — Penalty value
nonnegative vector

Penalty value, returned as a vector of nonnegative elements. With either of the two methods, each
element p_i is zero if the corresponding x_i is within the region specified by xmin_i and xmax_i, and it is
positive otherwise. Penalty functions are typically used to generate negative rewards when
constraints are violated, such as in generateRewardFunction.
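For example, a minimal sketch of a reward computation that penalizes an action leaving its allowed interval (the weight, bounds, and signal values are illustrative):

act = 3;                                            % action outside the interval [-2,2]
trackingError = 0.5;                                % illustrative cost term
penalty = 10*exteriorPenalty(act,-2,2,'quadratic'); % weighted exterior penalty
reward = -(trackingError + penalty)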

Version History
Introduced in R2021b

Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

See Also
Functions
generateRewardFunction | hyperbolicPenalty | barrierPenalty

Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Define Reward Signals”


generatePolicyBlock
Generate Simulink block that evaluates policy of an agent or policy object

Syntax
generatePolicyBlock(agent)

generatePolicyBlock(policy)

generatePolicyBlock( ___ ,MATFileName=dataFileName)

Description
This function generates a Simulink Policy evaluation block from an agent or policy object. It also
creates a data file which stores policy information. The generated policy block loads this data file to
properly initialize itself prior to simulation. You can use the block to simulate the policy and generate
code for deployment purposes.

For more information on policies and value functions, see “Create Policies and Value Functions”.

generatePolicyBlock(agent) creates a block that evaluates the policy of the specified agent
using the default block name, policy name, and data file name.

generatePolicyBlock(policy) creates a block that evaluates the learned policy of the specified
policy object using the default block name, policy name, and data file name.

generatePolicyBlock( ___ ,MATFileName=dataFileName) specifies the file name of the data
file.

Examples

Create Policy Evaluation Block from PG Agent

First, create and train a reinforcement learning agent. For this example, load the PG agent trained in
“Train PG Agent to Balance Cart-Pole System”.

load("MATLABCartpolePG.mat","agent")

Then, create a policy evaluation block from this agent using default names.

generatePolicyBlock(agent);

This command creates an untitled Simulink® model containing the policy block, and the
blockAgentData.mat file, which contains the information needed to create and initialize the policy
block (such as the trained deep neural network used by the actor within the agent). The block loads
this data file to properly initialize itself prior to simulation.


You can now drag and drop the block in a Simulink® model and connect it so that it takes the
observation from the environment as input and so that the calculated action is returned to the
environment. This allows you to simulate the policy in a closed loop. You can then generate code for
deployment purposes. For more information, see “Deploy Trained Reinforcement Learning Policies”.

Close the model.

bdclose("untitled")

Create Policy Block from Deterministic Actor Policy Object

Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.

obsInfo = rlNumericSpec([4 1]);


actInfo = rlNumericSpec([2 1]);

Alternatively, use getObservationInfo and getActionInfo to extract the specification objects
from an environment.

Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.

To approximate the policy function within the actor, use a recurrent deep neural network model.
Define the network as an array of layer objects, and get the dimension of the observation and action
spaces from the environment specification objects. To create a recurrent network, use a
sequenceInputLayer as the input layer (with size equal to the number of dimensions of the
observation channel) and include at least one lstmLayer.

layers = [
sequenceInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(10)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(20)
fullyConnectedLayer(actInfo.Dimension(1))
tanhLayer
];


Convert the network to a dlnetwork object and display the number of weights.

model = dlnetwork(layers);
summary(model)

Initialized: true

Number of learnables: 880

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions (CTB)

Create the actor using model, and the observation and action specifications.

actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)

actor =
rlContinuousDeterministicActor with properties:

ObservationInfo: [1×1 rl.util.rlNumericSpec]


ActionInfo: [1×1 rl.util.rlNumericSpec]
UseDevice: "cpu"

Check the actor with a random observation input.

act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}

ans = 2×1 single column vector

-0.0742
0.0158

Create a policy object from actor.

policy = rlDeterministicActorPolicy(actor)

policy =
rlDeterministicActorPolicy with properties:

Actor: [1×1 rl.function.rlContinuousDeterministicActor]


ObservationInfo: [1×1 rl.util.rlNumericSpec]
ActionInfo: [1×1 rl.util.rlNumericSpec]
SampleTime: -1

You can access the policy options using dot notation. Check the policy with a random observation
input.

act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}

ans = 2×1

-0.0060
-0.0161


You can train the policy with a custom training loop.

Then, create a policy evaluation block from this policy object using the default name for the
generated MAT-file.

generatePolicyBlock(policy);

This command creates an untitled Simulink® model containing the policy block, and the
blockAgentData.mat file, which contains the information needed to create and initialize the policy
block (such as the trained deep neural network used by the actor within the policy object). The block
loads this data file to properly initialize itself prior to simulation.

You can now drag and drop the block in a Simulink® model and connect it so that it takes the
observation from the environment as input and so that the calculated action is returned to the
environment. This allows you to simulate the policy in a closed loop. You can then generate code for
deployment purposes. For more information, see “Deploy Trained Reinforcement Learning Policies”.

Close the model.

bdclose("untitled")

Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object

Trained reinforcement learning agent, specified as one of the following agent objects. To train your
agent, use the train function.

• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlPGAgent


• rlPPOAgent
• rlTRPOAgent
• rlSACAgent

For agents with a stochastic actor (PG, PPO, SAC, TRPO, AC), the action returned by the generated
policy function depends on the value of the UseExplorationPolicy property of the agent. By
default, UseExplorationPolicy is false and the generated action is deterministic. If
UseExplorationPolicy is true, the generated action is stochastic.
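For example, to generate a block that samples actions stochastically from a trained stochastic-actor agent, you can set this property before calling the function (a sketch, assuming agent is already trained):

agent.UseExplorationPolicy = true;
generatePolicyBlock(agent);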

policy — Reinforcement learning policy


rlMaxQPolicy | rlDeterministicActorPolicy | rlStochasticActorPolicy

Reinforcement learning policy, specified as one of the following objects:

• rlMaxQPolicy
• rlDeterministicActorPolicy
• rlStochasticActorPolicy

Note rlAdditiveNoisePolicy and rlEpsilonGreedyPolicy policy objects are not supported.

dataFileName — Name of generated data file


"blockAgentData" (default) | string | character vector

Name of generated data file, specified as a string or character vector. If a file with the specified name
already exists in the current MATLAB folder, then an appropriate digit is added to the name so that
no existing file is overwritten.

The generated data file contains four structures that store data needed to fully characterize the
policy. Prior to simulation, the block (which is generated with the data file name as mask parameter)
loads this data file to properly initialize itself.
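For example, the following call (the file name is illustrative) stores the policy data in myPolicyBlockData.mat instead of the default file:

generatePolicyBlock(agent,MATFileName="myPolicyBlockData.mat");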

Version History
Introduced in R2019a

See Also
Policy | generatePolicyFunction | train | dlnetwork

Topics
“Generate Policy Block for Deployment”
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
“Deploy Trained Reinforcement Learning Policies”


generatePolicyFunction
Package: rl.policy

Generate function that evaluates policy of an agent or policy object

Syntax
generatePolicyFunction(agent)

generatePolicyFunction(policy)

generatePolicyFunction( ___ ,Name=Value)

Description
This function generates a policy evaluation function which you can use to:

• Generate code for deployment purposes using MATLAB Coder™ or GPU Coder™. For more
information, see “Deploy Trained Reinforcement Learning Policies”.
• Simulate the trained agent in Simulink using a MATLAB Function block.

This function also creates a data file which stores policy information. The evaluation function loads
this data file to properly initialize itself the first time it is called.

For more information on policies and value functions, see “Create Policies and Value Functions”.

generatePolicyFunction(agent) creates a function that evaluates the learned policy of the
specified agent using the default function name, policy name, and data file name.

generatePolicyFunction(policy) creates a function that evaluates the learned policy of the
specified policy object using the default function name, policy name, and data file name.

generatePolicyFunction( ___ ,Name=Value) specifies the function name, policy name, and data
file name using one or more name-value pair arguments.

Examples

Create Policy Evaluation Function for PG Agent

This example shows how to create a policy evaluation function for a PG Agent.

First, create and train a reinforcement learning agent. For this example, load the PG agent trained in
“Train PG Agent to Balance Cart-Pole System”.

load("MATLABCartpolePG.mat","agent")

Then, create a policy evaluation function for this agent using default names.

generatePolicyFunction(agent);


This command creates the evaluatePolicy.m file, which contains the policy function, and the
agentData.mat file, which contains the trained deep neural network actor.

View the generated function.

type evaluatePolicy.m

function action1 = evaluatePolicy(observation1)


%#codegen

% Reinforcement Learning Toolbox


% Generated on: 26-Nov-2022 18:54:46

persistent policy;
if isempty(policy)
policy = coder.loadRLPolicy("agentData.mat");
end
% evaluate the policy
action1 = getAction(policy,observation1);

Evaluate the policy for a random observation.

evaluatePolicy(rand(agent.ObservationInfo.Dimension))

ans = 10

You can now generate code for this policy function using MATLAB® Coder™. For more information,
see “Deploy Trained Reinforcement Learning Policies”.

Create Policy Evaluation Function from Policy Object

You can create and train a policy object in a custom training loop or extract a trained object from a
trained agent. For this example, load the PG agent trained in “Train PG Agent to Balance Cart-Pole
System”, and extract its greedy policy using getGreedyPolicy. Alternatively, you can extract an
explorative policy using getExplorationPolicy.

load("MATLABCartpolePG.mat","agent")
policy = getGreedyPolicy(agent)

policy =
rlStochasticActorPolicy with properties:

Actor: [1x1 rl.function.rlDiscreteCategoricalActor]


UseMaxLikelihoodAction: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 1

Then, create a policy evaluation function for this policy using default names.

generatePolicyFunction(policy);

This command creates the evaluatePolicy.m file, which contains the policy function, and the
agentData.mat file, which contains the trained deep neural network actor.


View the generated function.

type evaluatePolicy.m

function action1 = evaluatePolicy(observation1)


%#codegen

% Reinforcement Learning Toolbox


% Generated on: 26-Nov-2022 18:54:53

persistent policy;
if isempty(policy)
policy = coder.loadRLPolicy("agentData.mat");
end
% evaluate the policy
action1 = getAction(policy,observation1);

Evaluate the policy for a random observation.

evaluatePolicy(rand(policy.ObservationInfo.Dimension))

ans = 10

You can now generate code for this policy function using MATLAB® Coder™. For more information,
see “Deploy Trained Reinforcement Learning Policies”.

Create Policy Evaluation Function for Q-Learning Agent

This example shows how to create a policy evaluation function for a Q-Learning Agent.

For this example, load the Q-learning agent trained in “Train Reinforcement Learning Agent in Basic
Grid World”

load("basicGWQAgent.mat","qAgent")

Create a policy evaluation function for this agent and specify the name of the agent data file.

generatePolicyFunction(qAgent,"MATFileName","policyFile.mat")

This command creates the evaluatePolicy.m file, which contains the policy function, and the
policyFile.mat file, which contains the trained Q table value function.

View the generated function.

type evaluatePolicy.m

function action1 = evaluatePolicy(observation1)


%#codegen

% Reinforcement Learning Toolbox


% Generated on: 05-May-2022 17:43:27

persistent policy;
if isempty(policy)
policy = coder.loadRLPolicy("policyFile.mat");
end


% evaluate the policy


action1 = getAction(policy,observation1);

Evaluate the policy for a random observation.

evaluatePolicy(randi(25))

ans = 3

You can now generate code for this policy function using MATLAB® Coder™. For more information,
see “Deploy Trained Reinforcement Learning Policies”.

Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object

Trained reinforcement learning agent, specified as one of the following agent objects. To train your
agent, use the train function.

• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlPGAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent

For agents with a stochastic actor (PG, PPO, SAC, TRPO, AC), the action returned by the generated
policy function depends on the value of the UseExplorationPolicy property of the agent. By
default, UseExplorationPolicy is false and the generated action is deterministic. If
UseExplorationPolicy is true, the generated action is stochastic.

policy — Reinforcement learning policy


rlMaxQPolicy | rlDeterministicActorPolicy | rlStochasticActorPolicy

Reinforcement learning policy, specified as one of the following objects:

• rlMaxQPolicy
• rlDeterministicActorPolicy
• rlStochasticActorPolicy

Note rlAdditiveNoisePolicy and rlEpsilonGreedyPolicy policy objects are not supported.


Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the
argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: FunctionName="computeAction"

FunctionName — Name of the generated function


"evaluatePolicy" (default) | string | character vector

Name of the generated function, specified as a string or character vector.

PolicyName — Name of the policy object within the generated function


"policy" (default) | string | character vector

Name of the policy object within the generated function, specified as a string or character vector.

MATFileName — Name of generated data file


"agentData" (default) | string | character vector

Name of generated data file, specified as a string or character vector. If a file with the specified name
already exists in the current MATLAB folder, then an appropriate digit is added to the name so that
no existing file is overwritten.

The generated data file contains four structures that store data needed to fully characterize the
policy. The evaluation function loads this data file to properly initialize itself the first time it is called.
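For example, the following call (the function, policy, and file names are illustrative) customizes all three names at once:

generatePolicyFunction(agent, ...
    FunctionName="computeAction", ...
    PolicyName="trainedPolicy", ...
    MATFileName="trainedPolicyData.mat");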

Version History
Introduced in R2019a

Code generated by generatePolicyFunction now uses policy objects


Behavior change in future release

The code generated by generatePolicyFunction now loads a deployable policy object from a
reinforcement learning agent. The results from running the generated policy function remain the
same.

See Also
generatePolicyBlock | Policy | train | dlnetwork

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
“Deploy Trained Reinforcement Learning Policies”


generateRewardFunction
Generate a reward function from control specifications to train a reinforcement learning agent

Syntax
generateRewardFunction(mpcobj)
generateRewardFunction(blks)
generateRewardFunction( ___ ,'FunctionName',myFcnName)

Description
generateRewardFunction(mpcobj) generates a MATLAB reward function based on the cost and
constraints defined in the linear or nonlinear MPC object mpcobj. The generated reward function is
displayed in a new editor window and you can use it as a starting point for reward design. You can
tune the weights, use a different penalty function, and then use the resulting reward function within
an environment to train an agent.

This syntax requires Model Predictive Control Toolbox™ software.

generateRewardFunction(blks) generates a MATLAB reward function based on performance
constraints defined in the model verification blocks specified in the array of block paths blks.

This syntax requires Simulink Design Optimization™ software.

generateRewardFunction( ___ ,'FunctionName',myFcnName) specifies the name of the
generated reward function, and saves it into a file with the same name. It also overwrites any
preexisting file with the same name in the current directory. Provide this name after either of the
previous input arguments.

Examples

Generate a Reward Function from MPC object

This example shows how to generate a reinforcement learning reward function from an MPC object.

Define Plant and Create MPC Controller

Create a random plant using the rss function and set the feedthrough matrix to zero.

plant = rss(4,3,2);
plant.d = 0;

Specify which of the plant signals are manipulated variables, measured disturbances, measured
outputs and unmeasured outputs.

plant = setmpcsignals(plant,'MV',1,'MD',2,'MO',[1 2],'UO',3);

Create an MPC controller with a sample time of 0.1 and prediction and control horizons of 10 and 3
steps, respectively.


mpcobj = mpc(plant,0.1,10,3);

-->The "Weights.ManipulatedVariables" property is empty. Assuming default 0.00000.


-->The "Weights.ManipulatedVariablesRate" property is empty. Assuming default 0.10000.
-->The "Weights.OutputVariables" property is empty. Assuming default 1.00000.
for output(s) y1 and zero weight for output(s) y2 y3

Set limits and a scale factor for the manipulated variable.


mpcobj.ManipulatedVariables.Min = -2;
mpcobj.ManipulatedVariables.Max = 2;
mpcobj.ManipulatedVariables.ScaleFactor = 4;

Set weights for the quadratic cost function.


mpcobj.Weights.OutputVariables = [10 1 0.1];
mpcobj.Weights.ManipulatedVariablesRate = 0.2;

Generate the Reward Function

Generate the reward function code from specifications in the mpc object using
generateRewardFunction. The code is displayed in the MATLAB Editor.
generateRewardFunction(mpcobj)

For this example, the code is saved in the MATLAB function file myMpcRewardFcn.m. Display the
generated reward function.
type myMpcRewardFcn.m

function reward = myMpcRewardFcn(y,refy,mv,refmv,lastmv)


% MYMPCREWARDFCN generates rewards from MPC specifications.
%
% Description of input arguments:
%
% y : Output variable from plant at step k+1
% refy : Reference output variable at step k+1
% mv : Manipulated variable at step k
% refmv : Reference manipulated variable at step k
% lastmv : Manipulated variable at step k-1
%
% Limitations (MPC and NLMPC):
% - Reward computed based on first step in prediction horizon.
% Therefore, signal previewing and control horizon settings are ignored.
% - Online cost and constraint update is not supported.
% - Custom cost and constraint specifications are not considered.
% - Time varying cost weights and constraints are not supported.
% - Mixed constraint specifications are not considered (for the MPC case).

% Reinforcement Learning Toolbox


% 27-May-2021 14:47:29

%#codegen

%% Specifications from MPC object


% Standard linear bounds as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
ymin = [-Inf -Inf -Inf];
ymax = [Inf Inf Inf];


mvmin = -2;
mvmax = 2;
mvratemin = -Inf;
mvratemax = Inf;

% Scale factors as specified in 'States', 'OutputVariables', and


% 'ManipulatedVariables' properties
Sy = [1 1 1];
Smv = 4;

% Standard cost weights as specified in 'Weights' property


Qy = [10 1 0.1];
Qmv = 0;
Qmvrate = 0.2;

%% Compute cost
dy = (refy(:)-y(:)) ./ Sy';
dmv = (refmv(:)-mv(:)) ./ Smv';
dmvrate = (mv(:)-lastmv(:)) ./ Smv';
Jy = dy' * diag(Qy.^2) * dy;
Jmv = dmv' * diag(Qmv.^2) * dmv;
Jmvrate = dmvrate' * diag(Qmvrate.^2) * dmvrate;
Cost = Jy + Jmv + Jmvrate;

%% Penalty function weight (specify nonnegative)


Wy = [1 1 1];
Wmv = 10;
Wmvrate = 10;

%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
%
% Set Pmv value to 0 if the RL agent action specification has
% appropriate 'LowerLimit' and 'UpperLimit' values.
Py = Wy * exteriorPenalty(y,ymin,ymax,'step');
Pmv = Wmv * exteriorPenalty(mv,mvmin,mvmax,'step');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'step');
Penalty = Py + Pmv + Pmvrate;

%% Compute reward
reward = -(Cost + Penalty);
end

The calculated reward depends only on the current values of the plant input and output signals and
their reference values, and it is composed of two parts.

The first is a negative cost that depends on the squared difference between desired and current plant
inputs and outputs. This part uses the cost function weights specified in the MPC object. The second
part is a penalty that acts as a negative reward whenever the current plant signals violate the
constraints.

The generated reward function is a starting point for reward design. You can tune the weights or use
a different penalty function to define a more appropriate reward for your reinforcement learning
agent.

Generate Reward Function from Verification Block

This example shows how to generate a reinforcement learning reward function from a Simulink
Design Optimization model verification block.

For this example, open the Simulink model LevelCheckBlock.slx, which contains a Check Step
Response Characteristics block named Level Check.

open_system('LevelCheckBlock')


Generate the reward function code from specifications in the Level Check block, using
generateRewardFunction. The code is displayed in the MATLAB Editor.

generateRewardFunction('LevelCheckBlock/Level Check')

For this example, the code is saved in the MATLAB function file myBlockRewardFcn.m.

Display the generated reward function.

type myBlockRewardFcn.m

function reward = myBlockRewardFcn(x,t)


% MYBLOCKREWARDFCN generates rewards from Simulink block specifications.
%
% x : Input of LevelCheckBlock/Level Check
% t : Simulation time (s)

% Reinforcement Learning Toolbox


% 27-May-2021 16:45:27

%#codegen

%% Specifications from LevelCheckBlock/Level Check


Block1_InitialValue = 1;
Block1_FinalValue = 2;
Block1_StepTime = 0;
Block1_StepRange = Block1_FinalValue - Block1_InitialValue;
Block1_MinRise = Block1_InitialValue + Block1_StepRange * 80/100;
Block1_MaxSettling = Block1_InitialValue + Block1_StepRange * (1+2/100);
Block1_MinSettling = Block1_InitialValue + Block1_StepRange * (1-2/100);
Block1_MaxOvershoot = Block1_InitialValue + Block1_StepRange * (1+10/100);
Block1_MinUndershoot = Block1_InitialValue - Block1_StepRange * 5/100;

if t >= Block1_StepTime
if Block1_InitialValue <= Block1_FinalValue
Block1_UpperBoundTimes = [0,5; 5,max(5+1,t+1)];
Block1_UpperBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettlin
Block1_LowerBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
Block1_LowerBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,
else
Block1_UpperBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
Block1_UpperBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,
Block1_LowerBoundTimes = [0,5; 5,max(5+1,t+1)];
Block1_LowerBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettlin
end

Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1));
for idx = 1:numel(Block1_xmax)
tseg = Block1_UpperBoundTimes(idx,:);
xseg = Block1_UpperBoundAmplitudes(idx,:);
Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN);
end
if all(isnan(Block1_xmax))
Block1_xmax = Inf;
else
Block1_xmax = max(Block1_xmax,[],'omitnan');
end

Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1));
for idx = 1:numel(Block1_xmin)
tseg = Block1_LowerBoundTimes(idx,:);
xseg = Block1_LowerBoundAmplitudes(idx,:);
Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN);
end
if all(isnan(Block1_xmin))
Block1_xmin = -Inf;
else
Block1_xmin = max(Block1_xmin,[],'omitnan');
end
else
Block1_xmin = -Inf;
Block1_xmax = Inf;
end

%% Penalty function weight (specify nonnegative)


Weight = 1;

%% Compute penalty


% Penalty is computed for violation of linear bound constraints.


%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
Penalty = sum(exteriorPenalty(x,Block1_xmin,Block1_xmax,'step'));

%% Compute reward
reward = -Weight * Penalty;
end

The generated reward function takes as input arguments the current value of the verification block
input signals and the simulation time. A negative reward is calculated using a weighted penalty that
acts whenever the current block input signals violate the linear bound constraints defined in the
verification block.

The generated reward function is a starting point for reward design. You can tune the weights or use
a different penalty function to define a more appropriate reward for your reinforcement learning
agent.

Close the Simulink model.

close_system('LevelCheckBlock')

Input Arguments
mpcobj — Linear or nonlinear MPC object
mpc object | nlmpc object

Linear or nonlinear MPC object, specified as an mpc object or an nlmpc object, respectively.

Note that:

• The generated function calculates rewards using signal values at the current time only. Predicted
future values, signal previewing, and control horizon settings are not used in the reward
calculation.
• Using time-varying cost weights and constraints, or updating them online, is not supported.
• Only the standard quadratic cost function, as described in “Optimization Problem” (Model
Predictive Control Toolbox), is supported. Therefore, for mpc objects, using mixed constraint
specifications is not supported. Similarly, for nlmpc objects, custom cost and constraint
specifications are not supported.

Example: mpc(tf([1 1],[1 2 0]),0.1)

blks — Path to model verification blocks


char array | cell array | string array

Path to model verification blocks, specified as a character array, cell array, or string array. The
supported Simulink Design Optimization model verification blocks are the following ones.


• Check Against Reference (Simulink Design Optimization)


• Check Custom Bounds (Simulink Design Optimization)
• Check Step Response Characteristics (Simulink Design Optimization)
• Any block belonging to the Simulink “Model Verification” (Simulink) library

The generated reward function takes as input arguments the current value of the verification block
input signals and the simulation time. A negative reward is calculated using a weighted penalty that
acts whenever the current block input signals violate the linear bound constraints defined in the
verification block.
Example: "mySimulinkModel02/Check Against Reference"

myFcnName — Function name


string | char vector

Function name, specified as a string object or character vector.


Example: "reward03epf_step"

Tips
By default, the exterior bound penalty function exteriorPenalty is used to calculate the penalty.
Alternatively, to calculate hyperbolic and barrier penalties, you can use the hyperbolicPenalty or
barrierPenalty functions.
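For example, in a generated reward function you could replace the exterior penalty terms with barrier penalties (a sketch; the weights, bounds, and signal names are those defined in the generated file):

Py      = Wy      * barrierPenalty(y,ymin,ymax);
Pmv     = Wmv     * barrierPenalty(mv,mvmin,mvmax);
Pmvrate = Wmvrate * barrierPenalty(mv-lastmv,mvratemin,mvratemax);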

Version History
Introduced in R2021b

See Also
Functions
exteriorPenalty | hyperbolicPenalty | barrierPenalty

Objects
mpc | nlmpc

Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Generate Reward Function from a Model Verification Block for a Water Tank System”
“Define Reward Signals”
“Create MATLAB Reinforcement Learning Environments”
“Create Simulink Reinforcement Learning Environments”


getActionInfo
Obtain action data specifications from reinforcement learning environment, agent, or experience
buffer

Syntax
actInfo = getActionInfo(env)
actInfo = getActionInfo(agent)
actInfo = getActionInfo(buffer)

Description
actInfo = getActionInfo(env) extracts action information from reinforcement learning
environment env.

actInfo = getActionInfo(agent) extracts action information from reinforcement learning
agent agent.

actInfo = getActionInfo(buffer) extracts action information from experience buffer buffer.

Examples

Extract Action and Observation Information from Reinforcement Learning Environment

Extract action and observation information that you can use to create other environments or agents.

The reinforcement learning environment for this example is the simple longitudinal dynamics for ego
car and lead car. The training goal is to make the ego car travel at a set velocity while maintaining a
safe distance from lead car by controlling longitudinal acceleration (and braking). This example uses
the same vehicle model as the “Adaptive Cruise Control System Using Model Predictive Control”
(Model Predictive Control Toolbox) example.

Open the model and create the reinforcement learning environment.

mdl = 'rlACCMdl';
open_system(mdl);
agentblk = [mdl '/RL Agent'];
% create the observation info
obsInfo = rlNumericSpec([3 1],'LowerLimit',-inf*ones(3,1),'UpperLimit',inf*ones(3,1));
obsInfo.Name = 'observations';
obsInfo.Description = 'information on velocity error and ego velocity';
% action Info
actInfo = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
actInfo.Name = 'acceleration';
% define environment
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo)

env =
SimulinkEnvWithAgent with properties:


Model : rlACCMdl
AgentBlock : rlACCMdl/RL Agent
ResetFcn : []
UseFastRestart : on

The reinforcement learning environment env is a SimulinkEnvWithAgent object with the above
properties.

Extract the action and observation information from the reinforcement learning environment env.

actInfoExt = getActionInfo(env)

actInfoExt =
rlNumericSpec with properties:

LowerLimit: -3
UpperLimit: 2
Name: "acceleration"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"

obsInfoExt = getObservationInfo(env)

obsInfoExt =
rlNumericSpec with properties:

LowerLimit: [3x1 double]


UpperLimit: [3x1 double]
Name: "observations"
Description: "information on velocity error and ego velocity"
Dimension: [3 1]
DataType: "double"

The action information contains acceleration values while the observation information contains the
velocity and velocity error values of the ego vehicle.

Input Arguments
env — Reinforcement learning environment
rlFunctionEnv object | SimulinkEnvWithAgent object | rlNeuralNetworkEnvironment object
| predefined MATLAB environment object

Reinforcement learning environment from which to extract the action information, specified as one of
the following:

• MATLAB environment represented as one of the following objects.

• rlFunctionEnv
• rlNeuralNetworkEnvironment
• Predefined MATLAB environment created using rlPredefinedEnv


• Simulink environment represented as a SimulinkEnvWithAgent object.

For more information on reinforcement learning environments, see “Create MATLAB Reinforcement
Learning Environments” and “Create Simulink Reinforcement Learning Environments”.

agent — Reinforcement learning agent


rlQAgent object | rlSARSAAgent object | rlDQNAgent object | rlPGAgent object | rlDDPGAgent
object | rlTD3Agent object | rlACAgent object | rlPPOAgent object | rlTRPOAgent object |
rlSACAgent object | rlMBPOAgent object

Reinforcement learning agent from which to extract the action information, specified as one of the
following objects.

• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlPGAgent
• rlACAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent
• rlMBPOAgent

For more information on reinforcement learning agents, see “Reinforcement Learning Agents”.

buffer — Experience buffer


rlReplayMemory object | rlPrioritizedReplayMemory object

Experience buffer from which to extract the action information, specified as an rlReplayMemory or
rlPrioritizedReplayMemory object.
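
For instance, a minimal sketch extracting the action specification from a replay memory buffer (the
specifications and buffer capacity below are assumptions chosen only for illustration):

% Extract the action specification from an experience buffer.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
buffer = rlReplayMemory(obsInfo,actInfo,10000);   % capacity of 10000 experiences
actInfoFromBuffer = getActionInfo(buffer)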

Output Arguments
actInfo — Action data specifications
array of rlNumericSpec objects | array of rlFiniteSetSpec objects

Action data specifications extracted from the reinforcement learning environment, returned as an
array of one of the following:

• rlNumericSpec objects
• rlFiniteSetSpec objects
• A mix of rlNumericSpec and rlFiniteSetSpec objects

Version History
Introduced in R2019a


See Also
rlNumericSpec | rlFiniteSetSpec | getObservationInfo | rlQAgent | rlSARSAAgent |
rlDQNAgent | rlPGAgent | rlACAgent | rlDDPGAgent

Topics
“Create Simulink Reinforcement Learning Environments”
“Reinforcement Learning Agents”


getAction
Package: rl.policy

Obtain action from agent, actor, or policy object given environment observations

Syntax
action = getAction(agent,obs)
[action,agent] = getAction(agent,obs)

action = getAction(actor,obs)
[action,nextState] = getAction(actor,obs)

action = getAction(policy,obs)
[action,updatedPolicy] = getAction(policy,obs)

Description
Agent

action = getAction(agent,obs) returns the action generated from the policy of a
reinforcement learning agent, given environment observations. If agent contains internal states,
they are updated.

[action,agent] = getAction(agent,obs) also returns the updated agent as an output argument.
Actor

action = getAction(actor,obs) returns the action generated from the policy represented by
the actor actor, given environment observations obs.

[action,nextState] = getAction(actor,obs) also returns the updated state of the actor
when it uses a recurrent neural network.
Policy

action = getAction(policy,obs) returns the action generated from the policy object policy,
given environment observations.

[action,updatedPolicy] = getAction(policy,obs) also returns the updated policy as an
output argument (any internal state of the policy, if used, is updated).

Examples

Get Actions from Agent

Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”.


% load predefined environment


env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

Obtain the observation and action specifications for this environment.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a TRPO agent from the environment observation and action specifications.

agent = rlTRPOAgent(obsInfo,actInfo);

Use getAction to return the action from a random observation.

getAction(agent, ...
{rand(obsInfo(1).Dimension), ...
rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[-2]}

You can also obtain actions for a batch of observations. For example, obtain actions for a batch of 10
observations.

actBatch = getAction(agent, ...
    {rand([obsInfo(1).Dimension 10]), ...
     rand([obsInfo(2).Dimension 10])});
size(actBatch{1})

ans = 1×3

1 1 10

actBatch{1}(1,1,7)

ans = -2

actBatch contains one action for each observation in the batch.

Get Action from Deterministic Actor

Create observation and action information. You can also obtain these specifications from an
environment.

obsinfo = rlNumericSpec([4 1]);


actinfo = rlNumericSpec([2 1]);

Create a deep neural network for the actor.

net = [featureInputLayer(obsinfo.Dimension(1), ...
           'Normalization','none','Name','state')
       fullyConnectedLayer(10,'Name','fc1')
       reluLayer('Name','relu1')
       fullyConnectedLayer(20,'Name','CriticStateFC2')
       fullyConnectedLayer(actinfo.Dimension(1),'Name','fc2')
       tanhLayer('Name','tanh1')];
net = dlnetwork(net);

Create a deterministic actor object using the network.

actor = rlContinuousDeterministicActor(net, ...
    obsinfo,actinfo, ...
    'ObservationInputNames',{'state'});

Obtain an action from this actor for a random batch of 10 observations.

act = getAction(actor,{rand(4,1,10)})

act = 1x1 cell array


{2x1x10 single}

act is a single-element cell array that contains the two-element action computed for each of the 10 observations in the batch.

act{1}(:,1,7)

ans = 2x1 single column vector

0.2643
-0.2934

Get Actions from Policy Object

Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.

obsInfo = rlNumericSpec([4 1]);


actInfo = rlNumericSpec([2 1]);

Alternatively, you can use getObservationInfo and getActionInfo to extract the specification
objects from an environment.

Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.

To approximate the policy function within the actor, use a recurrent deep neural network model.
Define the network as an array of layer objects, and get the dimension of the observation and action
spaces from the environment specification objects. To create a recurrent network, use a
sequenceInputLayer as the input layer (with size equal to the number of dimensions of the
observation channel) and include at least one lstmLayer.

layers = [
sequenceInputLayer(obsInfo.Dimension(1))
lstmLayer(2)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];


Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)

Initialized: true

Number of learnables: 62

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions

Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)

actor =
rlContinuousDeterministicActor with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlNumericSpec]
UseDevice: "cpu"

Check the actor with a random observation input.


act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}

ans = 2x1 single column vector

0.0568
0.0691

Create an additive noise policy object from actor.


policy = rlAdditiveNoisePolicy(actor)

policy =
rlAdditiveNoisePolicy with properties:

Actor: [1x1 rl.function.rlContinuousDeterministicActor]


NoiseType: "gaussian"
NoiseOptions: [1x1 rl.option.GaussianActionNoise]
EnableNoiseDecay: 1
UseNoisyAction: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlNumericSpec]
SampleTime: -1

Use dot notation to set the standard deviation decay rate.


policy.NoiseOptions.StandardDeviationDecayRate = 0.9;

Use getAction to generate an action from the policy, given a random observation input.
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}


ans = 2×1

0.5922
-0.3745

Display the state of the recurrent neural network in the policy object.

xNN = getRNNState(policy);
xNN{1}

ans = 2x1 single column vector

0
0

Use getAction to also return the updated policy as a second argument.

[act, updatedPolicy] = getAction(policy,{rand(obsInfo.Dimension)});

Display the state of the recurrent neural network in the updated policy object.

xpNN = getRNNState(updatedPolicy);
xpNN{1}

ans = 2x1 single column vector

0.3327
-0.2479

As expected, the state is updated.

Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object

Reinforcement learning agent, specified as one of the following objects:

• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.


Note agent is a handle object, so it is updated whether it is returned as an output argument or not.
For more information about handle objects, see “Handle Object Behavior”.

actor — Actor
rlContinuousDeterministicActor object | rlContinuousGaussianActor object |
rlDiscreteCategoricalActor object

Actor, specified as an rlContinuousDeterministicActor, rlDiscreteCategoricalActor, or
rlContinuousGaussianActor object.

policy — Reinforcement learning policy


rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlAdditiveNoisePolicy | rlStochasticActorPolicy

Reinforcement learning policy, specified as one of the following objects:

• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy

obs — Environment observations


cell array

Environment observations, specified as a cell array with as many elements as there are observation
input channels. Each element of obs contains an array of observations for a single observation input
channel.

The dimensions of each element in obs are MO-by-LB-by-LS, where:

• MO corresponds to the dimensions of the associated observation input channel.


• LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of observations,
specify LB > 1. If the agent, actor, or policy object given as the first input argument has multiple
observation input channels, then LB must be the same for all elements of obs.
• LS specifies the sequence length for a recurrent neural network. If the agent, actor, or policy object
given as the first input argument does not use a recurrent neural network, then LS = 1. If it has
multiple observation input channels, then LS must be the same for all elements of obs.

For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.

Output Arguments
action — Action
single-element cell array

Action, returned as a single-element cell array containing an array with dimensions MA-by-LB-by-LS,
where:


• MA corresponds to the dimensions of the associated action specification.


• LB is the batch size.
• LS is the sequence length for recurrent neural networks. If the agent, actor, or policy calculating
action does not use a recurrent neural network, then LS = 1.

Note The following continuous action-space actor, policy and agent objects do not enforce the
constraints set by the action specification:

• rlContinuousDeterministicActor
• rlStochasticActorPolicy
• rlACAgent
• rlPGAgent
• rlPPOAgent

In these cases, you must enforce action space constraints within the environment.
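
For instance, a minimal sketch of saturating an out-of-range action against the limits stored in the
action specification (the specification and action value below are illustrative assumptions):

% Clip an unconstrained action to the limits in the action specification.
actInfo = rlNumericSpec([1 1],'LowerLimit',-2,'UpperLimit',2);   % example limits
act = 3.5;                                                       % action proposed by the agent
actSat = min(max(act,actInfo.LowerLimit),actInfo.UpperLimit)     % returns 2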

nextState — Next state of the actor


cell array

Next state of the actor, returned as a cell array. If actor does not use a recurrent neural network,
then nextState is an empty cell array.

You can set the state of the actor to nextState using the setState function. For example:
actor = setState(actor,nextState);

agent — Updated agent


reinforcement learning agent object

Updated agent, returned as the same agent object as the agent in the input argument. Note that
agent is a handle object. Therefore, its internal states (if any) are updated whether agent is
returned as an output argument or not. For more information about handle objects, see “Handle
Object Behavior”.

updatedPolicy — Updated policy object


reinforcement learning policy object

Updated policy object. It is identical to the policy object supplied as the first input argument, except that
its internal states (if any) are updated.

Tips
The function evaluate behaves, for actor objects, similarly to getAction, except for the following
differences.

• For an rlDiscreteCategoricalActor actor object, evaluate returns the probability of each
possible action (instead of a sampled action, as getAction does).
• For an rlContinuousGaussianActor actor object, evaluate returns the mean and standard
deviation of the Gaussian distribution (instead of a sampled action, as getAction does).
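
For instance, a minimal sketch comparing the two calls for a discrete-action actor (the specifications
and network layout below are assumptions chosen only for illustration):

% Compare getAction and evaluate for a discrete categorical actor.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
net = dlnetwork([featureInputLayer(4) fullyConnectedLayer(3) softmaxLayer]);
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

act = getAction(actor,{rand(4,1)});    % one sampled action
prb = evaluate(actor,{rand(4,1)});     % probabilities of the three possible actions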


Version History
Introduced in R2020a

See Also
evaluate | getValue | getMaxQValue

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”


getActor
Package: rl.agent

Get actor from reinforcement learning agent

Syntax
actor = getActor(agent)

Description
actor = getActor(agent) returns the actor object from the specified reinforcement learning
agent.

Examples

Modify Actor Parameter Values

Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.

load('DoubleIntegDDPG.mat','agent')

Obtain the actor function approximator from the agent.

actor = getActor(agent);

Obtain the learnable parameters from the actor.

params = getLearnableParameters(actor)

params=2×1 cell array


{[-15.4622 -7.2252]}
{[ 0]}

Modify the parameter values. For this example, simply multiply all of the parameters by 2.

modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);

Set the parameter values of the actor to the new modified values.

actor = setLearnableParameters(actor,modifiedParams);

Set the actor in the agent to the new modified actor.

setActor(agent,actor);

Display the new parameter values.

getLearnableParameters(getActor(agent))


ans=2×1 cell array


{[-30.9244 -14.4504]}
{[ 0]}

Modify Deep Neural Networks in Reinforcement Learning Agent

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.

Load the predefined environment.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Obtain observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.

agent = rlPPOAgent(obsInfo,actInfo);

To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.

actor = getActor(agent);
critic = getCritic(agent);

Extract the deep neural networks from both the actor and critic function approximators.

actorNet = getModel(actor);
criticNet = getModel(critic);

The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.

For example, view the actor network.

plot(layerGraph(actorNet))


To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)

You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))

In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.

To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.

For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks


Each of the modified networks includes an additional fullyConnectedLayer and reluLayer in
their main common path. View the modified actor network.

plot(layerGraph(modifiedActorNet))

After exporting the networks, insert the networks into the actor and critic function approximators.

actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);

Finally, insert the modified actor and critic function approximators into the agent.

agent = setActor(agent,actor);
agent = setCritic(agent,critic);

Input Arguments
agent — Reinforcement learning agent
rlDDPGAgent object | rlTD3Agent object | rlPGAgent object | rlACAgent object | rlPPOAgent
object | rlSACAgent object

Reinforcement learning agent that contains an actor, specified as one of the following:

• rlPGAgent object
• rlDDPGAgent object


• rlTD3Agent object
• rlACAgent object
• rlSACAgent object
• rlPPOAgent object
• rlTRPOAgent object

Output Arguments
actor — Actor
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object

Actor object, returned as one of the following:

• rlContinuousDeterministicActor object — Returned when agent is an rlDDPGAgent or


rlTD3Agent object
• rlDiscreteCategoricalActor object — Returned when agent is an rlACAgent, rlPGAgent,
rlPPOAgent, rlTRPOAgent or rlSACAgent object for an environment with a discrete action
space.
• rlContinuousGaussianActor object — Returned when agent is an rlACAgent, rlPGAgent,
rlPPOAgent, rlTRPOAgent or rlSACAgent object for an environment with a continuous action
space.

Version History
Introduced in R2019a

See Also
getCritic | setActor | setCritic | getModel | setModel | getLearnableParameters |
setLearnableParameters

Topics
“Create Policies and Value Functions”
“Import Neural Network Models”


getCritic
Package: rl.agent

Get critic from reinforcement learning agent

Syntax
critic = getCritic(agent)

Description
critic = getCritic(agent) returns the critic object from the specified reinforcement learning
agent.

Examples

Modify Critic Parameter Values

Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.

load('DoubleIntegDDPG.mat','agent')

Obtain the critic function approximator from the agent.

critic = getCritic(agent);

Obtain the learnable parameters from the critic.

params = getLearnableParameters(critic)

params=2×1 cell array


{[-4.9889 -1.5548 -0.3434 -0.1111 -0.0500 -0.0035]}
{[ 0]}

Modify the parameter values. For this example, simply multiply all of the parameters by 2.

modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);

Set the parameter values of the critic to the new modified values.

critic = setLearnableParameters(critic,modifiedParams);

Set the critic in the agent to the new modified critic.

setCritic(agent,critic);

Display the new parameter values.

getLearnableParameters(getCritic(agent))


ans=2×1 cell array


{[-9.9778 -3.1095 -0.6867 -0.2223 -0.1000 -0.0069]}
{[ 0]}

Modify Deep Neural Networks in Reinforcement Learning Agent

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.

Load the predefined environment.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Obtain observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.

agent = rlPPOAgent(obsInfo,actInfo);

To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.

actor = getActor(agent);
critic = getCritic(agent);

Extract the deep neural networks from both the actor and critic function approximators.

actorNet = getModel(actor);
criticNet = getModel(critic);

The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.

For example, view the actor network.

plot(layerGraph(actorNet))


To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)

You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))

In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.

To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.

For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks


Each of the modified networks includes an additional fullyConnectedLayer and reluLayer in
their main common path. View the modified actor network.

plot(layerGraph(modifiedActorNet))

After exporting the networks, insert the networks into the actor and critic function approximators.

actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);

Finally, insert the modified actor and critic function approximators into the agent.

agent = setActor(agent,actor);
agent = setCritic(agent,critic);

Input Arguments
agent — Reinforcement learning agent
rlQAgent | rlSARSAAgent | rlDQNAgent | rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent
| rlSACAgent | rlPPOAgent | rlTRPOAgent

Reinforcement learning agent that contains a critic, specified as one of the following objects:

• rlQAgent
• rlSARSAAgent


• rlDQNAgent
• rlPGAgent (when using a critic to estimate a baseline value function)
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent

Output Arguments
critic — Critic
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object | two-
element row vector of rlQValueFunction objects

Critic object, returned as one of the following:

• rlValueFunction object — Returned when agent is an rlACAgent, rlPGAgent, or
rlPPOAgent object.
• rlQValueFunction object — Returned when agent is an rlQAgent, rlSARSAAgent,
rlDQNAgent, rlDDPGAgent, or rlTD3Agent object with a single critic.
• rlVectorQValueFunction object — Returned when agent is an rlQAgent, rlSARSAAgent, or
rlDQNAgent object with a discrete action space and a vector Q-value function critic.
• Two-element row vector of rlQValueFunction objects — Returned when agent is an
rlTD3Agent or rlSACAgent object with two critics.

Version History
Introduced in R2019a

See Also
getActor | setActor | setCritic | getModel | setModel | getLearnableParameters |
setLearnableParameters

Topics
“Create Policies and Value Functions”
“Import Neural Network Models”


getLearnableParameters
Package: rl.policy

Obtain learnable parameter values from agent, function approximator, or policy object

Syntax
pars = getLearnableParameters(agent)

pars = getLearnableParameters(fcnAppx)

pars = getLearnableParameters(policy)

Description
Agent

pars = getLearnableParameters(agent) returns the learnable parameter values from the
agent object agent.
Actor or Critic

pars = getLearnableParameters(fcnAppx) returns the learnable parameter values from the
actor or critic function approximator object fcnAppx.
Policy

pars = getLearnableParameters(policy) returns the learnable parameter values from the
policy object policy.

Examples

Modify Critic Parameter Values

Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.

load('DoubleIntegDDPG.mat','agent')

Obtain the critic function approximator from the agent.

critic = getCritic(agent);

Obtain the learnable parameters from the critic.

params = getLearnableParameters(critic)

params=2×1 cell array


{[-4.9889 -1.5548 -0.3434 -0.1111 -0.0500 -0.0035]}
{[ 0]}


Modify the parameter values. For this example, simply multiply all of the parameters by 2.

modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);

Set the parameter values of the critic to the new modified values.

critic = setLearnableParameters(critic,modifiedParams);

Set the critic in the agent to the new modified critic.

setCritic(agent,critic);

Display the new parameter values.

getLearnableParameters(getCritic(agent))

ans=2×1 cell array


{[-9.9778 -3.1095 -0.6867 -0.2223 -0.1000 -0.0069]}
{[ 0]}

Modify Actor Parameter Values

Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.

load('DoubleIntegDDPG.mat','agent')

Obtain the actor function approximator from the agent.

actor = getActor(agent);

Obtain the learnable parameters from the actor.

params = getLearnableParameters(actor)

params=2×1 cell array


{[-15.4622 -7.2252]}
{[ 0]}

Modify the parameter values. For this example, simply multiply all of the parameters by 2.

modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);

Set the parameter values of the actor to the new modified values.

actor = setLearnableParameters(actor,modifiedParams);

Set the actor in the agent to the new modified actor.

setActor(agent,actor);

Display the new parameter values.

getLearnableParameters(getActor(agent))


ans=2×1 cell array


{[-30.9244 -14.4504]}
{[ 0]}

Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object

Reinforcement learning agent, specified as one of the following objects:

• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.

fcnAppx — Actor or critic function object


rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object

Actor or critic function object, specified as one of the following:

• rlValueFunction object — Value function critic


• rlQValueFunction object — Q-value function critic
• rlVectorQValueFunction object — Multi-output Q-value function critic with a discrete action
space
• rlContinuousDeterministicActor object — Deterministic policy actor with a continuous
action space
• rlDiscreteCategoricalActor — Stochastic policy actor with a discrete action space
• rlContinuousGaussianActor object — Stochastic policy actor with a continuous action space

To create an actor or critic function object, use one of the following methods.

• Create a function object directly.


• Obtain the existing critic from an agent using getCritic.
• Obtain the existing actor from an agent using getActor.


policy — Reinforcement learning policy


rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlAdditiveNoisePolicy | rlStochasticActorPolicy

Reinforcement learning policy, specified as one of the following objects:

• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy
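
For instance, a minimal sketch extracting the parameters from an epsilon-greedy policy built on a
small vector Q-value critic (the specifications and network layout below are assumptions chosen only
for illustration):

% Extract learnable parameters from a policy object.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([1 2 3]);
net = dlnetwork([featureInputLayer(4) fullyConnectedLayer(3)]);
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
policy = rlEpsilonGreedyPolicy(critic);

pars = getLearnableParameters(policy)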

Output Arguments
pars — Learnable parameters
cell array

Learnable parameter values for the function object, returned as a cell array. You can modify these
parameter values and set them in the original agent or a different agent using the
setLearnableParameters function.

Version History
Introduced in R2019a

getLearnableParameters now uses approximator objects instead of representation objects


Behavior changed in R2022a

Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, getLearnableParameters now uses function approximator
objects instead.

getLearnableParameterValues is now getLearnableParameters


Behavior changed in R2020a

getLearnableParameterValues is now getLearnableParameters. To update your code, change
the function name from getLearnableParameterValues to getLearnableParameters. The
syntaxes are equivalent.

See Also
setLearnableParameters | getActor | getCritic | setActor | setCritic

Topics
“Create Policies and Value Functions”
“Import Neural Network Models”


getMaxQValue
Obtain maximum estimated value over all possible actions from a Q-value function critic with discrete
action space, given environment observations

Syntax
[maxQ,maxActionIndex] = getMaxQValue(qValueFcnObj,obs)
[maxQ,maxActionIndex,state] = getMaxQValue( ___ )

Description
[maxQ,maxActionIndex] = getMaxQValue(qValueFcnObj,obs) evaluates the discrete-action-
space Q-value function critic qValueFcnObj and returns the maximum estimated value over all
possible actions maxQ, with the corresponding action index maxActionIndex, given environment
observations obs.

[maxQ,maxActionIndex,state] = getMaxQValue( ___ ) also returns the updated state of
qValueFcnObj when it contains a recurrent neural network.

Examples

Obtain Maximum Q-Value Function Estimates

Create observation and action specification objects (or alternatively use getObservationInfo
and getActionInfo to extract the specification objects from an environment). For this example,
define the observation space as a continuous three-dimensional space, and the action space as a finite
set consisting of three possible values (named -1, 0, and 1).
obsInfo = rlNumericSpec([3 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

Create a custom basis function to approximate the Q-value function within the critic, and define an
initial parameter vector.
myBasisFcn = @(myobs,myact) [ ...
ones(4,1);
myobs(:); myact;
myobs(:).^2; myact.^2;
sin(myobs(:)); sin(myact);
cos(myobs(:)); cos(myact) ];
W0 = rand(20,1);

Create the critic.


critic = rlQValueFunction({myBasisFcn,W0}, ...
obsInfo,actInfo);

Use getMaxQValue to return the maximum value, among the possible actions, given a random
observation. Also return the index corresponding to the action that maximizes the value.
[v,i] = getMaxQValue(critic,{rand(3,1)})


v = 9.0719

i = 3

Create a batch set of 64 random independent observations. The third dimension is the batch size,
while the fourth is the sequence length for any recurrent neural network used by the critic (in this
case not used).

batchobs = rand(3,1,64,1);

Obtain maximum values for all the observations.

bv = getMaxQValue(critic,{batchobs});
size(bv)

ans = 1×2

1 64

Select the maximum value corresponding to the 44th observation.

bv(44)

ans = 10.4138
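
The returned index refers to the position of the action in the action specification. For instance,
continuing the example above, a minimal sketch mapping the index back to the corresponding action
value using the critic and actInfo already defined:

% Map the returned action index back to an action value.
[maxQ,maxIdx] = getMaxQValue(critic,{rand(3,1)});
bestAction = actInfo.Elements(maxIdx)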

Input Arguments
qValueFcnObj — Q-value function critic
rlQValueFunction object | rlVectorQValueFunction object

Q-value function critic, specified as an rlQValueFunction or rlVectorQValueFunction object.

obs — Environment observations


cell array

Environment observations, specified as a cell array with as many elements as there are observation
input channels. Each element of obs contains an array of observations for a single observation input
channel.

The dimensions of each element in obs are MO-by-LB-by-LS, where:

• MO corresponds to the dimensions of the associated observation input channel.


• LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of observations,
specify LB > 1. If qValueFcnObj has multiple observation input channels, then LB must be the
same for all elements of obs.
• LS specifies the sequence length for a recurrent neural network. If qValueFcnObj does not use a
recurrent neural network, then LS = 1. If qValueFcnObj has multiple observation input channels,
then LS must be the same for all elements of obs.

LB and LS must be the same for both act and obs.

For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.


Output Arguments
maxQ — Maximum Q-value estimate
array

Maximum Q-value estimate across all possible discrete actions, returned as a 1-by-LB-by-LS array,
where:

• LB is the batch size.


• LS specifies the sequence length for a recurrent neural network. If qValueFcnObj does not use a
recurrent neural network, then LS = 1.

maxActionIndex — Action index


array

Action index corresponding to the maximum Q value, returned as a 1-by-LB-by-LS array, where:

• LB is the batch size.


• LS specifies the sequence length for a recurrent neural network. If qValueFcnObj does not use a
recurrent neural network, then LS = 1.

state — Updated state of the critic


cell array

Updated state of qValueFcnObj, returned as a cell array. If qValueFcnObj does not use a recurrent
neural network, then state is an empty cell array.

You can set the state of the critic to state using the setState function. For example:

qValueFcnObj = setState(qValueFcnObj,state);

Version History
Introduced in R2020a

See Also
getValue | evaluate | getAction

Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”


getModel
Package: rl.function

Get function approximator model from actor or critic

Syntax
model = getModel(fcnAppx)

Description
model = getModel(fcnAppx) returns the function approximation model used by the actor or critic
function object fcnAppx.

Examples

Modify Deep Neural Networks in Reinforcement Learning Agent

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.

Load the predefined environment.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Obtain observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.

agent = rlPPOAgent(obsInfo,actInfo);

To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.

actor = getActor(agent);
critic = getCritic(agent);

Extract the deep neural networks from both the actor and critic function approximators.

actorNet = getModel(actor);
criticNet = getModel(critic);

The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.

For example, view the actor network.


plot(layerGraph(actorNet))

To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)

You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))

In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.

To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.

For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.


createModifiedNetworks

Each of the modified networks includes an additional fullyConnectedLayer and reluLayer in
their main common path. View the modified actor network.

plot(layerGraph(modifiedActorNet))

After exporting the networks, insert the networks into the actor and critic function approximators.

actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);

Finally, insert the modified actor and critic function approximators into the agent.

agent = setActor(agent,actor);
agent = setCritic(agent,critic);

Input Arguments
fcnAppx — Actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object

Actor or critic function object, specified as one of the following:


• rlValueFunction object — Value function critic


• rlQValueFunction object — Q-value function critic
• rlVectorQValueFunction object — Multi-output Q-value function critic with a discrete action
space
• rlContinuousDeterministicActor object — Deterministic policy actor with a continuous
action space
• rlDiscreteCategoricalActor — Stochastic policy actor with a discrete action space
• rlContinuousGaussianActor object — Stochastic policy actor with a continuous action space

To create an actor or critic function object, use one of the following methods.

• Create a function object directly.


• Obtain the existing critic from an agent using getCritic.
• Obtain the existing actor from an agent using getActor.

Note For agents with more than one critic, such as TD3 and SAC agents, you must call getModel for
each critic representation individually, rather than calling getModel for the array returned by
getCritic.

critics = getCritic(myTD3Agent);
criticNet1 = getModel(critics(1));
criticNet2 = getModel(critics(2));

Output Arguments
model — Function approximation model
dlnetwork object | rlTable object | 1-by-2 cell array

Function approximation model, returned as one of the following:

• Deep neural network defined as a dlnetwork object


• rlTable object
• 1-by-2 cell array that contains the function handle for a custom basis function and the basis
function parameters
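
For instance, a minimal sketch of the rlTable case (the finite specifications below are assumptions
chosen only for illustration):

% For a table-based critic, getModel returns the underlying rlTable object.
obsInfo = rlFiniteSetSpec([1 2 3]);
actInfo = rlFiniteSetSpec([1 2]);
qTable = rlTable(obsInfo,actInfo);
critic = rlQValueFunction(qTable,obsInfo,actInfo);
model = getModel(critic)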

Version History
Introduced in R2020b

getModel now uses approximator objects instead of representation objects


Behavior changed in R2022a

Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, getModel now uses function approximator objects instead.

getModel returns a dlnetwork object


Behavior changed in R2021b


Starting from R2021b, built-in agents use dlnetwork objects as actor and critic representations, so
getModel returns a dlnetwork object.

• Due to numerical differences in the network calculations, previously trained agents might behave
differently. If this happens, you can retrain your agents.
• To use Deep Learning Toolbox™ functions that do not support dlnetwork, you must convert the
network to layerGraph. For example, to use deepNetworkDesigner, replace
deepNetworkDesigner(network) with deepNetworkDesigner(layerGraph(network)).

See Also
getActor | setActor | getCritic | setCritic | setModel | dlnetwork

Topics
“Create Policies and Value Functions”


getObservationInfo
Obtain observation data specifications from reinforcement learning environment, agent, or
experience buffer

Syntax
obsInfo = getObservationInfo(env)
obsInfo = getObservationInfo(agent)
obsInfo = getObservationInfo(buffer)

Description
obsInfo = getObservationInfo(env) extracts observation information from reinforcement
learning environment env.

obsInfo = getObservationInfo(agent) extracts observation information from reinforcement
learning agent agent.

obsInfo = getObservationInfo(buffer) extracts observation information from experience
buffer buffer.

Examples

Extract Action and Observation Information from Reinforcement Learning Environment

Extract action and observation information that you can use to create other environments or agents.

The reinforcement learning environment for this example is the simple longitudinal dynamics for ego
car and lead car. The training goal is to make the ego car travel at a set velocity while maintaining a
safe distance from the lead car by controlling longitudinal acceleration (and braking). This example uses
the same vehicle model as the “Adaptive Cruise Control System Using Model Predictive Control”
(Model Predictive Control Toolbox) example.

Open the model and create the reinforcement learning environment.

mdl = 'rlACCMdl';
open_system(mdl);
agentblk = [mdl '/RL Agent'];
% create the observation info
obsInfo = rlNumericSpec([3 1],'LowerLimit',-inf*ones(3,1),'UpperLimit',inf*ones(3,1));
obsInfo.Name = 'observations';
obsInfo.Description = 'information on velocity error and ego velocity';
% action Info
actInfo = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
actInfo.Name = 'acceleration';
% define environment
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo)

env =
SimulinkEnvWithAgent with properties:


Model : rlACCMdl
AgentBlock : rlACCMdl/RL Agent
ResetFcn : []
UseFastRestart : on

The reinforcement learning environment env is a SimulinkEnvWithAgent object with the above
properties.

Extract the action and observation information from the reinforcement learning environment env.

actInfoExt = getActionInfo(env)

actInfoExt =
rlNumericSpec with properties:

LowerLimit: -3
UpperLimit: 2
Name: "acceleration"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"

obsInfoExt = getObservationInfo(env)

obsInfoExt =
rlNumericSpec with properties:

LowerLimit: [3x1 double]


UpperLimit: [3x1 double]
Name: "observations"
Description: "information on velocity error and ego velocity"
Dimension: [3 1]
DataType: "double"

The action information contains acceleration values while the observation information contains the
velocity and velocity error values of the ego vehicle.

Input Arguments
env — Reinforcement learning environment
rlFunctionEnv object | SimulinkEnvWithAgent object | rlNeuralNetworkEnvironment object
| predefined MATLAB environment object

Reinforcement learning environment from which to extract the observation information, specified as
one of the following objects.

• MATLAB environment represented as one of the following objects.

• rlFunctionEnv
• rlNeuralNetworkEnvironment
• Predefined MATLAB environment created using rlPredefinedEnv


• Simulink environment represented as a SimulinkEnvWithAgent object.

For more information on reinforcement learning environments, see “Create MATLAB Reinforcement
Learning Environments” and “Create Simulink Reinforcement Learning Environments”.

agent — Reinforcement learning agent


rlQAgent object | rlSARSAAgent object | rlDQNAgent object | rlPGAgent object | rlDDPGAgent
object | rlTD3Agent object | rlACAgent object | rlPPOAgent object | rlTRPOAgent object |
rlSACAgent object | rlMBPOAgent object

Reinforcement learning agent from which to extract the observation information, specified as one of
the following objects.

• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlPGAgent
• rlACAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent
• rlMBPOAgent

For more information on reinforcement learning agents, see “Reinforcement Learning Agents”.
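
For instance, a minimal sketch extracting the specifications from a default agent created for one of
the predefined MATLAB environments (the agent type and environment name are chosen only for
illustration):

% Extract observation specifications from an agent rather than an environment.
env = rlPredefinedEnv("CartPole-Discrete");
agent = rlDQNAgent(getObservationInfo(env),getActionInfo(env));
obsInfoFromAgent = getObservationInfo(agent)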

buffer — Experience buffer


rlReplayMemory object | rlPrioritizedReplayMemory object

Experience buffer from which to extract the observation information, specified as an


rlReplayMemory or rlPrioritizedReplayMemory object.

Output Arguments
obsInfo — Observation data specifications
array of rlNumericSpec objects | array of rlFiniteSetSpec objects

Observation data specifications extracted from the reinforcement learning environment, returned as
an array of one of the following:

• rlNumericSpec objects
• rlFiniteSetSpec objects
• A mix of rlNumericSpec and rlFiniteSetSpec objects

Version History
Introduced in R2019a


See Also
rlNumericSpec | rlFiniteSetSpec | getActionInfo | rlQAgent | rlSARSAAgent |
rlDQNAgent | rlPGAgent | rlACAgent | rlDDPGAgent

Topics
“Create Simulink Reinforcement Learning Environments”
“Reinforcement Learning Agents”


getValue
Obtain estimated value from a critic given environment observations and actions

Syntax
value = getValue(valueFcnAppx,obs)

value = getValue(vqValueFcnAppx,obs)
value = getValue(qValueFcnAppx,obs,act)

[value,state] = getValue( ___ )

Description
Value Function Critic

value = getValue(valueFcnAppx,obs) evaluates the value function critic valueFcnAppx and
returns the value corresponding to the observation obs. In this case, valueFcnAppx is an
rlValueFunction approximator object.
Q-Value Function Critics

value = getValue(vqValueFcnAppx,obs) evaluates the discrete-action-space Q-value function
critic vqValueFcnAppx and returns the vector value, in which each element represents the
estimated value given the state corresponding to the observation obs and the action corresponding to
the element number of value. In this case, vqValueFcnAppx is an rlVectorQValueFunction
approximator object.

value = getValue(qValueFcnAppx,obs,act) evaluates the Q-value function critic
qValueFcnAppx and returns the scalar value, representing the value given the observation obs and
action act. In this case, qValueFcnAppx is an rlQValueFunction approximator object.
Return Recurrent Neural Network State

[value,state] = getValue( ___ ) also returns the updated state of the critic object when it
contains a recurrent neural network.

Examples

Obtain Value Function Estimates

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.

obsInfo = rlNumericSpec([4 1]);

To approximate the value function within the critic, create a neural network. Define a single path
from the network input (the observation) to its output (the value), as an array of layer objects.


net = [ featureInputLayer(4) ...
        fullyConnectedLayer(1)];

Convert the network to a dlnetwork object and display the number of weights.

net = dlnetwork(net);
summary(net);

Initialized: true

Number of learnables: 5

Inputs:
1 'input' 4 features

Create a critic using the network and the observation specification object. When you use this syntax
the network input layer is automatically associated with the environment observation according to
the dimension specifications in obsInfo.

critic = rlValueFunction(net,obsInfo);

Obtain a value function estimate for a random single observation. Use an observation array with the
same dimensions as the observation specification.

val = getValue(critic,{rand(4,1)})

val = single
0.7904

You can also obtain value function estimates for a batch of observations. For example, obtain value
function estimates for a batch of 20 observations.

batchVal = getValue(critic,{rand(4,1,20)});
size(batchVal)

ans = 1×2

1 20

batchVal contains one value function estimate for each observation in the batch.

Obtain Vector Q-Value Function Estimates

Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
the observation space as a continuous four-dimensional space, so that a single observation is a
column vector containing four doubles, and the action space as a finite set consisting of three
possible values (named 7, 5, and 3 in this case).

obsInfo = rlNumericSpec([4 1]);


actInfo = rlFiniteSetSpec([7 5 3]);

To approximate the Q-value function within the critic, create a neural network. Define a single path
from the network input to its output as an array of layer objects. The input of the network must accept a
four-element vector, as defined by obsInfo. The output must be a single output layer having as many
elements as the number of possible discrete actions (three in this case, as defined by actInfo).

net = [featureInputLayer(4)
fullyConnectedLayer(3)];

Convert the network to a dlnetwork object and display the number of weights.

net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 15

Inputs:
1 'input' 4 features

Create the critic using the network, as well as the observation and action specification objects. The
network input layer is automatically associated with the environment observation channel according
to the dimension specifications in obsInfo.

critic = rlVectorQValueFunction(net,obsInfo,actInfo);

Use getValue to return the values of a random observation, using the current network weights.

v = getValue(critic,{rand(obsInfo.Dimension)})

v = 3x1 single column vector

0.7232
0.8177
-0.2212

v contains three value function estimates, one for each possible discrete action.

You can also obtain value function estimates for a batch of observations. For example, obtain value
function estimates for a batch of 10 observations.

batchV = getValue(critic,{rand([obsInfo.Dimension 10])});


size(batchV)

ans = 1×2

3 10

batchV contains three value function estimates for each observation in the batch.

Obtain Single-Output Q-Value Function Estimates

Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example,
define the observation space as having two continuous channels, the first one carrying an 8-by-3
matrix, and the second one a continuous four-dimensional vector. The action specification is a
continuous column vector containing 2 doubles.

obsInfo = [rlNumericSpec([8 3]), rlNumericSpec([4 1])];


actInfo = rlNumericSpec([2 1]);

Create a custom basis function and its initial weight matrix.

myBasisFcn = @(obsA,obsB,act) [...
    ones(30,1);
    obsA(:); obsB(:); act(:);
    obsA(:).^2; obsB(:).^2; act(:).^2;
    sin(obsA(:)); sin(obsB(:)); sin(act(:));
    cos(obsA(:)); cos(obsB(:)); cos(act(:))];
W0 = rand(150,1);

The output of the critic is the scalar W'*myBasisFcn(obs,act), representing the Q-value function
to be approximated.

Create the critic.

critic = rlQValueFunction({myBasisFcn,W0}, ...
    obsInfo,actInfo);

Use getValue to return the value of a random observation-action pair, using the current parameter
matrix.

v = getValue(critic,{rand(8,3),(1:4)'},{rand(2,1)})

v = 68.8628

Create a random observation set of batch size 64 for each channel. The third dimension is the batch
size, while the fourth is the sequence length for any recurrent neural network used by the critic (in
this case not used).

batchobs_ch1 = rand(8,3,64,1);
batchobs_ch2 = rand(4,1,64,1);

Create a random action set of batch size 64.

batchact = rand(2,1,64,1);

Obtain the state-action value function estimate for the batch of observations and actions.

bv = getValue(critic,{batchobs_ch1,batchobs_ch2},{batchact});
size(bv)

ans = 1×2

1 64

bv(23)

ans = 46.6310


Input Arguments
valueFcnAppx — Value function critic
rlValueFunction object

Value function critic, specified as an rlValueFunction approximator object.

vqValueFcnAppx — Vector Q-value function critic


rlVectorQValueFunction object

Vector Q-value function critic, specified as an rlVectorQValueFunction approximator object.

qValueFcnAppx — Q-value function critic


rlQValueFunction object

Q-value function critic, specified as an rlQValueFunction object.

obs — Observations
cell array

Observations, specified as a cell array with as many elements as there are observation input
channels. Each element of obs contains an array of observations for a single observation input
channel.

The dimensions of each element in obs are MO-by-LB-by-LS, where:

• MO corresponds to the dimensions of the associated observation input channel.


• LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of observations,
specify LB > 1. If the critic object given as first input argument has multiple observation input
channels, then LB must be the same for all elements of obs.
• LS specifies the sequence length for a recurrent neural network. If the critic object given as first
input argument does not use a recurrent neural network, then LS = 1. If the critic object has
multiple observation input channels, then LS must be the same for all elements of obs.

LB and LS must be the same for both act and obs.

For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.

act — Action
single-element cell array

Action, specified as a single-element cell array that contains an array of action values.

The dimensions of this array are MA-by-LB-by-LS, where:

• MA corresponds to the dimensions of the associated action specification.


• LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of observations,
specify LB > 1.
• LS specifies the sequence length for a recurrent neural network. If the critic object given as a first
input argument does not use a recurrent neural network, then LS = 1.

LB and LS must be the same for both act and obs.


For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.

Output Arguments
value — Estimated value function
array

Estimated value function, returned as an array with dimensions N-by-LB-by-LS, where:

• N is the number of outputs of the critic network.

• For a state value critic (valueFcnAppx), N = 1.
• For a single-output state-action value function critic (qValueFcnAppx), N = 1.
• For a multi-output state-action value function critic (vqValueFcnAppx), N is the number of
discrete actions.
• LB is the batch size.
• LS is the sequence length for a recurrent neural network.

state — Updated state of critic


cell array

Updated state of the critic, returned as a cell array. If the critic does not use a recurrent neural
network, then state is an empty cell array.

You can set the state of the critic to state using the setState function. For example:

valueFcnAppx = setState(valueFcnAppx,state);

Tips
The more general function evaluate behaves, for critic objects, similarly to getValue except that
evaluate returns results inside a single-cell array.
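For example, continuing the example above, the following sketch (which assumes that critic, as well as the observation and action data, are still in the workspace) shows the relationship between the two calls.

obsA = rand(8,3);     % observation for the first channel
obsB = (1:4)';        % observation for the second channel
act  = rand(2,1);     % action

% getValue returns the estimated value as a numeric array
v = getValue(critic,{obsA,obsB},{act});

% evaluate returns the same estimate inside a single-element cell array
c = evaluate(critic,{obsA,obsB,act});
vFromEvaluate = c{1};   % equal to v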


Version History
Introduced in R2020a

See Also
evaluate | getAction | getMaxQValue

Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”


gradient
Package: rl.function

Evaluate gradient of function approximator object given observation and action input data

Syntax
grad = gradient(fcnAppx,'output-input',inData)
grad = gradient(fcnAppx,'output-parameters',inData)
grad = gradient(fcnAppx,lossFcn,inData,fcnData)

Description
grad = gradient(fcnAppx,'output-input',inData) evaluates the gradient of the sum of the
outputs of the function approximator object fcnAppx with respect to its inputs. It returns the value of
the gradient grad when the input of fcnAppx is inData.

grad = gradient(fcnAppx,'output-parameters',inData) evaluates the gradient of the sum
of the outputs of fcnAppx with respect to its parameters.

grad = gradient(fcnAppx,lossFcn,inData,fcnData) evaluates the gradient of a loss
function associated with the function handle lossFcn, with respect to the parameters of fcnAppx. The
last optional argument fcnData can contain additional inputs for the loss function.

Examples

Calculate Gradients for Continuous Gaussian Actor

Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space made of three channels. The first channel carries an observation from a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles. The second channel carries a discrete observation made of a two-dimensional row vector
that can take one of five different values. The third channel carries a discrete scalar observation that
can be either zero or one. Finally, the action space is a continuous four-dimensional space, so that a
single action is a column vector containing four doubles, each between -10 and 10.

obsInfo = [rlNumericSpec([3 1])
           rlFiniteSetSpec({[1 2],[3 4],[5 6],[7 8],[9 10]})
           rlFiniteSetSpec([0 1])];

actInfo = rlNumericSpec([4 1], ...
    UpperLimit= 10*ones(4,1), ...
    LowerLimit=-10*ones(4,1) );

To approximate the policy within the actor, use a recurrent deep neural network. For a continuous
Gaussian actor, the network must have two output layers (one for the mean values, the other for the
standard deviation values), each having as many elements as the dimension of the action space.


Create the network, defining each path as an array of layer objects. Use sequenceInputLayer as
the input layer and include an lstmLayer as one of the other network layers. Also use a softplus
layer to enforce nonnegativity of the standard deviations and a scaling layer to scale the mean values
to the desired output range. Get the dimensions of the observation and action spaces from the
environment specification objects, and specify a name for the input layers, so you can later explicitly
associate them with the appropriate environment channel.

inPath1 = [ sequenceInputLayer( ...


prod(obsInfo(1).Dimension), ...
Name="netObsIn1")
fullyConnectedLayer(prod(actInfo.Dimension), ...
Name="infc1") ];

inPath2 = [ sequenceInputLayer( ...


prod(obsInfo(2).Dimension), ...
Name="netObsIn2")
fullyConnectedLayer( ...
prod(actInfo.Dimension), ...
Name="infc2") ];

inPath3 = [ sequenceInputLayer( ...


prod(obsInfo(3).Dimension), ...
Name="netObsIn3")
fullyConnectedLayer( ...
prod(actInfo.Dimension), ...
Name="infc3") ];

% Concatenate inputs along the first available dimension


jointPath = [ concatenationLayer(1,3,Name="cat")
tanhLayer(Name="tanhJnt");
lstmLayer(8,OutputMode="sequence",Name="lstm")
fullyConnectedLayer( ...
prod(actInfo.Dimension), ...
Name="jntfc");
];

% Path layers for mean value


% Using scalingLayer to scale range from (-1,1) to (-10,10)
meanPath = [ tanhLayer(Name="tanhMean");
fullyConnectedLayer(prod(actInfo.Dimension));
scalingLayer(Name="scale", ...
Scale=actInfo.UpperLimit) ];

% Path layers for standard deviations


% Using softplus layer to make them nonnegative
sdevPath = [ tanhLayer(Name="tanhStdv");
fullyConnectedLayer(prod(actInfo.Dimension));
softplusLayer(Name="splus") ];

% Add layers to network object


net = layerGraph;
net = addLayers(net,inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,inPath3);
net = addLayers(net,jointPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);


% Connect layers
net = connectLayers(net,"infc1","cat/in1");
net = connectLayers(net,"infc2","cat/in2");
net = connectLayers(net,"infc3","cat/in3");
net = connectLayers(net,"jntfc","tanhMean/in");
net = connectLayers(net,"jntfc","tanhStdv/in");

% Plot network
plot(net)

% Convert to dlnetwork
net = dlnetwork(net);

% Display the number of weights


summary(net)

Initialized: true

Number of learnables: 784

Inputs:
1 'netObsIn1' Sequence input with 3 dimensions
2 'netObsIn2' Sequence input with 2 dimensions
3 'netObsIn3' Sequence input with 1 dimensions


Create the actor with rlContinuousGaussianActor, using the network, the observations and
action specification objects, as well as the names of the network input layer and the options object.

actor = rlContinuousGaussianActor(net, obsInfo, actInfo, ...


ActionMeanOutputNames="scale",...
ActionStandardDeviationOutputNames="splus",...
ObservationInputNames=["netObsIn1","netObsIn2","netObsIn3"]);

To return the mean values and standard deviations of the Gaussian distribution as a function of the
current observation, use evaluate.

[prob,state] = evaluate(actor, {rand([obsInfo(1).Dimension 1 1]) , ...


rand([obsInfo(2).Dimension 1 1]) , ...
rand([obsInfo(3).Dimension 1 1]) });

The result is a cell array with two elements, the first one containing a vector of mean values, and the
second containing a vector of standard deviations.

prob{1}

ans = 4x1 single column vector

-1.5454
0.4908
-0.1697
0.8081

prob{2}

ans = 4x1 single column vector

0.6913
0.6141
0.7291
0.6475

To return an action sampled from the distribution, use getAction.

act = getAction(actor, {rand(obsInfo(1).Dimension) , ...


rand(obsInfo(2).Dimension) , ...
rand(obsInfo(3).Dimension) });
act{1}

ans = 4x1 single column vector

-3.2003
-0.0534
-1.0700
-0.4032

Calculate the gradients of the sum of the outputs (all the mean values plus all the standard
deviations) with respect to the inputs, given a random observation.

gro = gradient(actor,"output-input", ...


{rand(obsInfo(1).Dimension) , ...

2-119
2 Functions

rand(obsInfo(2).Dimension) , ...
rand(obsInfo(3).Dimension)} )

gro=3×1 cell array


{3x1 single}
{2x1 single}
{[ 0.1311]}

The result is a cell array with as many elements as the number of input channels. Each element
contains the derivatives of the sum of the outputs with respect to each component of the input
channel. Display the gradient with respect to the element of the second channel.

gro{2}

ans = 2x1 single column vector

-1.3404
0.6642

Obtain the gradient with respect to a batch of five independent sequences, each one made of nine
sequential observations.

gro_batch = gradient(actor,"output-input", ...


{rand([obsInfo(1).Dimension 5 9]) , ...
rand([obsInfo(2).Dimension 5 9]) , ...
rand([obsInfo(3).Dimension 5 9])} )

gro_batch=3×1 cell array


{3x5x9 single}
{2x5x9 single}
{1x5x9 single}

Display the derivative of the sum of the outputs with respect to the third observation element of the
first input channel, after the seventh sequential observation in the fourth independent batch.

gro_batch{1}(3,4,7)

ans = single
0.2020

Set the option to accelerate the gradient computations.

actor = accelerate(actor,true);

Calculate the gradients of the sum of the outputs with respect to the parameters, given a random
observation.

grp = gradient(actor,"output-parameters", ...


{rand(obsInfo(1).Dimension) , ...
rand(obsInfo(2).Dimension) , ...
rand(obsInfo(3).Dimension)} )

grp=15×1 cell array


{ 4x3 single}
{ 4x1 single}
{ 4x2 single}

{ 4x1 single}
{ 4x1 single}
{ 4x1 single}
{32x12 single}
{32x8 single}
{32x1 single}
{ 4x8 single}
{ 4x1 single}
{ 4x4 single}
{ 4x1 single}
{ 4x4 single}
{ 4x1 single}

Each array within a cell contains the gradient of the sum of the outputs with respect to a group of
parameters.

Calculate the gradients of the sum of the outputs with respect to the parameters, this time using a
batch of five independent sequences, each made of nine sequential observations.

grp_batch = gradient(actor,"output-parameters", ...
    {rand([obsInfo(1).Dimension 5 9]) , ...
     rand([obsInfo(2).Dimension 5 9]) , ...
     rand([obsInfo(3).Dimension 5 9])} )

grp_batch=15×1 cell array


{ 4x3 single}
{ 4x1 single}
{ 4x2 single}
{ 4x1 single}
{ 4x1 single}
{ 4x1 single}
{32x12 single}
{32x8 single}
{32x1 single}
{ 4x8 single}
{ 4x1 single}
{ 4x4 single}
{ 4x1 single}
{ 4x4 single}
{ 4x1 single}

If you use a batch of inputs, gradient uses the whole input sequence (in this case nine steps), and
all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.
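As a quick check (assuming actor and grp_batch from this example are still in the workspace), you can compare the number of elements in the gradient with the number of learnable parameter groups.

params = getLearnableParameters(actor);
[numel(grp_batch) numel(params)]   % both are 15 for this network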

Calculate Gradients for Vector Q-Value Function

Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space made of two channels. The first channel carries an observation from a
continuous four-dimensional space. The second carries a discrete scalar observation that can be
either zero or one. Finally, the action space consists of a scalar that can be -1, 0, or 1.
obsInfo = [rlNumericSpec([4 1])
rlFiniteSetSpec([0 1])];


actInfo = rlFiniteSetSpec([-1 0 1]);

To approximate the vector Q-value function within the critic, use a recurrent deep neural network.
The output layer must have three elements, each one expressing the value of executing the
corresponding action, given the observation.

Create the neural network, defining each network path as an array of layer objects. Get the
dimensions of the observation and action spaces from the environment specification objects, use
sequenceInputLayer as the input layer, and include an lstmLayer as one of the other network
layers.

inPath1 = [ sequenceInputLayer( ...


prod(obsInfo(1).Dimension), ...
Name="netObsIn1")
fullyConnectedLayer( ...
prod(actInfo.Dimension), ...
Name="infc1") ];

inPath2 = [ sequenceInputLayer( ...


prod(obsInfo(2).Dimension), ...
Name="netObsIn2")
fullyConnectedLayer( ...
prod(actInfo.Dimension), ...
Name="infc2") ];

% Concatenate inputs along first available dimension


jointPath = [ concatenationLayer(1,2,Name="cct")
tanhLayer(Name="tanhJnt")
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(prod(numel(actInfo.Elements))) ];

% Add layers to network object


net = layerGraph;
net = addLayers(net,inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,jointPath);

% Connect layers
net = connectLayers(net,"infc1","cct/in1");
net = connectLayers(net,"infc2","cct/in2");

% Plot network
plot(net)


% Convert to dlnetwork
net = dlnetwork(net);

% Display the number of weights


summary(net)

Initialized: true

Number of learnables: 386

Inputs:
1 'netObsIn1' Sequence input with 4 dimensions
2 'netObsIn2' Sequence input with 1 dimensions

Create the critic with rlVectorQValueFunction, using the network and the observation and action
specification objects.

critic = rlVectorQValueFunction(net,obsInfo,actInfo);

To return the value of the actions as a function of the current observation, use getValue or
evaluate.

val = evaluate(critic, ...


{rand(obsInfo(1).Dimension), ...
rand(obsInfo(2).Dimension)})


val = 1x1 cell array


{3x1 single}

When you use evaluate, the result is a single-element cell array, containing a vector with the values
of all the possible actions, given the observation.

val{1}

ans = 3x1 single column vector

-0.0054
-0.0943
0.0177

Calculate the gradients of the sum of the outputs with respect to the inputs, given a random
observation.

gro = gradient(critic,"output-input", ...


{rand(obsInfo(1).Dimension) , ...
rand(obsInfo(2).Dimension) } )

gro=2×1 cell array


{4x1 single}
{[ -0.0396]}

The result is a cell array with as many elements as the number of input channels. Each element
contains the derivative of the sum of the outputs with respect to each component of the input
channel. Display the gradient with respect to the element of the second channel.

gro{2}

ans = single
-0.0396

Obtain the gradient with respect to a batch of five independent sequences, each one made of nine
sequential observations.

gro_batch = gradient(critic,"output-input", ...


{rand([obsInfo(1).Dimension 5 9]) , ...
rand([obsInfo(2).Dimension 5 9]) } )

gro_batch=2×1 cell array


{4x5x9 single}
{1x5x9 single}

Display the derivative of the sum of the outputs with respect to the third observation element of the
first input channel, after the seventh sequential observation in the fourth independent batch.

gro_batch{1}(3,4,7)

ans = single
0.0443

Set the option to accelerate the gradient computations.


critic = accelerate(critic,true);

Calculate the gradients of the sum of the outputs with respect to the parameters, given a random
observation.

grp = gradient(critic,"output-parameters", ...


{rand(obsInfo(1).Dimension) , ...
rand(obsInfo(2).Dimension) } )

grp=9×1 cell array


{[-0.0101 -0.0097 -0.0039 -0.0065]}
{[ -0.0122]}
{[ -0.0078]}
{[ -0.0863]}
{32x2 single }
{32x8 single }
{32x1 single }
{ 3x8 single }
{ 3x1 single }

Each array within a cell contains the gradient of the sum of the outputs with respect to a group of
parameters.

Calculate the gradients of the sum of the outputs with respect to the parameters, this time using a
batch of five independent sequences, each made of nine sequential observations.

grp_batch = gradient(critic,"output-parameters", ...
    {rand([obsInfo(1).Dimension 5 9]) , ...
     rand([obsInfo(2).Dimension 5 9]) } )

grp_batch=9×1 cell array


{[-1.5801 -1.2618 -2.2168 -1.3463]}
{[ -3.7742]}
{[ -6.2158]}
{[ -12.6841]}
{32x2 single }
{32x8 single }
{32x1 single }
{ 3x8 single }
{ 3x1 single }

If you use a batch of inputs, gradient uses the whole input sequence (in this case nine steps), and
all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.

Input Arguments
fcnAppx — Function approximator object
function approximator object

Function approximator object, specified as one of the following:

• rlValueFunction,
• rlQValueFunction,
• rlVectorQValueFunction,


• rlDiscreteCategoricalActor,
• rlContinuousDeterministicActor,
• rlContinuousGaussianActor,
• rlContinuousDeterministicTransitionFunction,
• rlContinuousGaussianTransitionFunction,
• rlContinuousDeterministicRewardFunction,
• rlContinuousGaussianRewardFunction,
• rlIsDoneFunction.

inData — Input data for the function approximator


cell array

Input data for the function approximator, specified as a cell array with as many elements as the
number of input channels of fcnAppx. In the following section, the number of observation channels is
indicated by NO.

• If fcnAppx is an rlQValueFunction, an
rlContinuousDeterministicTransitionFunction or an
rlContinuousGaussianTransitionFunction object, then each of the first NO elements of
inData must be a matrix representing the current observation from the corresponding
observation channel. They must be followed by a final matrix representing the action.
• If fcnAppx is a function approximator object representing an actor or critic (but not an
rlQValueFunction object), inData must contain NO elements, each one being a matrix
representing the current observation from the corresponding observation channel.
• If fcnAppx is an rlContinuousDeterministicRewardFunction, an
rlContinuousGaussianRewardFunction, or an rlIsDoneFunction object, then each of the
first NO elements of inData must be a matrix representing the current observation from the
corresponding observation channel. They must be followed by a matrix representing the action,
and finally by NO elements, each one being a matrix representing the next observation from the
corresponding observation channel.

Each element of inData must be a matrix of dimension MC-by-LB-by-LS, where:

• MC corresponds to the dimensions of the associated input channel.


• LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of (independent)
observations, specify LB > 1. If inData has multiple elements, then LB must be the same for all
elements of inData.
• LS specifies the sequence length (along the time dimension) for a recurrent neural network. If
fcnAppx does not use a recurrent neural network (which is the case for environment function
approximators, as they do not support recurrent neural networks), then LS = 1. If inData has
multiple elements, then LS must be the same for all elements of inData.

For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Example: {rand(8,3,64,1),rand(4,1,64,1),rand(2,1,64,1)}

lossFcn — Loss function


function handle

Loss function, specified as a function handle to a user-defined function. The user defined function can
either be an anonymous function or a function on the MATLAB path. The first input parameter of the
function must be a cell array like the one returned from the evaluation of fcnAppx. For more
information, see the description of outData in evaluate. The second, optional, input argument of
lossFcn contains additional data that might be needed for the gradient calculation, as described
below in fcnData. For an example of the signature that this function must have, see “Train
Reinforcement Learning Policy Using Custom Training Loop”.
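
As an illustration, a custom loss function might look like the following sketch. Here the function name myLossFcn, the field name Target, and the squared-error criterion are assumptions made for this example only.

function loss = myLossFcn(outData,lossData)
    % outData is the cell array obtained by evaluating fcnAppx
    % lossData is the additional data passed to gradient as fcnData
    loss = sum((outData{1} - lossData.Target).^2,"all");
end

You would then compute the gradient of this loss with respect to the parameters of the approximator with a call such as grad = gradient(fcnAppx,@myLossFcn,inData,lossData).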

fcnData — Additional input data for loss function


any MATLAB data type

Additional data for the loss function, specified as any MATLAB data type, typically a structure or cell
array. For an example see “Train Reinforcement Learning Policy Using Custom Training Loop”.

Output Arguments
grad — Value of the gradient
cell array

Value of the gradient, returned as a cell array.

When you calculate the gradient of the sum of the outputs with respect to the inputs of fcnAppx,
grad is a cell array in which each element contains the gradient of the sum of all the outputs with
respect to the corresponding input channel.

The numerical array in each cell has dimensions D-by-LB-by-LS, where:

• D corresponds to the dimensions of the input channel of fcnAppx.


• LB is the batch size (length of a batch of independent inputs).
• LS is the sequence length (length of the sequence of inputs along the time dimension) for a
recurrent neural network. If fcnAppx does not use a recurrent neural network (which is the case
of environment function approximators, as they do not support recurrent neural networks), then
LS = 1.

When you calculate the gradient of the outputs with respect to the parameters of fcnAppx, grad
is a cell array in which each element contains the gradient of the sum of outputs belonging to an
output channel with respect to the corresponding group of parameters. The gradient is calculated
using the whole history of LS inputs, and all the LB gradients with respect to the independent input
sequences are added together in grad. Therefore, grad always has the same size as the result from
getLearnableParameters.

For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.

Version History
Introduced in R2022a

See Also
evaluate | accelerate | getLearnableParameters | rlValueFunction | rlQValueFunction
| rlVectorQValueFunction | rlContinuousDeterministicActor |
rlDiscreteCategoricalActor | rlContinuousGaussianActor |
rlContinuousDeterministicTransitionFunction |

rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction

Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”


hyperbolicPenalty
Hyperbolic penalty value for a point with respect to a bounded region

Syntax
p = hyperbolicPenalty(x,xmin,xmax)
p = hyperbolicPenalty( ___ ,lambda,tau)

Description
p = hyperbolicPenalty(x,xmin,xmax) calculates the nonnegative (hyperbolic) penalty vector p
for the point x with respect to the region bounded by xmin and xmax. p has the same dimension as x.
This syntax uses the default values of 1 and 0.1 for the lambda and tau parameters of the
hyperbolic function, respectively.

p = hyperbolicPenalty( ___ ,lambda,tau) specifies both the lambda and tau parameters of
the hyperbolic function. If lambda is an empty matrix, its default value is used. Likewise, if tau is an
empty matrix or is omitted, its default value is used.
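
For example, to keep the default value of lambda while specifying tau, pass an empty matrix for lambda:

p = hyperbolicPenalty(4,-2,2,[],0.5);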

Examples

Calculate Hyperbolic Penalty for a Point

This example shows how to use the hyperbolicPenalty function to calculate the hyperbolic
penalty for a given point with respect to a bounded region.

Calculate the penalty value for the point 0.1 within the interval [-2,2], using default values for the
lambda and tau parameters.

hyperbolicPenalty(0.1,-2,2)

ans = 0.0050

Calculate the penalty value for the point 4 outside the interval [-2,2].

hyperbolicPenalty(4,-2,2)

ans = 4.0033

Calculate the penalty value for the point 0.1 within the interval [-2,2], using a lambda parameter of
5.

hyperbolicPenalty(0.1,-2,2,5)

ans = 0.0010

Calculate the penalty value for the point 4 outside the interval [-2,2], using a lambda parameter of 5.

hyperbolicPenalty(4,-2,2,5)

ans = 20.0007


Calculate the penalty value for the point 4 outside the interval [-2,2], using a tau parameter of 0.5.

hyperbolicPenalty(4,-2,2,5,0.5)

ans = 20.0167

Calculate the penalty value for the point [-2,0,4] with respect to the box defined by the intervals
[0,1], [-1,1], and [-2,2] along the x, y, and z dimensions, respectively, using the default value for
lambda and a tau parameter of 0.

hyperbolicPenalty([-2 0 4],[0 -1 -2],[1 1 2],1,0)

ans = 3×1

4
0
4

Visualize Penalty Values for an Interval

Create a vector of 1001 equidistant points distributed between -5 and 5.

x = -5:0.01:5;

Calculate penalties for all the points in the vector, using default values for the lambda and tau
parameters.

p = hyperbolicPenalty(x,-2,2);

Plot the points and add grid, labels and title.

plot(x,p)
grid
xlabel("point position");
ylabel("penalty value");
title("Penalty values over an interval");


Input Arguments
x — Point for which the penalty is calculated
scalar | vector | matrix

Point for which the penalty is calculated, specified as a numeric scalar, vector or matrix.
Example: [0.5; 1.6]

xmin — Lower bounds


scalar | vector | matrix

Lower bounds for x, specified as a numeric scalar, vector or matrix. To use the same minimum value
for all elements in x specify xmin as a scalar.
Example: -1

xmax — Upper bounds


scalar | vector | matrix

Upper bounds for x, specified as a numeric scalar, vector or matrix. To use the same maximum value
for all elements in x specify xmax as a scalar.
Example: 2

lambda — Lambda parameter of the hyperbolic function


1 (default) | nonnegative scalar


Lambda parameter of the hyperbolic function, specified as a scalar.


Example: 3

tau — Tau parameter of the hyperbolic function


0.1 (default) | nonnegative scalar

Tau parameter of the hyperbolic function, specified as a scalar.


Example: 0.3

Output Arguments
p — Penalty value
nonnegative vector

Penalty value, returned as a vector of nonnegative elements. Each element p_i depends on the position
of x_i with respect to the interval specified by xmin_i and xmax_i. The hyperbolic penalty function
returns the value:

p(x) = −λ(x − xmin) + √(λ²(x − xmin)² + τ²) − λ(xmax − x) + √(λ²(xmax − x)² + τ²)

Here, λ is the argument lambda, and τ is the argument tau. Note that for positive values of τ the
returned penalty value is always positive, because on the right side of the equation the magnitude of
the second term is always greater than that of the first, and the magnitude of the fourth term is
always greater than that of the third. If τ is zero, then the returned penalty is zero inside the interval
defined by the bounds, and it grows linearly with x outside this interval. If x is multidimensional, then
the calculation is applied independently on each dimension. Penalty functions are typically used to
generate negative rewards when constraints are violated, such as in generateRewardFunction.
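
As an illustrative check, the following sketch evaluates the formula directly for a scalar point and compares it with the function output (the numeric values are taken from the first example on this page):

x = 4; xmin = -2; xmax = 2; lambda = 1; tau = 0.1;
pManual = -lambda*(x-xmin) + sqrt(lambda^2*(x-xmin)^2 + tau^2) ...
          -lambda*(xmax-x) + sqrt(lambda^2*(xmax-x)^2 + tau^2);
pFcn = hyperbolicPenalty(x,xmin,xmax,lambda,tau);
% pManual and pFcn both evaluate to approximately 4.0033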

Version History
Introduced in R2021b

Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

See Also
Functions
generateRewardFunction | exteriorPenalty | barrierPenalty

Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Define Reward Signals”


inspectTrainingResult
Plot training information from a previous training session

Syntax
inspectTrainingResult(trainResults)

inspectTrainingResult(agentResults)

Description
By default, the train function shows the training progress and results in the Episode Manager
during training. If you configure training to not show the Episode Manager or you close the Episode
Manager after training, you can view the training results using the inspectTrainingResult
function, which opens the Episode Manager. You can also use inspectTrainingResult to view the
training results for agents saved during training.

inspectTrainingResult(trainResults) opens the Episode Manager and plots the training


results from a previous training session.

inspectTrainingResult(agentResults) opens the Episode Manager and plots the training


results from a previously saved agent structure.

Examples

View Results From Previous Training Session

For this example, assume that you have trained the agent in the “Train Reinforcement Learning
Agent in MDP Environment” example and subsequently closed the Episode Manager.

Load the training information returned by the train function.

load mdpTrainingStats trainingStats

Reopen the Episode Manager for this training session.

inspectTrainingResult(trainingStats)


View Training Results for Saved Agent

For this example, load the environment and agent for the “Train Reinforcement Learning Agent in
MDP Environment” example.

load mdpAgentAndEnvironment

Specify options for training the agent. Configure the SaveAgentCriteria and SaveAgentValue
options to save all agents after episode 30.

trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 50;
trainOpts.MaxEpisodes = 50;
trainOpts.Plots = "none";
trainOpts.SaveAgentCriteria = "EpisodeCount";
trainOpts.SaveAgentValue = 30;

Train the agent. During training, starting from episode 30, a copy of the agent is saved in the
savedAgents folder after each episode.

rng('default') % for reproducibility


trainingStats = train(qAgent,env,trainOpts);

Load the training results for one of the saved agents. This command loads both the agent and a
structure that contains the corresponding training results.


load savedAgents/Agent50

View the training results from the saved agent result structure.

inspectTrainingResult(savedAgentResult)

The Episode Manager shows the training progress up to the episode in which the agent was saved.

Input Arguments
trainResults — Training episode data
structure | structure array

Training episode data, specified as a structure or structure array returned by the train function.

agentResults — Saved agent results


structure

Saved agent results, specified as a structure previously saved by the train function. The train
function saves agents when you specify the SaveAgentCriteria and SaveAgentValue options in
the rlTrainingOptions object used during training.

When you load a saved agent, the agent and its training results are added to the MATLAB workspace
as saved_agent and savedAgentResultStruct, respectively. To plot the training data for this
agent, use the following command.

inspectTrainingResult(savedAgentResultStruct)

For multi-agent training, savedAgentResultStruct contains structure fields with training results
for all the trained agents.

Version History
Introduced in R2021a

See Also
Functions
train

Topics
“Train Reinforcement Learning Agents”


predict
Package: rl.function

Predict next observation, next reward, or episode termination given observation and action input data

Syntax
predNextObs = predict(tsnFcnAppx,obs,act)
predReward = predict(rwdFcnAppx,obs,act,nextObs)
predIsDone = predict(idnFcnAppx,obs,act,nextObs)

Description
predNextObs = predict(tsnFcnAppx,obs,act) evaluates the environment transition function
approximator object tsnFcnAppx and returns the predicted next observation predNextObs, given the
current observation obs and the action act.

predReward = predict(rwdFcnAppx,obs,act,nextObs) evaluates the environment reward
function approximator object rwdFcnAppx and returns the predicted reward predReward, given the
current observation obs, the action act, and the next observation nextObs.

predIsDone = predict(idnFcnAppx,obs,act,nextObs) evaluates the environment is-done function
approximator object idnFcnAppx and returns the predicted is-done status predIsDone, given the
current observation obs, the action act, and the next observation nextObs.

Examples

Predict Next Observation Using Continuous Gaussian Transition Function Approximator

Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, two
observation channels carry vectors in a four- and two-dimensional space, respectively. The action is a
continuous three-dimensional vector.

obsInfo = [rlNumericSpec([4 1],UpperLimit=10*ones(4,1));


rlNumericSpec([1 2],UpperLimit=20*ones(1,2)) ];

actInfo = rlNumericSpec([3 1]);

Create a deep neural network to use as an approximation model for the transition function approximator.
For a continuous Gaussian transition function approximator, the network must have two output layers
for each observation channel (one for the mean values, the other for the standard deviation values).

Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel.

% Input path layers from first observation channel
inPath1 = [ featureInputLayer( ...
                prod(obsInfo(1).Dimension), ...
                Name="netObsIn1")
            fullyConnectedLayer(5,Name="infc1") ];

% Input path layers from second observation channel


inPath2 = [ featureInputLayer( ...
prod(obsInfo(2).Dimension), ...
Name="netObsIn2")
fullyConnectedLayer(5,Name="infc2") ];

% Input path layers from action channel


inPath3 = [ featureInputLayer( ...
prod(actInfo(1).Dimension), ...
Name="netActIn")
fullyConnectedLayer(5,Name="infc3") ];

% Joint path layers, concatenate 3 inputs along first dimension


jointPath = [ concatenationLayer(1,3,Name="concat")
tanhLayer(Name="tanhJnt");
fullyConnectedLayer(10,Name="jntfc") ];

% Path layers for mean values of first predicted obs


% Using scalingLayer to scale range from (-1,1) to (-10,10)
% Note that scale vector must be a column vector
meanPath1 = [ tanhLayer(Name="tanhMean1");
fullyConnectedLayer(prod(obsInfo(1).Dimension));
scalingLayer(Name="scale1", ...
Scale=obsInfo(1).UpperLimit) ];

% Path layers for standard deviations first predicted obs


% Using softplus layer to make them non negative
sdevPath1 = [ tanhLayer(Name="tanhStdv1");
fullyConnectedLayer(prod(obsInfo(1).Dimension));
softplusLayer(Name="splus1") ];

% Path layers for mean values of second predicted obs


% Using scalingLayer to scale range from (-1,1) to (-20,20)
% Note that scale vector must be a column vector
meanPath2 = [ tanhLayer(Name="tanhMean2");
fullyConnectedLayer(prod(obsInfo(2).Dimension));
scalingLayer(Name="scale2", ...
Scale=obsInfo(2).UpperLimit(:)) ];

% Path layers for standard deviations second predicted obs


% Using softplus layer to make them non negative
sdevPath2 = [ tanhLayer(Name="tanhStdv2");
fullyConnectedLayer(prod(obsInfo(2).Dimension));
softplusLayer(Name="splus2") ];

% Add layers to network object


net = layerGraph;
net = addLayers(net,inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,inPath3);
net = addLayers(net,jointPath);
net = addLayers(net,meanPath1);
net = addLayers(net,sdevPath1);
net = addLayers(net,meanPath2);
net = addLayers(net,sdevPath2);

% Connect layers
net = connectLayers(net,"infc1","concat/in1");
net = connectLayers(net,"infc2","concat/in2");
net = connectLayers(net,"infc3","concat/in3");
net = connectLayers(net,"jntfc","tanhMean1/in");
net = connectLayers(net,"jntfc","tanhStdv1/in");
net = connectLayers(net,"jntfc","tanhMean2/in");
net = connectLayers(net,"jntfc","tanhStdv2/in");

% Plot network
plot(net)

% Convert to dlnetwork
net=dlnetwork(net);

% Display the number of weights


summary(net)

Initialized: true

Number of learnables: 352

Inputs:
1 'netObsIn1' 4 features
2 'netObsIn2' 2 features
3 'netActIn' 3 features


Create a continuous Gaussian transition function approximator object, specifying the names of all the
input and output layers.
tsnFcnAppx = rlContinuousGaussianTransitionFunction(...
net,obsInfo,actInfo,...
ObservationInputNames=["netObsIn1","netObsIn2"], ...
ActionInputNames="netActIn", ...
NextObservationMeanOutputNames=["scale1","scale2"], ...
NextObservationStandardDeviationOutputNames=["splus1","splus2"] );

Predict the next observation for a random observation and action.


predObs = predict(tsnFcnAppx, ...
{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)}, ...
{rand(actInfo(1).Dimension)})

predObs=1×2 cell array


{4x1 single} {[-24.9934 0.9501]}

Each element of the resulting cell array represents the prediction for the corresponding observation
channel.

To display the mean values and standard deviations of the Gaussian probability distribution for the
predicted observations, use evaluate.
predDst = evaluate(tsnFcnAppx, ...
{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension), ...
rand(actInfo(1).Dimension)})

predDst=1×4 cell array


{4x1 single} {2x1 single} {4x1 single} {2x1 single}

The result is a cell array in which the first and second element represent the mean values for the
predicted observations in the first and second channel, respectively. The third and fourth element
represent the standard deviations for the predicted observations in the first and second channel,
respectively.

Create Deterministic Reward Function and Predict Reward

Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the reward function, create a deep neural network. For this example, the network has
two input channels, one for the current action and one for the next observations. The single output
channel contains a scalar, which represents the value of the predicted reward.

Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specifications, and specify a name for the input layers, so you can
later explicitly associate them with the appropriate environment channel.


actionPath = featureInputLayer( ...
    actInfo.Dimension(1), ...
    Name="action");

nextStatePath = featureInputLayer( ...
    obsInfo.Dimension(1), ...
    Name="nextState");

commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC2")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu3")
fullyConnectedLayer(1,Name="reward")];

net = layerGraph(nextStatePath);
net = addLayers(net,actionPath);
net = addLayers(net,commonPath);

net = connectLayers(net,"nextState","concat/in1");
net = connectLayers(net,"action","concat/in2");

plot(net)

Create a dlnetwork object and display the number of weights.


net = dlnetwork(net);
summary(net);

Initialized: true

Number of learnables: 8.7k

Inputs:
1 'nextState' 4 features
2 'action' 1 features

Create a deterministic reward function approximator object.


rwdFcnAppx = rlContinuousDeterministicRewardFunction(...
net,obsInfo,actInfo,...
ActionInputNames="action", ...
NextObservationInputNames="nextState");

Using this reward function object, you can predict the next reward value based on the current action
and next observation. For example, predict the reward for a random action and next observation.
Since, for this example, only the action and the next observation influence the reward, use an empty
cell array for the current observation.
act = rand(actInfo.Dimension);
nxtobs = rand(obsInfo.Dimension);
reward = predict(rwdFcnAppx,{}, {act}, {nxtobs})

reward = single
0.1034

To predict the reward, you can also use evaluate.


reward_ev = evaluate(rwdFcnAppx, {act,nxtobs} )

reward_ev = 1x1 cell array


{[0.1034]}

Create Is-Done Function and Predict Termination

Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the is-done function, use a deep neural network. The network has one input channel
for the next observations. The single output channel is for the predicted termination signal.

Create the neural network as a vector of layer objects.


commonPath = [
    featureInputLayer( ...
        obsInfo.Dimension(1), ...
        Name="nextState")
    fullyConnectedLayer(64,Name="FC1")
    reluLayer(Name="CriticRelu1")
    fullyConnectedLayer(64,Name="FC3")
    reluLayer(Name="CriticCommonRelu2")
    fullyConnectedLayer(2,Name="isdone0")
    softmaxLayer(Name="isdone")];

net = layerGraph(commonPath);

plot(net)

Convert the network to a dlnetwork object and display the number of weights.

net = dlnetwork(net);
summary(net);

Initialized: true

Number of learnables: 4.6k

Inputs:
1 'nextState' 4 features

Create an is-done function approximator object.

isDoneFcnAppx = rlIsDoneFunction(...
net,obsInfo,actInfo,...
NextObservationInputNames="nextState");


Using this is-done function approximator object, you can predict the termination signal based on the
next observation. For example, predict the termination signal for a random next observation. Since
for this example the termination signal only depends on the next observation, use empty cell arrays
for the current action and observation inputs.
nxtobs = rand(obsInfo.Dimension);
predIsDone = predict(isDoneFcnAppx,{},{},{nxtobs})

predIsDone = 0

You can obtain the termination probability using evaluate.


predIsDoneProb = evaluate(isDoneFcnAppx,{nxtobs})

predIsDoneProb = 1x1 cell array


{2x1 single}

predIsDoneProb{1}

ans = 2x1 single column vector

0.5405
0.4595

The first number is the probability of obtaining a 0 (no termination predicted), the second one is the
probability of obtaining a 1 (termination predicted).

Input Arguments
tsnFcnAppx — Environment transition function approximator object
rlContinuousDeterministicTransitionFunction object |
rlContinuousGaussianTransitionFunction object

Environment transition function approximator object, specified as one of the following:

• rlContinuousDeterministicTransitionFunction object
• rlContinuousGaussianTransitionFunction object

rwdFcnAppx — Environment reward function


rlContinuousDeterministicRewardFunction object |
rlContinuousGaussianRewardFunction object | function handle

Environment reward function approximator object, specified as one of the following:

• rlContinuousDeterministicRewardFunction object
• rlContinuousGaussianRewardFunction object
• Function handle object. For more information about function handle objects, see “What Is a
Function Handle?”.

idnFcnAppx — Environment is-done function approximator object


rlIsDoneFunction object

Environment is-done function approximator object, specified as an rlIsDoneFunction object.


obs — Observations
cell array

Observations, specified as a cell array with as many elements as there are observation input
channels. Each element of obs contains an array of observations for a single observation input
channel.

The dimensions of each element in obs are MO-by-LB, where:

• MO corresponds to the dimensions of the associated observation input channel.


• LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of observations,
specify LB > 1. If the function approximator object has multiple observation input channels, then LB
must be the same for all elements of obs.

LB must be the same for both act and obs.

For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.

act — Action
single-element cell array

Action, specified as a single-element cell array that contains an array of action values.

The dimensions of this array are MA-by-LB, where:

• MA corresponds to the dimensions of the associated action specification.


• LB is the batch size. To specify a single observation, set LB = 1. To specify a batch of observations,
specify LB > 1.

LB must be the same for both act and obs.

For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.

nextObs — Next observation


cell array

Next observations, that is the observation following the action act from the observation obs,
specified as a cell array of the same dimension as obs.

Output Arguments
predNextObs — Predicted next observation
cell array

Predicted next observation, that is the observation predicted by the transition function approximator
tsnFcnAppx given the current observation obs and the action act, returned as a cell array of the
same dimension as obs.

predReward — Predicted reward
single


Predicted reward, that is the reward predicted by the reward function approximator rwdFcnAppx
given the current observation obs, the action act, and the following observation nextObs, returned
as a single.

predIsDone — Predicted is-done status
double

Predicted is-done episode status, that is the episode termination status predicted by the is-done
function approximator idnFcnAppx given the current observation obs, the action act, and the
following observation nextObs, returned as a double.

Note If fcnAppx is an rlContinuousDeterministicRewardFunction object, then evaluate
behaves identically to predict except that it returns results inside a single-cell array. If fcnAppx is
an rlContinuousDeterministicTransitionFunction object, then evaluate behaves
identically to predict. If fcnAppx is an rlContinuousGaussianTransitionFunction object,
then evaluate returns the mean value and standard deviation of the observation probability
distribution, while predict returns an observation sampled from this distribution. Similarly, for an
rlContinuousGaussianRewardFunction object, evaluate returns the mean value and standard
deviation of the reward probability distribution, while predict returns a reward sampled from this
distribution. Finally, if fcnAppx is an rlIsDoneFunction object, then evaluate returns the
probabilities of the termination status being false or true, respectively, while predict returns a
predicted termination status sampled with these probabilities.

Version History
Introduced in R2022a

See Also
Objects
rlNeuralNetworkEnvironment | rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | evaluate | accelerate | gradient

Topics
“Model-Based Policy Optimization Agents”


reset
Package: rl.policy

Reset environment, agent, experience buffer, or policy object

Syntax
initialObs = reset(env)

reset(agent)
agent = reset(agent)

resetPolicy = reset(policy)

reset(buffer)

Description
initialObs = reset(env) resets the specified MATLAB environment to an initial state and
returns the resulting initial observation value.

Do not use reset for Simulink environments, which are implicitly reset when running a new
simulation. Instead, customize the reset behavior using the ResetFcn property of the environment.
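For example, for a Simulink environment you might randomize an initial condition at the start of each simulation with a reset function similar to this sketch (the variable name theta0 and the model name myModel are illustrative assumptions):

env.ResetFcn = @(in) setVariable(in,"theta0",0.1*randn, ...
    Workspace="myModel");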

reset(agent) resets the specified agent. Resetting a built-in agent performs the following actions,
if applicable.

• Empty the experience buffer.
• Set the recurrent neural network states of the actor and critic networks to zero.
• Reset the states of any noise models used by the agent.

agent = reset(agent) also returns the reset agent as an output argument.

resetPolicy = reset(policy) returns the policy object resetPolicy in which any recurrent
neural network states are set to zero and any noise model states are set to their initial conditions.
This syntax has no effect if the policy object does not use a recurrent neural network and does not
have a noise model with state.

reset(buffer) resets the specified replay memory buffer by removing all the experiences.

Examples

Reset Environment

Create a reinforcement learning environment. For this example, create a continuous-time cart-pole
system.

env = rlPredefinedEnv("CartPole-Continuous");


Reset the environment and return the initial observation.

initialObs = reset(env)

initialObs = 4×1

0
0
0.0315
0

Reset Agent

Create observation and action specifications.

obsInfo = rlNumericSpec([4 1]);


actInfo = rlNumericSpec([1 1]);

Create a default DDPG agent using these specifications.

initOptions = rlAgentInitializationOptions(UseRNN=true);
agent = rlDDPGAgent(obsInfo,actInfo,initOptions);

Reset the agent.

agent = reset(agent);

Reset Experience Buffer

Create observation and action specifications.

obsInfo = rlNumericSpec([4 1]);


actInfo = rlNumericSpec([1 1]);

Create a replay memory experience buffer.

buffer = rlReplayMemory(obsInfo,actInfo,10000);

Add experiences to the buffer. For this example, add 20 random experiences.

for i = 1:20
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(4,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(1,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(4,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(20).IsDone = 1;

append(buffer,expBatch);

Reset and clear the buffer.


reset(buffer)

Reset Policy

Create observation and action specifications.


obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

To approximate the Q-value function within the critic, use a deep neural network. Create each
network path as an array of layer objects.
% Create Paths
obsPath = [featureInputLayer(4)
fullyConnectedLayer(1,Name="obsout")];

actPath = [featureInputLayer(1)
fullyConnectedLayer(1,Name="actout")];

comPath = [additionLayer(2,Name="add") ...
           fullyConnectedLayer(1)];

% Add Layers
net = layerGraph;
net = addLayers(net,obsPath);
net = addLayers(net,actPath);
net = addLayers(net,comPath);
net = connectLayers(net,"obsout","add/in1");
net = connectLayers(net,"actout","add/in2");

% Convert to dlnetwork object


net = dlnetwork(net);

% Diplay the number of weights


summary(net)

Initialized: true

Number of learnables: 9

Inputs:
1 'input' 4 features
2 'input_1' 1 features

Create an epsilon-greedy policy object using a Q-value function approximator.


critic = rlQValueFunction(net,obsInfo,actInfo);
policy = rlEpsilonGreedyPolicy(critic)

policy =
rlEpsilonGreedyPolicy with properties:

QValueFunction: [1x1 rl.function.rlQValueFunction]


ExplorationOptions: [1x1 rl.option.EpsilonGreedyExploration]
UseEpsilonGreedyAction: 1
EnableEpsilonDecay: 1

ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: -1

Reset the policy.


policy = reset(policy);

Input Arguments
env — Reinforcement learning environment
environment object | ...

Reinforcement learning environment, specified as one of the following objects.

• rlFunctionEnv — Environment defined using custom functions


• rlMDPEnv — Markov decision process environment
• rlNeuralNetworkEnvironment — Environment with deep neural network transition models
• Predefined environment created using rlPredefinedEnv
• Custom environment created from a template (rlCreateEnvTemplate)

agent — Reinforcement learning agent


reinforcement learning agent object | ...

Reinforcement learning agent, specified as one of the following objects.

• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• rlMBPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.

Note agent is a handle object, so it is reset whether it is returned as an output argument or not. For
more information about handle objects, see “Handle Object Behavior”.

buffer — Experience buffer


rlReplayMemory object | rlPrioritizedReplayMemory object

Experience buffer, specified as an rlReplayMemory or rlPrioritizedReplayMemory object.


policy — Reinforcement learning policy


rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlAdditiveNoisePolicy | rlStochasticActorPolicy

Reinforcement learning policy, specified as one of the following objects:

• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy

Output Arguments
initialObs — Initial environment observation
array | cell array

Initial environment observation after reset, returned as one of the following:

• Array with dimensions matching the observation specification for an environment with a single
observation channel.
• Cell array with length equal to the number of observation channel for an environment with
multiple observation channels. Each element of the cell array contains an array with dimensions
matching the corresponding element of the environment observation specifications.

resetPolicy — Reset policy
rlDeterministicActorPolicy object | rlAdditiveNoisePolicy object |
rlEpsilonGreedyPolicy object | rlMaxQPolicy object | rlStochasticActorPolicy object

Reset policy, returned as a policy object of the same type as policy, with any recurrent neural
network states set to zero and any noise model states set to their initial conditions.

agent — Reset agent


reinforcement learning agent object

Reset agent, returned as an agent object. Note that agent is a handle object. Therefore, if it contains
any recurrent neural network, its state is reset whether agent is returned as an output argument or
not. For more information about handle objects, see “Handle Object Behavior”.

Version History
Introduced in R2022a

See Also
runEpisode | setup | cleanup


resize
Package: rl.replay

Resize replay memory experience buffer

Syntax
resize(buffer,maxLength)

Description
resize(buffer,maxLength) resizes experience buffer buffer to have a maximum length of
maxLength.

• If maxLength is greater than or equal to the number of experiences stored in the buffer, then
buffer retains its stored experiences.
• If maxLength is less than the number of experiences stored in the buffer, then buffer retains
only the maxLength most recent experiences.
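
For example, the following sketch illustrates the second case (the specification objects and the maximum length values are illustrative):

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1]);
buffer = rlReplayMemory(obsInfo,actInfo,100);

% ... append experiences to buffer here ...

% Shrink the buffer; if more than 50 experiences are stored,
% only the 50 most recent ones are retained
resize(buffer,50)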

Examples

Resize Experience Buffer in Reinforcement Learning Agent

Create an environment for training the agent. For this example, load a predefined environment.

env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

Extract the observation and action specifications from the agent.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a DQN agent from the environment specifications.

agent = rlDQNAgent(obsInfo,actInfo);

By default, the agent uses an experience buffer with a maximum size of 10,000.

agent.ExperienceBuffer

ans =
rlReplayMemory with properties:

MaxLength: 10000
Length: 0

Increase the maximum size of the experience buffer to 20,000.

resize(agent.ExperienceBuffer,20000)


View the updated experience buffer.

agent.ExperienceBuffer

ans =
rlReplayMemory with properties:

MaxLength: 20000
Length: 0

Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object

Experience buffer, specified as an rlReplayMemory or rlPrioritizedReplayMemory object.

maxLength — Maximum buffer length


nonnegative integer

Maximum buffer length, specified as a nonnegative integer.

Version History
Introduced in R2022b

See Also
rlReplayMemory | rlPrioritizedReplayMemory


rlCreateEnvTemplate
Create custom reinforcement learning environment template

Syntax
rlCreateEnvTemplate(className)

Description
rlCreateEnvTemplate(className) creates and opens a MATLAB script that contains a template
class representing a reinforcement learning environment. The template class contains an
implementation of a simple cart-pole balancing environment. To define your custom environment,
modify this template class. For more information, see “Create Custom MATLAB Environment from
Template”.

Examples

Create Custom Environment Template File

This example shows how to create and open a template file for a reinforcement learning environment.

For this example, name the class myEnvClass.

rlCreateEnvTemplate("myEnvClass")

This function opens a MATLAB® script that contains the class. By default, this template class
describes a simple cart-pole environment.

Modify this template class, and save the file as myEnvClass.m.

Input Arguments
className — Name of environment class
string | character vector

Name of environment class, specified as a string or character vector. This name defines the name of
the class and the name of the MATLAB script.

Version History
Introduced in R2019a

See Also
Topics
“Create MATLAB Reinforcement Learning Environments”


rlDataLogger
Creates either a file logger object or a monitor logger object to log training data

Syntax
fileLgr = rlDataLogger()
monLgr = rlDataLogger(tpm)

Description
fileLgr = rlDataLogger() creates the FileLogger object fileLgr for logging training data to
disk.

monLgr = rlDataLogger(tpm) creates the MonitorLogger object monLgr for logging training
data to the TrainingProgressMonitor object tpm, and its associated window.

Examples

Log Data to Disk during Built-in Training

This example shows how to log data to disk when using train.

Create a FileLogger object using rlDataLogger.

logger = rlDataLogger();

Specify a directory to save logged data.

logger.LoggingOptions.LoggingDirectory = "myDataLog";

Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object. For a related example, see “Log
Training Data To Disk”.

logger.EpisodeFinishedFcn = @myEpisodeFinishedFcn;
logger.AgentStepFinishedFcn = @myAgentStepFinishedFcn;
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;

To train the agent, you can now call train, passing logger as an argument such as in the following
command.

trainResult = train(agent, env, trainOpts, Logger=logger);

While the training progresses, data will be logged to the specified directory, according to the rule
specified in the FileNameRule property of logger.LoggingOptions.

logger.LoggingOptions.FileNameRule

ans =
"loggedData<id>"


Example Logging Functions

Episode completion logging function.


function dataToLog = myEpisodeFinishedFcn(data)
dataToLog.Experience = data.Experience;
end

Agent step completion logging function.


function dataToLog = myAgentStepFinishedFcn(data)
dataToLog.State = getState(getExplorationPolicy(data.Agent));
end

Learn subroutine completion logging function.


function dataToLog = myAgentLearnFinishedFcn(data)
dataToLog.ActorLoss = data.ActorLoss;
dataToLog.CriticLoss = data.CriticLoss;
end

Log and Visualize Data with a Training Progress Monitor Object

This example shows how to log and visualize data to the window of a trainingProgressMonitor
object when using train.

Create a trainingProgressMonitor object. Creating the object also opens a window associated
with the object.
monitor = trainingProgressMonitor();

Create a MonitorLogger object using rlDataLogger.


logger = rlDataLogger(monitor);

Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object.
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;

To train the agent, you can now call train, passing logger as an argument such as in the following
command.
trainResult = train(agent, env, trainOpts, Logger=logger);

While the training progresses, data will be logged to the training monitor object, and visualized in the
associated window.

Note that only scalar data can be logged with a monitor logger object.

Example Logging Functions

Define a logging function that logs data periodically at the completion of the learning subroutine.
function dataToLog = myAgentLearnFinishedFcn(data)

if mod(data.AgentLearnCount, 2) == 0
dataToLog.ActorLoss = data.ActorLoss;
dataToLog.CriticLoss = data.CriticLoss;
else
dataToLog = [];
end

end

Log Data to Disk in a Custom Training Loop

This example shows how to log data to disk when training an agent using a custom training loop.

Create a FileLogger object using rlDataLogger.


flgr = rlDataLogger();

Set up the logger object. This operation initializes the object, performing setup tasks such as creating the directory in which to save the data files.
setup(flgr);

Within a custom training loop, you can now store data to the logger object memory and write data to
file.

For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10


% Store three random numbers in memory


% as elements of the variable "Context1"
for ct = 1:3
store(flgr, "Context1", rand, iter);
end

% Store a random number in memory


% as the variable "Context2"
store(flgr, "Context2", rand, iter);

% Write data to file every 4 iterations


if mod(iter,4)==0
write(flgr);
end

end

Clean up the logger object. This operation performs cleanup tasks, such as writing to file any data still in memory.

cleanup(flgr);

Input Arguments
tpm — Training progress monitor object
trainingProgressMonitor object

Training progress monitor object, specified as a trainingProgressMonitor object. This object is


used to track the progress of training, update information fields, record metric values, and produce
training plots for custom deep learning training loops. For more information, see “Monitor Custom
Training Loop Progress”.

Output Arguments
fileLgr — File logger object
FileLogger object

File logger object, returned as a FileLogger object.

monLgr — Monitor logger object


MonitorLogger object

Monitor logger object, returned as a MonitorLogger object.

Limitations
• Only scalar data is supported when logging data with a MonitorLogger object. The structure
returned by the callback functions must contain fields with scalar data.
• Resuming of training from a previous training result is not supported when logging data with a
MonitorLogger object.
• Logging data using the AgentStepFinishedFcn callback is not supported when training agents
in parallel with the train function.


Version History
Introduced in R2022b

See Also
Functions
train

Objects
FileLogger | MonitorLogger | trainingProgressMonitor

Topics
“Log Training Data To Disk”
“Monitor Custom Training Loop Progress”


rlOptimizer
Creates an optimizer object for actors and critics

Syntax
algobj = rlOptimizer
algobj = rlOptimizer(algOptions)

Description
Create an optimizer object that updates the learnable parameters of an actor or critic in a custom training loop.

algobj = rlOptimizer creates a default optimizer object. You can modify the object properties
using dot notation.

algobj = rlOptimizer(algOptions) creates an optimizer object with properties specified by the


optimizer options object algOptions.

Examples

Create Optimizer Object

Use rlOptimizer to create a default optimizer algorithm object to use for the training of an actor or
critic in a custom training loop.
myAlg = rlOptimizer

myAlg =
rlADAMOptimizer with properties:

GradientDecayFactor: 0.9000
SquaredGradientDecayFactor: 0.9990
Epsilon: 1.0000e-08
LearnRate: 0.0100
L2RegularizationFactor: 1.0000e-04
GradientThreshold: Inf
GradientThresholdMethod: "l2norm"

By default, the function returns an rlADAMOptimizer object with default options. You can use dot
notation to change some parameters.
myAlg.LearnRate = 0.1;

You can now create a structure and set its CriticOptimizer or ActorOptimizer field to myAlg.
When you call runEpisode, pass the structure as an input parameter. The runEpisode function can
then use the update method of myAlg to update the learnable parameters of your actor or critic.
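
A minimal sketch of that structure is shown below. Whether you pass it through the ProcessExperienceData argument of runEpisode or through your own training code depends on how your custom loop is organized, so treat the field usage here as an assumption rather than a required pattern.

% Group optimizer objects for the actor and critic (sketch only).
optimizers.ActorOptimizer  = myAlg;          % the Adam optimizer created above
optimizers.CriticOptimizer = rlOptimizer();  % a second, default optimizer

% Your experience-processing or training code can then call update on
% these objects to adjust the learnable parameters of the actor or critic.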


Create Optimizer Object Specifying an Option Set

Use rlOptimizerOptions to create an optimizer option object. Specify the algorithm as


"rmsprop" and set the learning rate to 0.2.
myOptions=rlOptimizerOptions( ...
Algorithm="rmsprop", ...
LearnRate=0.2);

Use rlOptimizer to create an optimizer algorithm object to use for the training of an actor or critic
in a custom training loop. Specify the optimizer option set myOptions as input parameter.
myAlg=rlOptimizer(myOptions)

myAlg =
rlRMSPropOptimizer with properties:

SquaredGradientDecayFactor: 0.9990
Epsilon: 1.0000e-08
LearnRate: 0.2000
L2RegularizationFactor: 1.0000e-04
GradientThreshold: Inf
GradientThresholdMethod: "l2norm"

The function returns an rlRMSPropOptimizer object with default options. You can use dot notation
to change some parameters.
myAlg.GradientThreshold = 2;

You can now create a structure and set its CriticOptimizer or ActorOptimizer field to myAlg.
When you call runEpisode, pass the structure as an input parameter. The runEpisode function can
then use the update method of myAlg to update the learnable parameters of your actor or critic.

Input Arguments
algOptions — Algorithm options object
default Adam option set (default) | rlOptimizerOptions object

Algorithm options object, specified as an rlOptimizerOptions object.


Example: rlOptimizerOptions(Algorithm="sgdm",LearnRate=0.2)

Output Arguments
algobj — Algorithm optimizer object
rlADAMOptimizer object | rlSGDMOptimizer object | rlRMSPropOptimizer object

Algorithm optimizer object, returned as an rlADAMOptimizer, rlSGDMOptimizer, or


rlRMSPropOptimizer object. The runEpisode function uses the update method of the returned
object to update the learnable parameters of an actor or critic.

Version History
Introduced in R2022a


See Also
Functions
rlOptimizerOptions

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlPredefinedEnv
Create a predefined reinforcement learning environment

Syntax
env = rlPredefinedEnv(keyword)

Description
env = rlPredefinedEnv(keyword) takes a predefined keyword keyword representing the
environment name to create a MATLAB or Simulink reinforcement learning environment env. The
environment env models the dynamics with which the agent interacts, generating rewards and
observations in response to agent actions.

Examples

Basic Grid World Reinforcement Learning Environment

Use the predefined 'BasicGridWorld' keyword to create a basic grid world reinforcement learning
environment.
env = rlPredefinedEnv('BasicGridWorld')

env =
rlMDPEnv with properties:

Model: [1x1 rl.env.GridWorld]


ResetFcn: []

Continuous Double Integrator Reinforcement Learning Environment

Use the predefined 'DoubleIntegrator-Continuous' keyword to create a continuous double


integrator reinforcement learning environment.
env = rlPredefinedEnv('DoubleIntegrator-Continuous')

env =
DoubleIntegratorContinuousAction with properties:

Gain: 1
Ts: 0.1000
MaxDistance: 5
GoalThreshold: 0.0100
Q: [2x2 double]
R: 0.0100
MaxForce: Inf
State: [2x1 double]


You can visualize the environment using the plot function and interact with it using the reset and
step functions.
plot(env)
observation = reset(env)

observation = 2×1

4
0

[observation,reward,isDone] = step(env,16)

observation = 2×1

4.0800
1.6000

reward = -16.5559

isDone = logical
0

Create Continuous Simple Pendulum Model Environment

Use the predefined 'SimplePendulumModel-Continuous' keyword to create a continuous simple


pendulum model reinforcement learning environment.
env = rlPredefinedEnv('SimplePendulumModel-Continuous')

env =
SimulinkEnvWithAgent with properties:

Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on


Input Arguments
keyword — Predefined keyword representing the environment name
'BasicGridWorld' | 'CartPole-Discrete' | 'DoubleIntegrator-Continuous' |
'SimplePendulumWithImage-Discrete' | 'SimplePendulumModel-Discrete' |
'SimplePendulumModel-Continuous' | 'CartPoleSimscapeModel-Continuous' | ...

Predefined keyword representing the environment name, specified as one of the following:

MATLAB Environment

• 'BasicGridWorld'
• 'CartPole-Discrete'
• 'CartPole-Continuous'
• 'DoubleIntegrator-Discrete'
• 'DoubleIntegrator-Continuous'
• 'SimplePendulumWithImage-Discrete'
• 'SimplePendulumWithImage-Continuous'
• 'WaterFallGridWorld-Stochastic'
• 'WaterFallGridWorld-Deterministic'

Simulink Environment

• 'SimplePendulumModel-Discrete'
• 'SimplePendulumModel-Continuous'
• 'CartPoleSimscapeModel-Discrete'
• 'CartPoleSimscapeModel-Continuous'
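
For example, you can pass any of these keywords directly to rlPredefinedEnv and then query the resulting environment for its specifications.

% Create the discrete cart-pole environment and inspect its
% observation and action specifications.
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);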

Output Arguments
env — MATLAB or Simulink environment object
rlMDPEnv object | CartPoleDiscreteAction object | CartPoleContinuousAction object |
DoubleIntegratorDiscreteAction object | DoubleIntegratorContinuousAction object |
SimplePendlumWithImageDiscreteAction object |
SimplePendlumWithImageContinuousAction object | SimulinkEnvWithAgent object

MATLAB or Simulink environment object, returned as one of the following:

• rlMDPEnv object, when you use one of the following keywords.

• 'BasicGridWorld'
• 'WaterFallGridWorld-Stochastic'
• 'WaterFallGridWorld-Deterministic'
• CartPoleDiscreteAction object, when you use the 'CartPole-Discrete' keyword.
• CartPoleContinuousAction object, when you use the 'CartPole-Continuous' keyword.
• DoubleIntegratorDiscreteAction object, when you use the 'DoubleIntegrator-
Discrete' keyword.


• DoubleIntegratorContinuousAction object, when you use the 'DoubleIntegrator-


Continuous' keyword.
• SimplePendlumWithImageDiscreteAction object, when you use the
'SimplePendulumWithImage-Discrete' keyword.
• SimplePendlumWithImageContinuousAction object, when you use the
'SimplePendulumWithImage-Continuous' keyword.
• SimulinkEnvWithAgent object, when you use one of the following keywords.

• 'SimplePendulumModel-Discrete'
• 'SimplePendulumModel-Continuous'
• 'CartPoleSimscapeModel-Discrete'
• 'CartPoleSimscapeModel-Continuous'

Version History
Introduced in R2019a

See Also
Topics
“Create MATLAB Reinforcement Learning Environments”
“Create Simulink Reinforcement Learning Environments”
“Load Predefined Control System Environments”
“Load Predefined Simulink Environments”


rlRepresentation
(Not recommended) Model representation for reinforcement learning agents

Note rlRepresentation is not recommended. Use rlValueRepresentation,


rlQValueRepresentation, rlDeterministicActorRepresentation, or
rlStochasticActorRepresentation instead. For more information, see “Compatibility
Considerations”.

Syntax
rep = rlRepresentation(net,obsInfo,'Observation',obsNames)
rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',
actNames)

tableCritic = rlRepresentation(tab)

critic = rlRepresentation(basisFcn,W0,obsInfo)
critic = rlRepresentation(basisFcn,W0,oaInfo)
actor = rlRepresentation(basisFcn,W0,obsInfo,actInfo)

rep = rlRepresentation( ___ ,repOpts)

Description
Use rlRepresentation to create a function approximator representation for the actor or critic of a
reinforcement learning agent. To do so, you specify the observation and action signals for the training
environment and options that affect the training of an agent that uses the representation. For more
information on creating representations, see “Create Policies and Value Functions”.

rep = rlRepresentation(net,obsInfo,'Observation',obsNames) creates a representation


for the deep neural network net. The observation names obsNames are the network input layer
names. obsInfo contains the corresponding observation specifications for the training environment.
Use this syntax to create a representation for a critic that does not require action inputs, such as a
critic for an rlACAgent or rlPGAgent agent.

rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',
actNames) creates a representation with action signals specified by the names actNames and
specification actInfo. Use this syntax to create a representation for any actor, or for a critic that
takes both observation and action as input, such as a critic for an rlDQNAgent or rlDDPGAgent
agent.

tableCritic = rlRepresentation(tab) creates a critic representation for the value table or Q


table tab. When you create a table representation, you specify the observation and action
specifications when you create tab.

critic = rlRepresentation(basisFcn,W0,obsInfo) creates a linear basis function


representation using the handle to a custom basis function basisFcn and initial weight vector W0.
obsInfo contains the corresponding observation specifications for the training environment. Use


this syntax to create a representation for a critic that does not require action inputs, such as a critic
for an rlACAgent or rlPGAgent agent.

critic = rlRepresentation(basisFcn,W0,oaInfo) creates a linear basis function


representation using the specification cell array oaInfo, where oaInfo = {obsInfo,actInfo}.
Use this syntax to create a representation for a critic that takes both observations and actions as
inputs, such as a critic for an rlDQNAgent or rlDDPGAgent agent.

actor = rlRepresentation(basisFcn,W0,obsInfo,actInfo) creates a linear basis function


representation using the specified observation and action specifications, obsInfo and actInfo,
respectively. Use this syntax to create a representation for an actor that takes observations as inputs
and generates actions.

rep = rlRepresentation( ___ ,repOpts) creates a representation using additional options that
specify learning parameters for the representation when you train an agent. Available options include
the optimizer used for training and the learning rate. Use rlRepresentationOptions to create the
options set repOpts. You can use this syntax with any of the previous input-argument combinations.

Examples

Create Actor and Critic Representations

Create an actor representation and a critic representation that you can use to define a reinforcement
learning agent such as an Actor Critic (AC) agent.

For this example, create actor and critic representations for an agent that can be trained against the
cart-pole environment described in “Train AC Agent to Balance Cart-Pole System”. First, create the
environment. Then, extract the observation and action specifications from the environment. You need
these specifications to define the agent and critic representations.

env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

For a state-value-function critic such as those used for AC or PG agents, the inputs are the
observations and the output should be a scalar value, the state value. For this example, create the
critic representation using a deep neural network with one output, and with observation signals
corresponding to x,xdot,theta,thetadot as described in “Train AC Agent to Balance Cart-Pole
System”. You can obtain the number of observations from the obsInfo specification. Name the
network layer input 'observation'.
numObservation = obsInfo.Dimension(1);
criticNetwork = [
imageInputLayer([numObservation 1 1],'Normalization','none','Name','observation')
fullyConnectedLayer(1,'Name','CriticFC')];

Specify options for the critic representation using rlRepresentationOptions. These options
control parameters of critic network learning, when you train an agent that incorporates the critic
representation. For this example, set the learning rate to 0.05 and the gradient threshold to 1.
repOpts = rlRepresentationOptions('LearnRate',5e-2,'GradientThreshold',1);

Create the critic representation using the specified neural network and options. Also, specify the
action and observation information for the critic. Set the observation name to 'observation',
which is the name you used when you created the network input layer for criticNetwork.


critic = rlRepresentation(criticNetwork,obsInfo,'Observation',{'observation'},repOpts)

critic =
rlValueRepresentation with properties:

Options: [1x1 rl.option.rlRepresentationOptions]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: {1x0 cell}

Similarly, create a network for the actor. An AC agent decides which action to take given observations
using an actor representation. For an actor, the inputs are the observations, and the output depends
on whether the action space is discrete or continuous. For the actor of this example, there are two
possible discrete actions, –10 or 10. Thus, to create the actor, use a deep neural network with the
same observation input as the critic, that can output these two values. You can obtain the number of
actions from the actInfo specification. Name the output 'action'.
numAction = numel(actInfo.Elements);
actorNetwork = [
imageInputLayer([4 1 1], 'Normalization','none','Name','observation')
fullyConnectedLayer(numAction,'Name','action')];

Create the actor representation using the observation name and specification and the action name
and specification. Use the same representation options.
actor = rlRepresentation(actorNetwork,obsInfo,actInfo,...
'Observation',{'observation'},'Action',{'action'},repOpts)

actor =
rlStochasticActorRepresentation with properties:

Options: [1x1 rl.option.rlRepresentationOptions]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]

You can now use the actor and critic representations to create an AC agent.
agentOpts = rlACAgentOptions(...
'NumStepsToLookAhead',32,...
'DiscountFactor',0.99);
agent = rlACAgent(actor,critic,agentOpts)

agent =
rlACAgent with properties:

AgentOptions: [1x1 rl.option.rlACAgentOptions]

Create Q Table Representation

This example shows how to create a Q Table representation:

Create an environment interface.


env = rlPredefinedEnv("BasicGridWorld");

Create a Q table using the action and observation specifications from the environment.


qTable = rlTable(getObservationInfo(env),getActionInfo(env));

Create a representation for the Q table.

tableRep = rlRepresentation(qTable);

Create Quadratic Basis Function Critic Representation

This example shows how to create a linear basis function critic representation.

Assume that you have an environment, env. For this example, load the environment used in the
“Train Custom LQR Agent” example.

load myLQREnv.mat

Obtain the observation and action specifications from the environment.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a custom basis function. In this case, use the quadratic basis function from “Train Custom
LQR Agent”.

Set the dimensions and parameters required for your basis function.

n = 6;

Set an initial weight vector.

w0 = 0.1*ones(0.5*(n+1)*n,1);

Create a representation using a handle to the custom basis function.


critic = rlRepresentation(@(x,u) computeQuadraticBasis(x,u,n),w0,{obsInfo,actInfo});

Function to compute the quadratic basis from “Train Custom LQR Agent”.

function B = computeQuadraticBasis(x,u,n)
z = cat(1,x,u);
idx = 1;
for r = 1:n
for c = r:n
if idx == 1
B = z(r)*z(c);
else
B = cat(1,B,z(r)*z(c));
end
idx = idx + 1;
end
end
end

Input Arguments
net — Deep neural network for actor or critic
array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object


Deep neural network for actor or critic, specified as one of the following:

• Array of Layer objects


• layerGraph object
• DAGNetwork object
• SeriesNetwork object
• dlnetwork object

For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.

obsNames — Observation names


cell array of character vectors

Observation names, specified as a cell array of character vectors. The observation names are the
network input layer names you specify when you create net. The names in obsNames must be in the
same order as the observation specifications in obsInfo.
Example: {'observation'}

obsInfo — Observation specification


spec object | array of spec objects

Observation specification, specified as a reinforcement learning spec object or an array of spec


objects. You can extract obsInfo from an existing environment using getObservationInfo. Or,
you can construct the specs manually using a spec command such as rlFiniteSetSpec or
rlNumericSpec. This specification defines such information about the observations as the
dimensions and names of the observation signals.

actNames — Action name


single-element cell array that contains a character vector

Action name, specified as a single-element cell array that contains a character vector. The action
name is the network layer name you specify when you create net. For critic networks, this layer is
the first layer of the action input path. For actors, this layer is the last layer of the action output path.
Example: {'action'}

actInfo — Action specification


spec object

Action specification, specified as a reinforcement learning spec object. You can extract actInfo from
an existing environment using getActionInfo. Or, you can construct the spec manually using a
spec command such as rlFiniteSetSpec or rlNumericSpec. This specification defines such
information about the action as the dimensions and name of the action signal.

For linear basis function representations, the action signal must be a scalar, a column vector, or a
discrete action.

tab — Value table or Q table for critic


rlTable object

Value table or Q table for critic, specified as an rlTable object. The learnable parameters of a table
representation are the elements of tab.


basisFcn — Custom basis function


function handle

Custom basis function, specified as a function handle to a user-defined function. For a linear basis
function representation, the output of the representation is f = W'B, where W is a weight array and B
is the column vector returned by the custom basis function. The learnable parameters of a linear
basis function representation are the elements of W.

When creating:

• A critic representation with observation inputs only, your basis function must have the following
signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here obs1 to obsN are observations in the same order and with the same data type and
dimensions as the observation specifications in obsInfo.
• A critic representation with observation and action inputs, your basis function must have the
following signature.

B = myBasisFunction(obs1,obs2,...,obsN,act)

Here obs1 to obsN are observations in the same order and with the same data type and
dimensions as the observation specifications in the first element of oaInfo, and act has the same
data type and dimensions as the action specification in the second element of oaInfo.
• An actor representation, your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are observations in the same order and with the same data type and
dimensions as the observation specifications in obsInfo. The data types and dimensions of the
action specification in actInfo affect the data type and dimensions of f.

Example: @(x,u) myBasisFunction(x,u)
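
For instance, a minimal basis function matching the observation-only signature might look like the following sketch; the linear-plus-quadratic feature choice is illustrative only and not part of the toolbox.

function B = myBasisFunction(obs)
    % obs is a single observation channel (a column vector).
    % Return a column vector of features; W then has one weight per feature.
    B = [obs; obs.^2];
end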

W0 — Initial value for linear basis function weight vector


column vector | array

Initial value for linear basis function weight array, W, specified as one of the following:

• Column vector — When creating a critic representation or an actor representation with a


continuous scalar action signal
• Array — When creating an actor representation with a column vector continuous action signal or a
discrete action space.

oaInfo — Observation and action specifications


cell array

Observation and action specifications for creating linear basis function critic representations,
specified as the cell array {obsInfo,actInfo}.

repOpts — Representation options


rlRepresentationOptions object


Representation options, specified as an option set that you create with rlRepresentationOptions.
Available options include the optimizer used for training and the learning rate. See
rlRepresentationOptions for details.

Output Arguments
rep — Deep neural network representation
rlLayerRepresentation object

Deep neural network representation, returned as an rlLayerRepresentation object. Use this


representation to create an agent for reinforcement learning. For more information, see
“Reinforcement Learning Agents”.

tableCritic — Value or Q table critic representation


rlTableRepresentation object

Value or Q table critic representation, returned as an rlTableRepresentation object. Use this


representation to create an agent for reinforcement learning. For more information, see
“Reinforcement Learning Agents”.

critic — Linear basis function critic representation


rlLinearBasisRepresentation object

Linear basis function critic representation, returned as an rlLinearBasisRepresentation


object. Use this representation to create an agent for reinforcement learning. For more information,
see “Reinforcement Learning Agents”.

actor — Linear basis function actor representation


rlLinearBasisRepresentation object

Linear basis function actor representation, returned as an rlLinearBasisRepresentation


object. Use this representation to create an agent for reinforcement learning. For more information,
see “Reinforcement Learning Agents”.

Version History
Introduced in R2019a

rlRepresentation is not recommended


Not recommended starting in R2020a

rlRepresentation is not recommended. Depending on the type of representation being created,


use one of the following objects instead:

• rlValueRepresentation — State value critic, computed based on observations from the


environment.
• rlQValueRepresentation — State-action value critic, computed based on both actions and
observations from the environment.
• rlDeterministicActorRepresentation — Actor with deterministic actions, based on
observations from the environment.
• rlStochasticActorRepresentation — Actor with stochastic actions, based on observations
from the environment.


The following table shows some typical uses of the rlRepresentation function to create neural
network-based critics and actors, and how to update your code with one of the new objects instead.

Network-Based Representations

Not Recommended: rep = rlRepresentation(net,obsInfo,'Observation',obsName), with net having only observations as inputs, and a single scalar output.
Recommended: rep = rlValueRepresentation(net,obsInfo,'Observation',obsName). Use this syntax to create a representation for a critic that does not require action inputs, such as a critic for an rlACAgent or rlPGAgent agent.

Not Recommended: rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsName,'Action',actName), with net having both observations and action as inputs, and a single scalar output.
Recommended: rep = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',obsName,'Action',actName). Use this syntax to create a single-output state-action value representation for a critic that takes both observation and action as input, such as a critic for an rlDQNAgent or rlDDPGAgent agent.

Not Recommended: rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsName,'Action',actName), with net having observations as inputs and actions as outputs, and actInfo defining a continuous action space.
Recommended: rep = rlDeterministicActorRepresentation(net,obsInfo,actInfo,'Observation',obsName,'Action',actName). Use this syntax to create a deterministic actor representation for a continuous action space.

Not Recommended: rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsName,'Action',actName), with net having observations as inputs and actions as outputs, and actInfo defining a discrete action space.
Recommended: rep = rlStochasticActorRepresentation(net,obsInfo,actInfo,'Observation',obsName). Use this syntax to create a stochastic actor representation for a discrete action space.

The following table shows some typical uses of the rlRepresentation function to create table-based critics with discrete observation and action spaces, and how to update your code with one of the new objects instead.

Table-Based Representations

Not Recommended: rep = rlRepresentation(tab), with tab containing a value table consisting of a column vector as long as the number of possible observations.
Recommended: rep = rlValueRepresentation(tab,obsInfo). Use this syntax to create a representation for a critic that does not require action inputs, such as a critic for an rlACAgent or rlPGAgent agent.

Not Recommended: rep = rlRepresentation(tab), with tab containing a Q-value table with as many rows as the possible observations and as many columns as the possible actions.
Recommended: rep = rlQValueRepresentation(tab,obsInfo,actInfo). Use this syntax to create a single-output state-action value representation for a critic that takes both observation and action as input, such as a critic for an rlDQNAgent or rlDDPGAgent agent.

The following table shows some typical uses of the rlRepresentation function to create critics and actors that use a custom basis function, and how to update your code with one of the new objects instead. In the recommended function calls, the first input argument is a two-element cell array containing both the handle to the custom basis function and the initial weight vector or matrix.

Custom Basis Function-Based Representations

Not Recommended: rep = rlRepresentation(basisFcn,W0,obsInfo), where the basis function has only observations as inputs and W0 is a column vector.
Recommended: rep = rlValueRepresentation({basisFcn,W0},obsInfo). Use this syntax to create a representation for a critic that does not require action inputs, such as a critic for an rlACAgent or rlPGAgent agent.

Not Recommended: rep = rlRepresentation(basisFcn,W0,{obsInfo,actInfo}), where the basis function has both observations and action as inputs and W0 is a column vector.
Recommended: rep = rlQValueRepresentation({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a single-output state-action value representation for a critic that takes both observation and action as input, such as a critic for an rlDQNAgent or rlDDPGAgent agent.

Not Recommended: rep = rlRepresentation(basisFcn,W0,obsInfo,actInfo), where the basis function has observations as inputs and actions as outputs, W0 is a matrix, and actInfo defines a continuous action space.
Recommended: rep = rlDeterministicActorRepresentation({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a deterministic actor representation for a continuous action space.

Not Recommended: rep = rlRepresentation(basisFcn,W0,obsInfo,actInfo), where the basis function has observations as inputs and actions as outputs, W0 is a matrix, and actInfo defines a discrete action space.
Recommended: rep = rlStochasticActorRepresentation({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a stochastic actor representation for a discrete action space.

See Also
Functions
rlValueRepresentation | rlQValueRepresentation |
rlDeterministicActorRepresentation | rlStochasticActorRepresentation |
rlRepresentationOptions | getActionInfo | getObservationInfo


Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlSimulinkEnv
Create reinforcement learning environment using dynamic model implemented in Simulink

Syntax
env = rlSimulinkEnv(mdl,agentBlocks)
env = rlSimulinkEnv(mdl,agentBlocks,obsInfo,actInfo)
env = rlSimulinkEnv( ___ ,'UseFastRestart',fastRestartToggle)

Description
The rlSimulinkEnv function creates a reinforcement learning environment object from a Simulink
model. The environment object acts as an interface so that when you call sim or train, these functions
in turn call the Simulink model to generate experiences for the agents.

env = rlSimulinkEnv(mdl,agentBlocks) creates the reinforcement learning environment


object env for the Simulink model mdl. agentBlocks contains the paths to one or more
reinforcement learning agent blocks in mdl. If you use this syntax, each agent block must reference
an agent object already in the MATLAB workspace.

env = rlSimulinkEnv(mdl,agentBlocks,obsInfo,actInfo) creates the reinforcement


learning environment object env for the model mdl. The two cell arrays obsInfo and actInfo must
contain the observation and action specifications for each agent block in mdl, in the same order as
they appear in agentBlocks.

env = rlSimulinkEnv( ___ ,'UseFastRestart',fastRestartToggle) creates a


reinforcement learning environment object env and additionally enables fast restart. Use this syntax
after any of the input arguments in the previous syntaxes.

Examples

Create Simulink Environment Using Agent in Workspace

Create a Simulink environment using the trained agent and corresponding Simulink model from the
“Create Simulink Environment and Train Agent” example.

Load the agent in the MATLAB® workspace.


load rlWaterTankDDPGAgent

Create an environment for the rlwatertank model, which contains an RL Agent block. Since the
agent used by the block is already in the workspace, you do not need to pass the observation and
action specifications to create the environment.
env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent')

env =
SimulinkEnvWithAgent with properties:

Model : rlwatertank


AgentBlock : rlwatertank/RL Agent


ResetFcn : []
UseFastRestart : on

Validate the environment by performing a short simulation for two sample times.

validateEnvironment(env)

You can now train and simulate the agent within the environment by using train and sim,
respectively.

Create Reinforcement Learning Environment for Simulink Model

For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.

Open the model.

mdl = 'rlSimplePendulumModel';
open_system(mdl)

Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.

The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.

obsInfo = rlNumericSpec([3 1])

obsInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: [0×0 string]
Description: [0×0 string]
Dimension: [3 1]
DataType: "double"

The action is a scalar expressing the torque and can be one of three possible values, -2 Nm, 0 Nm and
2 Nm.

actInfo = rlFiniteSetSpec([-2 0 2])

actInfo =
rlFiniteSetSpec with properties:

Elements: [3×1 double]


Name: [0×0 string]
Description: [0×0 string]
Dimension: [1 1]
DataType: "double"


You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';

Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)

env =
SimulinkEnvWithAgent with properties:

Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on

You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
env.ResetFcn = @(in) setVariable(in,'theta0',randn,'Workspace',mdl)

env =
SimulinkEnvWithAgent with properties:

Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on

Create Simulink Environment for Multiple Agents

Create an environment for the Simulink model from the example “Train Multiple Agents to Perform
Collaborative Task”.

Load the agents in the MATLAB workspace.


load rlCollaborativeTaskAgents

Create an environment for the rlCollaborativeTask model, which has two agent blocks. Since the
agents used by the two blocks (agentA and agentB) are already in the workspace, you do not need
to pass their observation and action specifications to create the environment.
env = rlSimulinkEnv( ...
'rlCollaborativeTask', ...
["rlCollaborativeTask/Agent A","rlCollaborativeTask/Agent B"])

env =
SimulinkEnvWithAgent with properties:

Model : rlCollaborativeTask


AgentBlock : [
rlCollaborativeTask/Agent A
rlCollaborativeTask/Agent B
]
ResetFcn : []
UseFastRestart : on

You can now simulate or train the agents within the environment using sim or train, respectively.

Input Arguments
mdl — Simulink model name
string | character vector

Simulink model name, specified as a string or character vector. The model must contain at least one
RL Agent block.

agentBlocks — Agent block paths


string | character vector | string array

Agent block paths, specified as a string, character vector, or string array.

If mdl contains a single RL Agent block, specify agentBlocks as a string or character vector
containing the block path.

If mdl contains multiple RL Agent blocks, specify agentBlocks as a string array, where each
element contains the path of one agent block.

mdl can contain RL Agent blocks whose path is not included in agentBlocks. Such agent blocks
behave as part of the environment, selecting actions based on their current policies. When you call
sim or train, the experiences of these agents are not returned and their policies are not updated.

Multi-agent simulation is not supported for MATLAB environments.

The agent blocks can be inside of a model reference. For more information on configuring an agent
block for reinforcement learning, see RL Agent.

obsInfo — Observation information


rlNumericSpec object | rlFiniteSetSpec object | array of rlNumericSpec objects | array of
rlFiniteSetSpec objects | cell array

Observation information, specified as a specification object, an array of specification objects, or a cell


array.

If mdl contains a single agent block, specify obsInfo as an rlNumericSpec object, an


rlFiniteSetSpec object, or an array containing a mix of such objects.

If mdl contains multiple agent blocks, specify obsInfo as a cell array, where each cell contains a
specification object or array of specification objects for the corresponding block in agentBlocks.

For more information, see getObservationInfo.

actInfo — Action information


rlNumericSpec object | rlFiniteSetSpec object | cell array


Action information, specified as a specification object or a cell array.

If mdl contains a single agent block, specify actInfo as an rlNumericSpec or rlFiniteSetSpec


object.

If mdl contains multiple agent blocks, specify actInfo as a cell array, where each cell contains a
specification object for the corresponding block in agentBlocks.

For more information, see getActionInfo.
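
As a sketch for a model with two agent blocks (the model name, block path array, and specification variables below are assumed to already exist in the workspace), the cell arrays pair each block with its own specifications, in the same order as agentBlocks.

env = rlSimulinkEnv(mdl,agentBlks, ...
    {obsInfoA,obsInfoB}, ...   % observation specs for block A and block B
    {actInfoA,actInfoB});      % action specs for block A and block B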

fastRestartToggle — Option to toggle fast restart


'on' (default) | 'off'

Option to toggle fast restart, specified as either 'on' or 'off'. Fast restart allows you to perform
iterative simulations without compiling a model or terminating the simulation each time.

For more information on fast restart, see “How Fast Restart Improves Iterative Simulations”
(Simulink).
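
For instance, reusing the variable names from the earlier simple pendulum example, you could disable fast restart when creating the environment.

env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo, ...
    'UseFastRestart','off');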

Output Arguments
env — Reinforcement learning environment
SimulinkEnvWithAgent object

Reinforcement learning environment, returned as a SimulinkEnvWithAgent object.

For more information on reinforcement learning environments, see “Create Simulink Reinforcement
Learning Environments”.

Version History
Introduced in R2019a

See Also
Functions
train | sim | getObservationInfo | getActionInfo | rlNumericSpec | rlFiniteSetSpec

Blocks
RL Agent

Topics
“Train DDPG Agent to Control Double Integrator System”
“Train DDPG Agent to Swing Up and Balance Pendulum”
“Train DDPG Agent to Swing Up and Balance Cart-Pole System”
“Train DDPG Agent to Swing Up and Balance Pendulum with Bus Signal”
“Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation”
“Train DDPG Agent for Adaptive Cruise Control”
“How Fast Restart Improves Iterative Simulations” (Simulink)


runEpisode
Package: rl.env

Simulate reinforcement learning environment against policy or agent

Syntax
output = runEpisode(env,policy)

output = runEpisode(env,agent)

output = runEpisode( ___ ,Name=Value)

Description
output = runEpisode(env,policy) runs a single simulation of the environment env against the
policy policy.

output = runEpisode(env,agent) runs a single simulation of the environment env against the
agent agent.

output = runEpisode( ___ ,Name=Value) specifies nondefault simulation options using one or
more name-value arguments.

Examples

Simulate Environment and Agent

Create a reinforcement learning environment and extract its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the policy within the actor, use a neural network. Create the network as an array of layer objects.
net = [...
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(2)
softmaxLayer];

Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)


Initialized: true

Number of learnables: 770

Inputs:
1 'input' 4 features

Create a discrete categorical actor using the network.

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

Check your actor with a random observation.

act = getAction(actor,{rand(obsInfo.Dimension)})

act = 1x1 cell array


{[-10]}

Create a policy object from the actor.

policy = rlStochasticActorPolicy(actor);

Create an experience buffer.

buffer = rlReplayMemory(obsInfo,actInfo);

Set up the environment for running multiple simulations. For this example, configure the training to
log any errors rather than send them to the command window.

setup(env,StopOnError="off")

Simulate multiple episodes using the environment and policy. After each episode, append the
experiences to the buffer. For this example, run 100 episodes.

for i = 1:100
output = runEpisode(env,policy,MaxSteps=300);
append(buffer,output.AgentData.Experiences)
end

Clean up the environment.

cleanup(env)

Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.

batch = sample(buffer,10);

You can then learn from the sampled experiences and update the policy and actor.

Input Arguments
env — Reinforcement learning environment
environment object | ...

Reinforcement learning environment, specified as one of the following objects.


• rlFunctionEnv — Environment defined using custom functions


• SimulinkEnvWithAgent — Simulink environment created using rlSimulinkEnv or
createIntegratedEnv
• rlMDPEnv — Markov decision process environment
• rlNeuralNetworkEnvironment — Environment with deep neural network transition models
• Predefined environment created using rlPredefinedEnv
• Custom environment created from a template (rlCreateEnvTemplate)

policy — Policy
policy object | array of policy objects

Policy object, specified as one of the following objects.

• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlEpsilonGreedyPolicy
• rlMaxQPolicy
• rlStochasticActorPolicy

If env is a Simulink environment configured for multi-agent training, specify policy as an array of
policy objects. The order of the policies in the array must match the agent order used to create env.

For more information on a policy object, at the MATLAB command line, type help followed by the
policy object name.

agent — Reinforcement learning agent


agent object | array of agent objects

Reinforcement learning agent, specified as one of the following objects.

• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• rlMBPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.

If env is a Simulink environment configured for multi-agent training, specify agent as an array of
agent objects. The order of the agents in the array must match the agent order used to create env.


Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the


argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.
Example: MaxSteps=1000

MaxSteps — Maximum simulation steps


500 (default) | positive integer

Maximum simulation steps, specified as a positive integer.

ProcessExperienceFcn — Function for processing experiences


function handle | cell array of function handles

Function for processing experiences and updating the policy or agent based on each experience as it
occurs during the simulation, specified as a function handle with the following signature.

[updatedPolicy,updatedData] = myFcn(experience,episodeInfo,policy,data)

Here:

• experience is a structure that contains a single experience. For more information on the
structure fields, see output.Experiences.
• episodeInfo contains data about the current episode and corresponds to
output.EpisodeInfo.
• policy is the policy or agent object being simulated.
• data contains experience processing data. For more information, see ProcessExperienceData.
• updatedPolicy is the updated policy or agent.
• updatedData is the updated experience processing data, which is used as the data input when
processing the next experience.

If env is a Simulink environment configured for multi-agent training, specify


ProcessExperienceFcn as a cell array of function handles. The order of the function handles in the
array must match the agent order used to create env.
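
The following is a minimal sketch of such a function. The Buffer field carried in the experience processing data is an assumption made for this illustration, not a required name.

function [policy,data] = myProcessExperienceFcn(experience,episodeInfo,policy,data)
    % Append each incoming experience to a replay memory stored in data.
    append(data.Buffer,experience);
    % A learning update for policy could also be performed here.
end

You could then run an episode with, for example:

output = runEpisode(env,policy, ...
    ProcessExperienceFcn=@myProcessExperienceFcn, ...
    ProcessExperienceData=struct("Buffer",buffer));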

ProcessExperienceData — Experience processing data


any MATLAB data type | cell array

Experience processing data, specified as any MATLAB data, such as an array or structure. Use this
data to pass additional parameters or information to the experience processing function.

You can also update this data within the experience processing function to use different parameters
when processing the next experience. The data values that you specify when you call runEpisode
are used to process the first experience in the simulation.

If env is a Simulink environment configured for multi-agent training, specify


ProcessExperienceData as a cell array. The order of the array elements must match the agent
order used to create env.

CleanupPostSim — Option to clean up environment


true (default) | false


Option to clean up the environment after the simulation, specified as true or false. When
CleanupPostSim is true, runEpisode calls cleanup(env) when the simulation ends.

To run multiple episodes without cleaning up the environment, set CleanupPostSim to false. You
can then call cleanup(env) after running your simulations.

If env is a SimulinkEnvWithAgent object and the associated Simulink model is configured to use
fast restart, then the model remains in a compiled state between simulations when CleanUpPostSim
is false.

LogExperiences — Option to log experiences


true (default) | false

Option to log experiences for each policy or agent, specified as true or false. When
LogExperiences is true, the experiences of the policy or agent are logged in
output.Experiences.

Output Arguments
output — Simulation output
structure | structure array | rl.env.Future object

Simulation output, returned as a structure with the fields AgentData and SimulationInfo. When
you simulate multiple policies or agents, output is returned as a structure array.

The AgentData field is a structure with the following fields.

• Experiences — Logged experience of the policy or agent, returned as a structure array. Each experience contains the following fields.

  • Observation — Observation
  • Action — Action taken
  • NextObservation — Resulting next observation
  • Reward — Corresponding reward
  • IsDone — Termination signal
• Time — Simulation times of experiences, returned as a vector.
• EpisodeInfo — Episode information, returned as a structure with the following fields.

  • CumulativeReward — Total reward for all experiences
  • StepsTaken — Number of simulation steps taken
  • InitialObservation — Initial observation at the start of the simulation
• ProcessExperienceData — Experience processing data
• Agent — Policy or agent used in the simulation
The SimulationInfo field is one of the following:


• For MATLAB environments — Structure containing the field SimulationError. This structure
contains any errors that occurred during simulation.
• For Simulink environments — Simulink.SimulationOutput object containing simulation data.
Recorded data includes any signals and states that the model is configured to log, simulation
metadata, and any errors that occurred.

If env is configured to run simulations on parallel workers, then output is an rl.env.Future


object to support deferred evaluation of the simulation.

Tips
• You can speed up episode simulation by using parallel computing. To do so, use the setup
function and set the UseParallel argument to true.

setup(env,UseParallel=true)

Version History
Introduced in R2022a

See Also
setup | cleanup | reset

Topics
“Custom Training Loop with Simulink Action Noise”


sample
Package: rl.replay

Sample experiences from replay memory buffer

Syntax
experience = sample(buffer,batchSize)
experience = sample(buffer,batchSize,Name=Value)
[experience,Mask] = sample(buffer,batchSize,Name=Value)

Description
experience = sample(buffer,batchSize) returns a mini-batch of N experiences from the
replay memory buffer, where N is specified using batchSize.

experience = sample(buffer,batchSize,Name=Value) specifies additional sampling options


using one or more name-value pair arguments.

[experience,Mask] = sample(buffer,batchSize,Name=Value) returns a sequence padding mask indicating which experiences at the end of a sampled sequence are padded.

Examples

Create Experience Buffer

Define observation specifications for the environment. For this example, assume that the environment
has a single observation channel with three continuous signals in specified ranges.

obsInfo = rlNumericSpec([3 1],...


LowerLimit=0,...
UpperLimit=[1;5;10]);

Define action specifications for the environment. For this example, assume that the environment has
a single action channel with two continuous signals in specified ranges.

actInfo = rlNumericSpec([2 1],...


LowerLimit=0,...
UpperLimit=[5;10]);

Create an experience buffer with a maximum length of 20,000.

buffer = rlReplayMemory(obsInfo,actInfo,20000);

Append a single experience to the buffer using a structure. Each experience contains the following
elements: current observation, action, next observation, reward, and is-done.

For this example, create an experience with random observation, action, and reward values. Indicate
that this experience is not a terminal condition by setting the IsDone value to 0.


exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;

Append the experience to the buffer.

append(buffer,exp);

You can also append a batch of experiences to the experience buffer using a structure array. For this
example, append a sequence of 100 random experiences, with the final experience representing a
terminal condition.

for i = 1:100
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;

append(buffer,expBatch);

After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.

miniBatch = sample(buffer,50);

You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive
experiences with a discount factor of 0.95.

horizonSample = sample(buffer,1,...
NStepHorizon=10,...
DiscountFactor=0.95);

The returned sample includes the following information.

• Observation and Action are the observation and action from the first experience in the
horizon.
• NextObservation and IsDone are the next observation and termination signal from the final
experience in the horizon.
• Reward is the cumulative reward across the horizon using the specified discount factor.

You can also sample a sequence of consecutive experiences. In this case, the structure fields contain
arrays with values for all sampled experiences.

sequenceSample = sample(buffer,1,...
SequenceLength=20);


Create Experience Buffer with Multiple Observation Channels

Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.

obsContinuous = rlNumericSpec([2 1],...


LowerLimit=0,...
UpperLimit=[1;5]);
obsDiscrete = rlFiniteSetSpec([1 2 3]);
obsInfo = [obsContinuous obsDiscrete];

Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.

actInfo = rlNumericSpec([2 1],...


LowerLimit=0,...
UpperLimit=[5;10]);

Create an experience buffer with a maximum length of 5,000.

buffer = rlReplayMemory(obsInfo,actInfo,5000);

Append a sequence of 50 random experiences to the buffer.

for i = 1:50
exp(i).Observation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
exp(i).NextObservation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Reward = 10*rand(1);
exp(i).IsDone = 0;
end

append(buffer,exp);

After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.

miniBatch = sample(buffer,10);

Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object

Experience buffer, specified as an rlReplayMemory or rlPrioritizedReplayMemory object.

batchSize — Batch size


positive integer

Batch size of experiences to sample, specified as a positive integer.

If batchSize is greater than the current length of the buffer, then sample returns no experiences.


Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the
argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.
Example: DiscountFactor=0.95

SequenceLength — Sequence length


1 (default) | positive integer

Sequence length, specified as a positive integer. For each batch element, sample up to
SequenceLength consecutive experiences. If a sampled experience has a nonzero IsDone value,
stop the sequence at that experience.

NStepHorizon — N-step horizon length


1 (default) | positive integer

N-step horizon length, specified as a positive integer. For each batch element, sample up to
NStepHorizon consecutive experiences. If a sampled experience has a nonzero IsDone value, stop
the horizon at that experience. Return the following experience information based on the sampled
horizon.

• Observation and Action values from the first experience in the horizon
• NextObservation and IsDone values from the final experience in the horizon.
• Cumulative reward across the horizon using the specified discount factor, DiscountFactor.

Sampling an n-step horizon is not supported when sampling sequences. Therefore, if
SequenceLength > 1, then NStepHorizon must be 1.

DiscountFactor — Discount factor


0.99 (default) | nonnegative scalar less than or equal to one

Discount factor, specified as a nonnegative scalar less than or equal to one. When you sample a
horizon of experiences (NStepHorizon > 1), sample returns the cumulative reward R computed as
follows.

R = Σ_{i=1}^{N} γ^i R_i

Here:

• γ is the discount factor.


• N is the sampled horizon length, which can be less than NStepHorizon.
• Ri is the reward for the ith horizon step.

DiscountFactor applies only when NStepHorizon is greater than one.
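For illustration only, the following sketch evaluates the sum above for a short horizon of rewards. The variables are placeholders and are not part of the sample interface.

gamma = 0.95;                 % discount factor
stepRewards = [1 0.5 2];      % rewards R_1, R_2, R_3 over a 3-step horizon
R = sum(gamma.^(1:numel(stepRewards)).*stepRewards)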

DataSourceID — Data source index


-1 (default) | nonnegative integer

Data source index, specified as one of the following:


• -1 — Sample from the experiences of all data sources.


• Nonnegative integer — Sample from the experiences of only the data source specified by
DataSourceID.
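For example, if the buffer has been filled by several data sources (such as multiple parallel simulations), a call like the following sketch samples only from the source with index 0.

srcBatch = sample(buffer,32,DataSourceID=0);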

Output Arguments
experience — Experiences sampled from the buffer
structure

Experiences sampled from the buffer, returned as a structure with the following fields.

Observation — Observation
cell array

Observation, returned as a cell array with length equal to the number of observation specifications
specified when creating the buffer. Each element of Observation contains a DO-by-batchSize-by-
SequenceLength array, where DO is the dimension of the corresponding observation specification.

Action — Agent action


cell array

Agent action, returned as a cell array with length equal to the number of action specifications
specified when creating the buffer. Each element of Action contains a DA-by-batchSize-by-
SequenceLength array, where DA is the dimension of the corresponding action specification.

Reward — Reward value


scalar | array

Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-
SequenceLength array.

NextObservation — Next observation


cell array

Next observation reached by taking the specified action from the observation, returned as a cell array
with the same format as Observation.

IsDone — Termination signal


integer | array

Termination signal, returned as a 1-by-1-by-SequenceLength array of integers. Each element of
IsDone has one of the following values.

• 0 — This experience is not the end of an episode.


• 1 — The episode terminated because the environment generated a termination signal.
• 2 — The episode terminated by reaching the maximum episode length.

Mask — Sequence padding mask


logical array

Sequence padding mask, returned as a logical array with length equal to SequenceLength. When
the sampled sequence length is less than SequenceLength, the data returned in experience is
padded. Each element of Mask is true for a real experience and false for a padded experience.


You can ignore Mask when SequenceLength is 1.
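As a sketch of how you might use Mask, the following lines sample one sequence and average the reward over the real (unpadded) steps only. This assumes a buffer that already contains appended experiences.

seq = sample(buffer,1,SequenceLength=20);
rewards = squeeze(seq.Reward);     % SequenceLength-by-1 vector of rewards
isReal = squeeze(seq.Mask);        % true for real steps, false for padding
meanRealReward = mean(rewards(isReal));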

Version History
Introduced in R2022a

See Also
rlReplayMemory | append


setActor
Package: rl.agent

Set actor of reinforcement learning agent

Syntax
agent = setActor(agent,actor)

Description
agent = setActor(agent,actor) updates the reinforcement learning agent, agent, to use the
specified actor object, actor.

Examples

Modify Actor Parameter Values

Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.

load('DoubleIntegDDPG.mat','agent')

Obtain the actor function approximator from the agent.

actor = getActor(agent);

Obtain the learnable parameters from the actor.

params = getLearnableParameters(actor)

params=2×1 cell array


{[-15.4622 -7.2252]}
{[ 0]}

Modify the parameter values. For this example, simply multiply all of the parameters by 2.

modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);

Set the parameter values of the actor to the new modified values.

actor = setLearnableParameters(actor,modifiedParams);

Set the actor in the agent to the new modified actor.

setActor(agent,actor);

Display the new parameter values.

getLearnableParameters(getActor(agent))


ans=2×1 cell array


{[-30.9244 -14.4504]}
{[ 0]}

Modify Deep Neural Networks in Reinforcement Learning Agent

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.

Load the predefined environment.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Obtain observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.

agent = rlPPOAgent(obsInfo,actInfo);

To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.

actor = getActor(agent);
critic = getCritic(agent);

Extract the deep neural networks from both the actor and critic function approximators.

actorNet = getModel(actor);
criticNet = getModel(critic);

The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.

For example, view the actor network.

plot(layerGraph(actorNet))


To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)

You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))

In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.

To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.

For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks


Each of the modified networks includes an additional fullyConnectedLayer and reluLayer in
their main common path. View the modified actor network.

plot(layerGraph(modifiedActorNet))

After exporting the networks, insert the networks into the actor and critic function approximators.

actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);

Finally, insert the modified actor and critic function approximators into the actor and critic objects.

agent = setActor(agent,actor);
agent = setCritic(agent,critic);

Input Arguments
agent — Reinforcement learning agent
rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent | rlSACAgent | rlPPOAgent |
rlTRPOAgent

Reinforcement learning agent that contains an actor, specified as one of the following:

• rlPGAgent object
• rlDDPGAgent object


• rlTD3Agent object
• rlACAgent object
• rlSACAgent object
• rlPPOAgent object
• rlTRPOAgent object

Note agent is a handle object. Therefore, it is updated by setActor whether agent is returned as
an output argument or not. For more information about handle objects, see “Handle Object
Behavior”.

actor — Actor
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object

Actor object, specified as one of the following:

• rlContinuousDeterministicActor object — Specify when agent is an rlDDPGAgent or
rlTD3Agent object
• rlDiscreteCategoricalActor object — Specify when agent is an rlACAgent, rlPGAgent,
rlPPOAgent, rlTRPOAgent or rlSACAgent object for an environment with a discrete action
space.
• rlContinuousGaussianActor object — Specify when agent is an rlACAgent, rlPGAgent,
rlPPOAgent, rlTRPOAgent or rlSACAgent object for an environment with a continuous action
space.

The input and outputs of the approximation model in the actor (typically, a neural network) must
match the observation and action specifications of the original agent.

To create an actor, use one of the following methods:

• Create the actor using the corresponding function approximator object.


• Obtain the existing actor from an agent using getActor.
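As an illustration of the first method, the following sketch builds a small deterministic actor directly from the agent specifications and inserts it into the agent. It assumes an agent with a single continuous observation channel and a continuous action space, such as the DDPG agent from the previous example; the layer sizes are arbitrary.

obsInfo = getObservationInfo(agent);
actInfo = getActionInfo(agent);

% Build a small network mapping observations to actions.
net = [...
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(actInfo.Dimension(1))];
net = dlnetwork(net);

% Wrap the network in an actor object and insert it into the agent.
newActor = rlContinuousDeterministicActor(net,obsInfo,actInfo);
agent = setActor(agent,newActor);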

Output Arguments
agent — Updated reinforcement learning agent
rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent | rlSACAgent | rlPPOAgent |
rlTRPOAgent

Updated agent, returned as an agent object. Note that agent is a handle object. Therefore, its actor
is updated by setActor whether agent is returned as an output argument or not. For more
information about handle objects, see “Handle Object Behavior”.

Version History
Introduced in R2019a


See Also
getActor | getCritic | setCritic | getModel | setModel | getLearnableParameters |
setLearnableParameters

Topics
“Create Policies and Value Functions”
“Import Neural Network Models”


setCritic
Package: rl.agent

Set critic of reinforcement learning agent

Syntax
agent = setCritic(agent,critic)

Description
agent = setCritic(agent,critic) updates the reinforcement learning agent, agent, to use the
specified critic object, critic.

Examples

Modify Critic Parameter Values

Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.

load('DoubleIntegDDPG.mat','agent')

Obtain the critic function approximator from the agent.

critic = getCritic(agent);

Obtain the learnable parameters from the critic.

params = getLearnableParameters(critic)

params=2×1 cell array


{[-4.9889 -1.5548 -0.3434 -0.1111 -0.0500 -0.0035]}
{[ 0]}

Modify the parameter values. For this example, simply multiply all of the parameters by 2.

modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);

Set the parameter values of the critic to the new modified values.

critic = setLearnableParameters(critic,modifiedParams);

Set the critic in the agent to the new modified critic.

setCritic(agent,critic);

Display the new parameter values.

getLearnableParameters(getCritic(agent))


ans=2×1 cell array


{[-9.9778 -3.1095 -0.6867 -0.2223 -0.1000 -0.0069]}
{[ 0]}

Modify Deep Neural Networks in Reinforcement Learning Agent

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.

Load the predefined environment.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Obtain observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.

agent = rlPPOAgent(obsInfo,actInfo);

To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.

actor = getActor(agent);
critic = getCritic(agent);

Extract the deep neural networks from both the actor and critic function approximators.

actorNet = getModel(actor);
criticNet = getModel(critic);

The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.

For example, view the actor network.

plot(layerGraph(actorNet))


To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)

You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))

In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.

To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.

For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks


Each of the modified networks includes an additional fullyConnectedLayer and reluLayer in
their main common path. View the modified actor network.

plot(layerGraph(modifiedActorNet))

After exporting the networks, insert the networks into the actor and critic function approximators.

actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);

Finally, insert the modified actor and critic function approximators into the actor and critic objects.

agent = setActor(agent,actor);
agent = setCritic(agent,critic);

Input Arguments
agent — Reinforcement learning agent
rlQAgent | rlSARSAAgent | rlDQNAgent | rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent
| rlSACAgent | rlPPOAgent | rlTRPOAgent

Reinforcement learning agent that contains a critic, specified as one of the following:

• rlQAgent
• rlSARSAAgent


• rlDQNAgent
• rlPGAgent (when using a critic to estimate a baseline value function)
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent

Note agent is a handle object. Therefore, it is updated by setCritic whether agent is returned as
an output argument or not. For more information about handle objects, see “Handle Object
Behavior”.

critic — Critic
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object | two-
element row vector of rlQValueFunction objects

Critic object, specified as one of the following:

• rlValueFunction object — Specify when agent is an rlACAgent, rlPGAgent, or
rlPPOAgent object.
• rlQValueFunction object — Specify when agent is an rlQAgent, rlSARSAAgent,
rlDQNAgent, rlDDPGAgent, or rlTD3Agent object with a single critic.
• rlVectorQValueFunction object — Specify when agent is an rlQAgent, rlSARSAAgent, or
rlDQNAgent object with a discrete action space and a vector Q-value function critic.
• Two-element row vector of rlQValueFunction objects — Specify when agent is an
rlTD3Agent or rlSACAgent object with two critics.
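For an agent with two critics, get and set them as a two-element array. The following sketch copies the learnable parameters of the first critic into the second and inserts both back into the agent; it assumes a TD3 or SAC agent whose two critics share the same network architecture.

critics = getCritic(agent);     % 1-by-2 array of rlQValueFunction objects
params = getLearnableParameters(critics(1));
critics(2) = setLearnableParameters(critics(2),params);
agent = setCritic(agent,critics);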

Output Arguments
agent — Updated reinforcement learning agent
rlQAgent | rlSARSAAgent | rlDQNAgent | rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent
| rlSACAgent | rlPPOAgent | rlTRPOAgent

Updated agent, returned as an agent object. Note that agent is a handle object. Therefore, its critic
is updated by setCritic whether agent is returned as an output argument or not. For more
information about handle objects, see “Handle Object Behavior”.

Version History
Introduced in R2019a

See Also
getActor | getCritic | setActor | getModel | setModel | getLearnableParameters |
setLearnableParameters


Topics
“Create Policies and Value Functions”
“Import Neural Network Models”


setLearnableParameters
Package: rl.policy

Set learnable parameter values of agent, function approximator, or policy object

Syntax
setLearnableParameters(agent,pars)
agent = setLearnableParameters(agent,pars)

newFcn = setLearnableParameters(oldFcn,pars)

newPol = setLearnableParameters(oldPol,pars)

Description
Agent

setLearnableParameters(agent,pars) sets the learnable parameter values specified in pars in the
specified agent.

agent = setLearnableParameters(agent,pars) also returns the new agent as an output argument.


Actor or Critic

newFcn = setLearnableParameters(oldFcn,pars) returns a new actor or critic function
approximator object, newFcn, with the same structure as the original function object, oldFcn, and
the learnable parameter values specified in pars.
Policy

newPol = setLearnableParameters(oldPol,pars) returns a new policy object, newPol, with
the same structure as the original policy object, oldPol, and the learnable parameter values
specified in pars.

Examples

Modify Critic Parameter Values

Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.

load('DoubleIntegDDPG.mat','agent')

Obtain the critic function approximator from the agent.

critic = getCritic(agent);

Obtain the learnable parameters from the critic.

params = getLearnableParameters(critic)


params=2×1 cell array


{[-4.9889 -1.5548 -0.3434 -0.1111 -0.0500 -0.0035]}
{[ 0]}

Modify the parameter values. For this example, simply multiply all of the parameters by 2.

modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);

Set the parameter values of the critic to the new modified values.

critic = setLearnableParameters(critic,modifiedParams);

Set the critic in the agent to the new modified critic.

setCritic(agent,critic);

Display the new parameter values.

getLearnableParameters(getCritic(agent))

ans=2×1 cell array


{[-9.9778 -3.1095 -0.6867 -0.2223 -0.1000 -0.0069]}
{[ 0]}

Modify Actor Parameter Values

Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.

load('DoubleIntegDDPG.mat','agent')

Obtain the actor function approximator from the agent.

actor = getActor(agent);

Obtain the learnable parameters from the actor.

params = getLearnableParameters(actor)

params=2×1 cell array


{[-15.4622 -7.2252]}
{[ 0]}

Modify the parameter values. For this example, simply multiply all of the parameters by 2.

modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);

Set the parameter values of the actor to the new modified values.

actor = setLearnableParameters(actor,modifiedParams);

Set the actor in the agent to the new modified actor.

setActor(agent,actor);


Display the new parameter values.

getLearnableParameters(getActor(agent))

ans=2×1 cell array


{[-30.9244 -14.4504]}
{[ 0]}

Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object

Reinforcement learning agent, specified as one of the following objects:

• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.

Note agent is a handle object. Therefore, its parameters are updated by
setLearnableParameters whether agent is returned as an output argument or not. For more
information about handle objects, see “Handle Object Behavior”.

oldFcn — Original actor or critic function object


rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object

Original actor or critic function object, specified as one of the following:

• rlValueFunction object — Value function critic


• rlQValueFunction object — Q-value function critic
• rlVectorQValueFunction object — Multi-output Q-value function critic with a discrete action
space
• rlContinuousDeterministicActor object — Deterministic policy actor with a continuous
action space
• rlDiscreteCategoricalActor — Stochastic policy actor with a discrete action space


• rlContinuousGaussianActor object — Stochastic policy actor with a continuous action space

To create an actor or critic function object, use one of the following methods.

• Create a function object directly.


• Obtain the existing critic from an agent using getCritic.
• Obtain the existing actor from an agent using getActor.

oldPol — Reinforcement learning policy


rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlAdditiveNoisePolicy | rlStochasticActorPolicy

Reinforcement learning policy, specified as one of the following objects:

• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy

pars — Learnable parameter values


cell array

Learnable parameter values, specified as a cell array. The parameters in
pars must be compatible with the structure and parameterization of the agent, function
approximator, or policy object passed as a first argument.

To obtain a cell array of learnable parameter values from an existing agent, function approximator, or
policy object, which you can then modify, use the getLearnableParameters function.
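The same get-modify-set workflow applies to policy objects. The following sketch assumes that a policy object named policy, such as an rlStochasticActorPolicy object, already exists in the workspace.

pars = getLearnableParameters(policy);
modifiedPars = cellfun(@(x) x*0.5,pars,'UniformOutput',false);  % for example, scale the parameters
policy = setLearnableParameters(policy,modifiedPars);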

Output Arguments
newFcn — New actor or critic object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object

New actor or critic object, returned as a function object of the same type as oldFcn. Apart from the
learnable parameter values, newFcn is the same as oldFcn.

newPol — New reinforcement learning policy


rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlAdditiveNoisePolicy | rlStochasticActorPolicy

New reinforcement learning policy, returned as a policy object of the same type as oldPol. Apart
from the learnable parameter values, newPol is the same as oldPol.

agent — Updated agent


reinforcement learning agent object

Updated agent, returned as an agent object. Note that agent is a handle object. Therefore, its
parameters are updated by setLearnableParameters whether agent is returned as an output
argument or not. For more information about handle objects, see “Handle Object Behavior”.


Version History
Introduced in R2019a

setLearnableParameters now uses approximator objects instead of representation objects


Behavior changed in R2022a

Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, setLearnableParameters now uses function approximator
objects instead.

setLearnableParameterValues is now setLearnableParameters


Behavior changed in R2020a

setLearnableParameterValues is now setLearnableParameters. To update your code, change
the function name from setLearnableParameterValues to setLearnableParameters. The
syntaxes are equivalent.

See Also
getLearnableParameters | getActor | getCritic | setActor | setCritic

Topics
“Create Policies and Value Functions”
“Import Neural Network Models”


setModel
Package: rl.function

Set function approximation model for actor or critic

Syntax
newFcnAppx = setModel(oldFcnAppx,model)

Description
newFcnAppx = setModel(oldFcnAppx,model) returns a new actor or critic function object,
newFcnAppx, with the same configuration as the original function object, oldFcnAppx, and the
computational model specified in model.

Examples

Modify Deep Neural Networks in Reinforcement Learning Agent

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.

Load the predefined environment.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Obtain observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.

agent = rlPPOAgent(obsInfo,actInfo);

To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.

actor = getActor(agent);
critic = getCritic(agent);

Extract the deep neural networks from both the actor and critic function approximators.

actorNet = getModel(actor);
criticNet = getModel(critic);

The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.


For example, view the actor network.

plot(layerGraph(actorNet))

To validate a network, use analyzeNetwork. For example, validate the critic network.

analyzeNetwork(criticNet)

You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.

deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))

In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.

To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.


For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks

Each of the modified networks includes an additional fullyConnectedLayer and reluLayer in
their main common path. View the modified actor network.
plot(layerGraph(modifiedActorNet))

After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);

Finally, insert the modified actor and critic function approximators into the actor and critic objects.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);

Input Arguments
oldFcnAppx — Original actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object


Original actor or critic function object, specified as one of the following:

• rlValueFunction object — Value function critic


• rlQValueFunction object — Q-value function critic
• rlVectorQValueFunction object — Multi-output Q-value function critic with a discrete action
space
• rlContinuousDeterministicActor object — Deterministic policy actor with a continuous
action space
• rlDiscreteCategoricalActor — Stochastic policy actor with a discrete action space
• rlContinuousGaussianActor object — Stochastic policy actor with a continuous action space

To create an actor or critic function object, use one of the following methods.

• Create a function object directly.


• Obtain the existing critic from an agent using getCritic.
• Obtain the existing actor from an agent using getActor.

model — Function approximation model


array of Layer objects | layerGraph object | DAGNetwork object | dlnetwork object | rlTable
object | 1-by-2 cell array

Function approximation model, specified as one of the following:

• Deep neural network defined as an array of Layer objects, a layerGraph object, a DAGNetwork
object, or a dlnetwork object. The input and output layers of model must have the same names
and dimensions as the network returned by getModel for the same function object. Here, the
output layer is the layer immediately before the output loss layer.
• rlTable object with the same dimensions as the table model defined in oldFcnAppx.
• 1-by-2 cell array that contains the function handle for a custom basis function and the basis
function parameters.

When specifying a new model, you must use the same type of model as the one already defined in
oldFcnAppx.

Note For agents with more than one critic, such as TD3 and SAC agents, you must call setModel for
each critic individually, rather than calling setModel for the array of critics returned by
getCritic.
critics = getCritic(myTD3Agent);
% Modify critic networks.
critics(1) = setModel(critics(1),criticNet1);
critics(2) = setModel(critics(2),criticNet2);
myTD3Agent = setCritic(myTD3Agent,critics);
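For the custom basis function case, the model is a 1-by-2 cell array containing the function handle and its parameters. The following sketch assumes that oldFcnAppx is a value function critic that already uses a custom basis function for a two-element observation; the basis function and parameter sizes shown here are illustrative.

basisFcn = @(obs) [obs; obs.^2];     % basis producing a 4-by-1 feature vector
W0 = rand(4,1);                      % initial parameter values, one per feature
newFcnAppx = setModel(oldFcnAppx,{basisFcn,W0});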

Output Arguments
newFcnAppx — New actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object


New actor or critic function object, returned as a function object of the same type as oldFcnAppx.
Apart from the new computational model, newFcnAppx is the same as oldFcnAppx.

Version History
Introduced in R2020b

setModel now uses approximator objects instead of representation objects


Behavior changed in R2022a

Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, setModel now uses function approximator objects instead.

See Also
getActor | setActor | getCritic | setCritic | getModel

Topics
“Create Policies and Value Functions”


setup
Package: rl.env

Set up reinforcement learning environment or initialize data logger object

Syntax
setup(env)
setup(env,Name=Value)

setup(lgr)

Description
When you define a custom training loop for reinforcement learning, you can simulate an agent or
policy against an environment using the runEpisode function. Use the setup function to configure
the environment for running simulations using multiple calls to runEpisode.

Also use setup to initialize a FileLogger or MonitorLogger object before logging data within a
custom training loop.
Environment Objects

setup(env) sets up the specified reinforcement learning environment for running multiple
simulations using runEpisode.

setup(env,Name=Value) specifies nondefault configuration options using one or more name-value
pair arguments.
Data Logger Objects

setup(lgr) sets up the specified data logger object. Setup tasks may include setting up a
visualization, or creating directories for logging to file.

Examples

Simulate Environment and Agent

Create a reinforcement learning environment and extract its observation and action specifications.

env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the policy within the actor, use a neural network. Create a network as an
array of layer objects.

net = [...
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(24)
reluLayer


fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(2)
softmaxLayer];

Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 770

Inputs:
1 'input' 4 features

Create a discrete categorical actor using the network.


actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

Check your actor with a random observation.


act = getAction(actor,{rand(obsInfo.Dimension)})

act = 1x1 cell array


{[-10]}

Create a policy object from the actor.


policy = rlStochasticActorPolicy(actor);

Create an experience buffer.


buffer = rlReplayMemory(obsInfo,actInfo);

Set up the environment for running multiple simulations. For this example, configure the training to
log any errors rather than send them to the command window.
setup(env,StopOnError="off")

Simulate multiple episodes using the environment and policy. After each episode, append the
experiences to the buffer. For this example, run 100 episodes.
for i = 1:100
output = runEpisode(env,policy,MaxSteps=300);
append(buffer,output.AgentData.Experiences)
end

Clean up the environment.


cleanup(env)

Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.
batch = sample(buffer,10);

You can then learn from the sampled experiences and update the policy and actor.


Log Data to Disk in a Custom Training Loop

This example shows how to log data to disk when training an agent using a custom training loop.

Create a FileLogger object using rlDataLogger.


flgr = rlDataLogger();

Set up the logger object. This operation initializes the object performing setup tasks such as, for
example, creating the directory to save the data files.
setup(flgr);

Within a custom training loop, you can now store data to the logger object memory and write data to
file.

For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10

% Store three random numbers in memory


% as elements of the variable "Context1"
for ct = 1:3
store(flgr, "Context1", rand, iter);
end

% Store a random number in memory


% as the variable "Context2"
store(flgr, "Context2", rand, iter);

% Write data to file every 4 iterations


if mod(iter,4)==0
write(flgr);
end

end

Clean up the logger object. This operation performs clean up tasks like for example writing to file any
data still in memory.
cleanup(flgr);

Input Arguments
env — Reinforcement learning environment
rlFunctionEnv object | SimulinkEnvWithAgent object | rlNeuralNetworkEnvironment object
| rlMDPEnv object | ...

Reinforcement learning environment, specified as one of the following objects.


• rlFunctionEnv — Environment defined using custom functions.


• SimulinkEnvWithAgent — Simulink environment created using rlSimulinkEnv or
createIntegratedEnv
• rlMDPEnv — Markov decision process environment
• rlNeuralNetworkEnvironment — Environment with deep neural network transition models
• Predefined environment created using rlPredefinedEnv
• Custom environment created from a template (rlCreateEnvTemplate)

lgr — Data logger object


FileLogger object | MonitorLogger object | ...

Data logger object, specified as either a FileLogger or a MonitorLogger object.

Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the
argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.
Example: StopOnError="on"

StopOnError — Option to stop episode when error occurs


"on" (default) | "off"

Option to stop an episode when an error occurs, specified as one of the following:

• "on" — Stop the episode when an error occurs and generate an error message in the MATLAB
command window.
• "off" — Log errors in the SimulationInfo output of runEpisode.

UseParallel — Option for using parallel simulations


false (default) | true

Option for using parallel simulations, specified as a logical value. Using parallel computing allows
the usage of multiple cores, processors, computer clusters, or cloud resources to speed up simulation.

When you set UseParallel to true, the output of a subsequent call to runEpisode is an
rl.env.Future object, which supports deferred evaluation of the simulation.

SetupFcn — Function to run on each worker before running an episode


[] (default) | function handle

Function to run on each worker before running an episode, specified as a handle to a function
with no input arguments. Use this function to perform any preprocessing required before running an
episode.

CleanupFcn — Function to run on each worker when cleaning up environment


[] (default) | function handle

Function to run on each worker when cleaning up the environment, specified as a handle to a
function with no input arguments. Use this function to clean up the workspace or perform other
processing after calling runEpisode.


TransferBaseWorkspaceVariables — Option to send model and workspace variables to
parallel workers
"on" (default) | "off"

Option to send model and workspace variables to parallel workers, specified as "on" or "off". When
the option is "on", the client sends variables used in models and defined in the base MATLAB
workspace to the workers.

AttachedFiles — Additional files to attach to parallel pool


string | string array

Additional files to attach to the parallel pool before running an episode, specified as a string or string
array.

WorkerRandomSeeds — Worker random seeds


-1 (default) | vector

Worker random seeds, specified as one of the following:

• -1 — Set the random seed of each worker to the worker ID.


• Vector with length equal to the number of workers — Specify the random seed for each worker.
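As a sketch of a nondefault configuration, the following call combines several of these options. The function handles are placeholders for your own preprocessing and cleanup code, and the "myHelpers" folder is hypothetical.

setup(env, ...
    StopOnError="off", ...
    UseParallel=true, ...
    SetupFcn=@() addpath("myHelpers"), ...
    CleanupFcn=@() rmpath("myHelpers"), ...
    WorkerRandomSeeds=-1)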

Version History
Introduced in R2022a

See Also
Functions
runEpisode | cleanup | reset | store | write

Objects
rlFunctionEnv | rlMDPEnv | SimulinkEnvWithAgent | rlNeuralNetworkEnvironment |
FileLogger | MonitorLogger

Topics
“Custom Training Loop with Simulink Action Noise”


sim
Package: rl.env

Simulate trained reinforcement learning agents within specified environment

Syntax
experience = sim(env,agents)
experience = sim(agents,env)

experience = sim( ___ ,simOpts)

Description
experience = sim(env,agents) simulates one or more reinforcement learning agents within an
environment, using default simulation options.

experience = sim(agents,env) performs the same simulation as the previous syntax.

experience = sim( ___ ,simOpts) uses the simulation options object simOpts. Use simulation options to
specify parameters such as the number of steps per simulation or the number of simulations to run.
Use this syntax after any of the input arguments in the previous syntaxes.

Examples

Simulate Reinforcement Learning Environment

Simulate a reinforcement learning environment with an agent configured for that environment. For
this example, load an environment and agent that are already configured. The environment is a
discrete cart-pole environment created with rlPredefinedEnv. The agent is a policy gradient
(rlPGAgent) agent. For more information about the environment and agent used in this example, see
“Train PG Agent to Balance Cart-Pole System”.

rng(0) % for reproducibility


load RLSimExample.mat
env

env =
CartPoleDiscreteAction with properties:

Gravity: 9.8000
MassCart: 1
MassPole: 0.1000
Length: 0.5000
MaxForce: 10
Ts: 0.0200
ThetaThresholdRadians: 0.2094
XThreshold: 2.4000
RewardForNotFalling: 1
PenaltyForFalling: -5


State: [4x1 double]

agent

agent =
rlPGAgent with properties:

AgentOptions: [1x1 rl.option.rlPGAgentOptions]


UseExplorationPolicy: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 0.1000

Typically, you train the agent using train and simulate the environment to test the performance of
the trained agent. For this example, simulate the environment using the agent you loaded. Configure
simulation options, specifying that the simulation run for 100 steps.

simOpts = rlSimulationOptions('MaxSteps',100);

For the predefined cart-pole environment used in this example, you can use plot to generate a
visualization of the cart-pole system. When you simulate the environment, this plot updates
automatically so that you can watch the system evolve during the simulation.

plot(env)

Simulate the environment.

experience = sim(env,agent,simOpts)


experience = struct with fields:


Observation: [1x1 struct]
Action: [1x1 struct]
Reward: [1x1 timeseries]
IsDone: [1x1 timeseries]
SimulationInfo: [1x1 struct]

The output structure experience records the observations collected from the environment, the
action and reward, and other data collected during the simulation. Each field contains a timeseries
object or a structure of timeseries data objects. For instance, experience.Action is a
timeseries containing the action imposed on the cart-pole system by the agent at each step of the
simulation.
experience.Action

ans = struct with fields:


CartPoleAction: [1x1 timeseries]

Simulate Simulink Environment with Multiple Agents

Simulate an environment created for the Simulink® model used in the example “Train Multiple
Agents to Perform Collaborative Task”, using the agents trained in that example.

Load the agents in the MATLAB® workspace.


load rlCollaborativeTaskAgents

Create an environment for the rlCollaborativeTask Simulink® model, which has two agent
blocks. Since the agents used by the two blocks (agentA and agentB) are already in the workspace,
you do not need to pass their observation and action specifications to create the environment.
env = rlSimulinkEnv( ...
'rlCollaborativeTask', ...
["rlCollaborativeTask/Agent A","rlCollaborativeTask/Agent B"]);

Load the parameters that are needed by the rlCollaborativeTask Simulink® model to run.


rlCollaborativeTaskParams

Simulate the agents against the environment, saving the experiences in xpr.

xpr = sim(env,[agentA agentB]);

Plot actions of both agents.

subplot(2,1,1); plot(xpr(1).Action.forces)
subplot(2,1,2); plot(xpr(2).Action.forces)

Input Arguments
env — Environment
reinforcement learning environment object

Environment in which the agents act, specified as one of the following kinds of reinforcement
learning environment object:

• A predefined MATLAB or Simulink environment created using rlPredefinedEnv. This kind of
environment does not support training multiple agents at the same time.
• A custom MATLAB environment you create with functions such as rlFunctionEnv or
rlCreateEnvTemplate. This kind of environment does not support training multiple agents at
the same time.


• A custom Simulink environment you create using rlSimulinkEnv. This kind of environment
supports training multiple agents at the same time.

For more information about creating and configuring environments, see:

• “Create MATLAB Reinforcement Learning Environments”


• “Create Simulink Reinforcement Learning Environments”

When env is a Simulink environment, calling sim compiles and simulates the model associated with
the environment.

agents — Agents
reinforcement learning agent object | array of agent objects

Agents to simulate, specified as a reinforcement learning agent object, such as rlACAgent or
rlDDPGAgent, or as an array of such objects.

If env is a multi-agent environment created with rlSimulinkEnv, specify agents as an array. The
order of the agents in the array must match the agent order used to create env. Multi-agent
simulation is not supported for MATLAB environments.

For more information about how to create and configure agents for reinforcement learning, see
“Reinforcement Learning Agents”.

simOpts — Simulation options


rlSimulationOptions object

Simulation options, specified as an rlSimulationOptions object. Use this argument to specify
options such as:

• Number of steps per simulation


• Number of simulations to run

For details, see rlSimulationOptions.
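For instance, the following sketch runs three simulations of up to 500 steps each and then examines the reward from the first run. It assumes that env and agent already exist in the workspace.

simOpts = rlSimulationOptions('MaxSteps',500,'NumSimulations',3);
experience = sim(env,agent,simOpts);   % 3-by-1 structure array, one row per simulation
rewardsRun1 = experience(1).Reward;    % timeseries of rewards from the first run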

Output Arguments
experience — Simulation results
structure | structure array

Simulation results, returned as a structure or structure array. The number of rows in the array is
equal to the number of simulations specified by the NumSimulations option of
rlSimulationOptions. The number of columns in the array is the number of agents. The fields of
each experience structure are as follows.

Observation — Observations
structure

Observations collected from the environment, returned as a structure with fields corresponding to the
observations specified in the environment. Each field contains a timeseries of length N + 1, where
N is the number of simulation steps.

To obtain the current observation and the next observation for a given simulation step, use code such
as the following, assuming one of the fields of Observation is obs1.


Obs = getSamples(experience.Observation.obs1,1:N);
NextObs = getSamples(experience.Observation.obs1,2:N+1);

These values can be useful if you are writing your own training algorithm using sim to generate
experiences for training.

Action — Actions
structure

Actions computed by the agent, returned as a structure with fields corresponding to the action
signals specified in the environment. Each field contains a timeseries of length N, where N is the
number of simulation steps.

Reward — Rewards
timeseries

Reward at each step in the simulation, returned as a timeseries of length N, where N is the
number of simulation steps.

IsDone — Flag indicating termination of episode


timeseries

Flag indicating termination of the episode, returned as a timeseries of a scalar logical signal. This
flag is set at each step by the environment, according to conditions you specify for episode
termination when you configure the environment. When the environment sets this flag to 1,
simulation terminates.

SimulationInfo — Information collected during simulation


structure | vector of Simulink.SimulationOutput objects

Information collected during simulation, returned as one of the following:

• For MATLAB environments, a structure containing the field SimulationError. This structure
contains any errors that occurred during simulation.
• For Simulink environments, a Simulink.SimulationOutput object containing simulation data.
Recorded data includes any signals and states that the model is configured to log, simulation
metadata, and any errors that occurred.

Version History
Introduced in R2019a

See Also
train | rlSimulationOptions

Topics
“Train Reinforcement Learning Agents”


store
Package: rl.logging

Store data in the internal memory of a (file or monitor) logger object

Syntax
store(lgr,context,data,iter)

Description
store(lgr,context,data,iter) stores data in the internal memory of lgr, grouped in a variable
named context and associated with iteration iter.

Examples

Log Data to Disk in a Custom Training Loop

This example shows how to log data to disk when training an agent using a custom training loop.

Create a FileLogger object using rlDataLogger.

flgr = rlDataLogger();

Set up the logger object. This operation initializes the object performing setup tasks such as, for
example, creating the directory to save the data files.

setup(flgr);

Within a custom training loop, you can now store data to the logger object memory and write data to
file.

For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.

for iter = 1:10

% Store three random numbers in memory


% as elements of the variable "Context1"
for ct = 1:3
store(flgr, "Context1", rand, iter);
end

% Store a random number in memory


% as the variable "Context2"
store(flgr, "Context2", rand, iter);


% Write data to file every 4 iterations


if mod(iter,4)==0
write(flgr);
end

end

Clean up the logger object. This operation performs clean up tasks like for example writing to file any
data still in memory.

cleanup(flgr);

Input Arguments
lgr — Data logger object
FileLogger object | MonitorLogger object | ...

Data logger object, specified as either a FileLogger or a MonitorLogger object.

context — Name of the saved variable


string | char array | ...

Name of the saved variable, specified as either a string or character array. All data associated with
iteration iter and the name context is vertically concatenated in a single MATLAB variable named
context. This variable is then written to the logger target (either a MAT file or a
trainingProgressMonitor object) when write is invoked for lgr.

data — Data to be saved


any fundamental MATLAB datatype | ...

Data to be saved, specified as any fundamental MATLAB datatype. Data associated with the same
iteration and the same context name is grouped in a single variable.

iter — Iteration number


positive integer | ...

Iteration number, specified as a positive integer. When store is executed multiple times with the
same iteration number, data is appended to the memory entry for that iteration. This memory entry,
which can contain several context variables, is then written to the logger target as a single unit (for
example as a single MAT file) when write is invoked for lgr.

Version History
Introduced in R2022b

See Also
Functions
rlDataLogger | train | write | setup | cleanup

Objects
FileLogger | MonitorLogger


Topics
“Log Training Data To Disk”
“Function Handles”
“Handle Object Behavior”


train
Package: rl.agent

Train reinforcement learning agents within a specified environment

Syntax
trainStats = train(env,agents)
trainStats = train(agents,env)

trainStats = train( ___ ,trainOpts)

trainStats = train(agents,env,prevTrainStats)

Description
trainStats = train(env,agents) trains one or more reinforcement learning agents within a
specified environment, using default training options. Although agents is an input argument, after
each training episode, train updates the parameters of each agent specified in agents to maximize
their expected long-term reward from the environment. This is possible because each agent is a
handle object. When training terminates, agents reflects the state of each agent at the end of the
final training episode.

trainStats = train(agents,env) performs the same training as the previous syntax.

trainStats = train( ___ ,trainOpts) trains agents within env, using the training options
object trainOpts. Use training options to specify training parameters such as the criteria for
terminating training, when to save agents, the maximum number of episodes to train, and the
maximum number of steps per episode. Use this syntax after any of the input arguments in the
previous syntaxes.

trainStats = train(agents,env,prevTrainStats) resumes training from the last values of
the agent parameters and training results contained in prevTrainStats obtained after the previous
function call to train.
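For example, the following sketch performs an initial training run and later resumes it from the returned results. It assumes that agent, env, and trainOpts already exist in the workspace.

trainStats = train(agent,env,trainOpts);   % initial training run
% Later, continue training from where the previous run stopped.
trainStats = train(agent,env,trainStats);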

Examples

Train a Reinforcement Learning Agent

Train the agent configured in the “Train PG Agent to Balance Cart-Pole System” example, within the
corresponding environment. The observation from the environment is a vector containing the position
and velocity of a cart, as well as the angular position and velocity of the pole. The action is a scalar
with two possible elements (a force of either -10 or 10 Newtons applied to a cart).

Load the file containing the environment and a PG agent already configured for it.
load RLTrainExample.mat

Specify some training parameters using rlTrainingOptions. These parameters include the
maximum number of episodes to train, the maximum steps per episode, and the conditions for


terminating training. For this example, use a maximum of 1000 episodes and 500 steps per episode.
Instruct the training to stop when the average reward over the previous five episodes reaches 500.
Create a default options set and use dot notation to change some of the parameter values.

trainOpts = rlTrainingOptions;

trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 500;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 500;
trainOpts.ScoreAveragingWindowLength = 5;

During training, the train command can save candidate agents that give good results. Further
configure the training options to save an agent when the episode reward exceeds 500. Save the agent
to a folder called savedAgents.

trainOpts.SaveAgentCriteria = "EpisodeReward";
trainOpts.SaveAgentValue = 500;
trainOpts.SaveAgentDirectory = "savedAgents";

Finally, turn off the command-line display. Turn on the Reinforcement Learning Episode Manager so
you can observe the training progress visually.

trainOpts.Verbose = false;
trainOpts.Plots = "training-progress";

You are now ready to train the PG agent. For the predefined cart-pole environment used in this
example, you can use plot to generate a visualization of the cart-pole system.

plot(env)

When you run this example, both this visualization and the Reinforcement Learning Episode Manager
update with each training episode. Place them side by side on your screen to observe the progress,
and train the agent. (This computation can take 20 minutes or more.)

trainingInfo = train(agent,env,trainOpts);


Episode Manager shows that the training successfully reaches the termination condition of a reward
of 500 averaged over the previous five episodes. At each training episode, train updates agent with
the parameters learned in the previous episode. When training terminates, you can simulate the
environment with the trained agent to evaluate its performance. The environment plot updates during
simulation as it did during training.

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);

During training, train saves to disk any agents that meet the condition specified with
trainOpts.SaveAgentCriteria and trainOpts.SaveAgentValue. To test the performance of any

of those agents, you can load the data from the data files in the folder you specified using
trainOpts.SaveAgentDirectory, and simulate the environment with that agent.

Train Multiple Agents to Perform Collaborative Task

This example shows how to set up a multi-agent training session on a Simulink® environment. In the
example, you train two agents to collaboratively perform the task of moving an object.

The environment in this example is a frictionless two dimensional surface containing elements
represented by circles. A target object C is represented by the blue circle with a radius of 2 m, and
robots A (red) and B (green) are represented by smaller circles with radii of 1 m each. The robots
attempt to move object C outside a circular ring of a radius 8 m by applying forces through collision.
All elements within the environment have mass and obey Newton's laws of motion. In addition,
contact forces between the elements and the environment boundaries are modeled as spring and
mass damper systems. The elements can move on the surface through the application of externally
applied forces in the X and Y directions. There is no motion in the third dimension and the total
energy of the system is conserved.

Set the random seed and create the set of parameters required for this example.
rng(10)
rlCollaborativeTaskParams

Open the Simulink model.


mdl = "rlCollaborativeTask";
open_system(mdl)

For this environment:

• The 2-dimensional space is bounded from –12 m to 12 m in both the X and Y directions.
• The contact spring stiffness and damping values are 100 N/m and 0.1 N/m/s, respectively.
• The agents share the same observations for positions, velocities of A, B, and C and the action
values from the last time step.
• The simulation terminates when object C moves outside the circular ring.
• At each time step, the agents receive the following reward:


r_A = r_global + r_local,A
r_B = r_global + r_local,B
r_global = 0.001 d_C
r_local,A = -0.005 d_AC - 0.008 u_A^2
r_local,B = -0.005 d_BC - 0.008 u_B^2

Here:

• r_A and r_B are the rewards received by agents A and B, respectively.
• r_global is a team reward that is received by both agents as object C moves closer to the boundary of the ring.
• r_local,A and r_local,B are local penalties received by agents A and B based on their distances from object C and the magnitude of the action from the last time step.
• d_C is the distance of object C from the center of the ring.
• d_AC and d_BC are the distances between agent A and object C and between agent B and object C, respectively.
• The agents apply external forces on the robots that result in motion. u_A and u_B are the action values of the two agents A and B from the last time step. The range of action values is between -1 and 1.
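
As an illustration, the following sketch computes these reward terms for one time step. The variable names are hypothetical, the distances dC, dAC, and dBC and the previous actions uA and uB (2-by-1 force vectors) are assumed to be available, and u^2 is interpreted as the squared norm of the action vector.

% Illustrative sketch of the reward computation (hypothetical variable names).
rGlobal = 0.001*dC;                       % team reward
rLocalA = -0.005*dAC - 0.008*sum(uA.^2);  % local penalty for agent A
rLocalB = -0.005*dBC - 0.008*sum(uB.^2);  % local penalty for agent B
rA = rGlobal + rLocalA;
rB = rGlobal + rLocalB;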

Environment

To create a multi-agent environment, specify the block paths of the agents using a string array. Also,
specify the observation and action specification objects using cell arrays. The order of the
specification objects in the cell array must match the order specified in the block path array. When
agents are available in the MATLAB workspace at the time of environment creation, the observation
and action specification arrays are optional. For more information on creating multi-agent
environments, see rlSimulinkEnv.

Create the I/O specifications for the environment. In this example, the agents are homogeneous and
have the same I/O specifications.

% Number of observations
numObs = 16;

% Number of actions
numAct = 2;

% Maximum value of externally applied force (N)
maxF = 1.0;

% I/O specifications for each agent
oinfo = rlNumericSpec([numObs,1]);
ainfo = rlNumericSpec([numAct,1], ...
    "UpperLimit", maxF, ...
    "LowerLimit", -maxF);
oinfo.Name = "observations";
ainfo.Name = "forces";

Create the Simulink environment interface.


blks = ["rlCollaborativeTask/Agent A", "rlCollaborativeTask/Agent B"];
obsInfos = {oinfo,oinfo};
actInfos = {ainfo,ainfo};
env = rlSimulinkEnv(mdl,blks,obsInfos,actInfos);

Specify a reset function for the environment. The reset function resetRobots ensures that the
robots start from random initial positions at the beginning of each episode.

env.ResetFcn = @(in) resetRobots(in,RA,RB,RC,boundaryR);
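
As a sketch, a reset function for a Simulink environment receives and returns a Simulink.SimulationInput object and typically randomizes model variables. The variable names below are hypothetical placeholders, not the ones used by resetRobots.

function in = myResetFcn(in)
    % Place robot A at a random angle on a circle of radius 4 m (hypothetical variables).
    th = 2*pi*rand;
    in = setVariable(in,"xA0",4*cos(th));
    in = setVariable(in,"yA0",4*sin(th));
end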

Agents

This example uses two Proximal Policy Optimization (PPO) agents with continuous action spaces. The
agents apply external forces on the robots that result in motion. To learn more about PPO agents, see
“Proximal Policy Optimization Agents”.

The agents collect experiences until the experience horizon (600 steps) is reached. After trajectory
completion, the agents learn from mini-batches of 300 experiences. An objective function clip factor
of 0.2 is used to improve training stability and a discount factor of 0.99 is used to encourage long-
term rewards.

Specify the agent options for this example.

agentOptions = rlPPOAgentOptions(...
"ExperienceHorizon",600,...
"ClipFactor",0.2,...
"EntropyLossWeight",0.01,...
"MiniBatchSize",300,...
"NumEpoch",4,...
"AdvantageEstimateMethod","gae",...
"GAEFactor",0.95,...
"SampleTime",Ts,...
"DiscountFactor",0.99);
agentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
agentOptions.CriticOptimizerOptions.LearnRate = 1e-4;

Create the agents using the default agent creation syntax. For more information see rlPPOAgent.

agentA = rlPPOAgent(oinfo, ainfo, ...
    rlAgentInitializationOptions("NumHiddenUnit", 200), agentOptions);
agentB = rlPPOAgent(oinfo, ainfo, ...
    rlAgentInitializationOptions("NumHiddenUnit", 200), agentOptions);

Training

To train multiple agents, you can pass an array of agents to the train function. The order of agents
in the array must match the order of agent block paths specified during environment creation. Doing
so ensures that the agent objects are linked to their appropriate I/O interfaces in the environment.

You can train multiple agents in a decentralized or centralized manner. In decentralized training,
agents collect their own set of experiences during the episodes and learn independently from those
experiences. In centralized training, the agents share the collected experiences and learn from them
together. The actor and critic functions are synchronized between the agents after trajectory
completion.

To configure a multi-agent training, you can create agent groups and specify a learning strategy for
each group through the rlMultiAgentTrainingOptions object. Each agent group may contain
unique agent indices, and the learning strategy can be "centralized" or "decentralized". For
example, you can use the following command to configure training for three agent groups with
different learning strategies. The agents with indices [1,2] and [3,5] learn in a centralized manner
while agent 4 learns in a decentralized manner.

opts = rlMultiAgentTrainingOptions( ...
    "AgentGroups", {[1,2], 4, [3,5]}, ...
    "LearningStrategy", ["centralized","decentralized","centralized"])

For more information on multi-agent training, type help rlMultiAgentTrainingOptions in MATLAB.

You can perform decentralized or centralized training by running one of the following sections using
the Run Section button.

1. Decentralized Training

To configure decentralized multi-agent training for this example:

• Automatically assign agent groups using the AgentGroups=auto option. This allocates each
agent in a separate group.
• Specify the "decentralized" learning strategy.
• Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.
• Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or
more.

trainOpts = rlMultiAgentTrainingOptions(...
"AgentGroups","auto",...
"LearningStrategy","decentralized",...
"MaxEpisodes",1000,...
"MaxStepsPerEpisode",600,...
"ScoreAveragingWindowLength",30,...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",-10);

Train the agents using the train function. Training can take several hours to complete depending on
the available computational power. To save time, load the MAT file decentralizedAgents.mat
which contains a set of pretrained agents. To train the agents yourself, set doTraining to true.

doTraining = false;
if doTraining
decentralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
load("decentralizedAgents.mat");
end

The following figure shows a snapshot of decentralized training progress. You can expect different
results due to randomness in the training process.


2. Centralized Training

To configure centralized multi-agent training for this example:

• Allocate both agents (with indices 1 and 2) in a single group. You can do this by specifying the
agent indices in the "AgentGroups" option.
• Specify the "centralized" learning strategy.
• Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.
• Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or
more.

trainOpts = rlMultiAgentTrainingOptions(...
"AgentGroups",{[1,2]},...
"LearningStrategy","centralized",...
"MaxEpisodes",1000,...
"MaxStepsPerEpisode",600,...
"ScoreAveragingWindowLength",30,...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",-10);

Train the agents using the train function. Training can take several hours to complete depending on
the available computational power. To save time, load the MAT file centralizedAgents.mat which
contains a set of pretrained agents. To train the agents yourself, set doTraining to true.


doTraining = false;
if doTraining
centralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
load("centralizedAgents.mat");
end

The following figure shows a snapshot of centralized training progress. You can expect different
results due to randomness in the training process.

Simulation

Once the training is finished, simulate the trained agents with the environment.

simOptions = rlSimulationOptions("MaxSteps",300);
exp = sim(env,[agentA agentB],simOptions);


For more information on agent simulation, see rlSimulationOptions and sim.

Stop and Resume Training Using Existing Training Data

This example shows how to resume training a Q-learning agent using existing training data. For
more information on these agents, see “Q-Learning Agents” and “SARSA Agents”.

Create Grid World Environment

For this example, create the basic grid world environment.


env = rlPredefinedEnv("BasicGridWorld");

Create a reset function that randomly selects the initial state of the agent at the beginning of each
episode, excluding the obstacle and terminal states.

x0 = [1:12 15:17 19:22 24];
env.ResetFcn = @() x0(randi(numel(x0)));

Fix the random generator seed for reproducibility.


rng(1)

Create Q-Learning Agent

To create a Q-learning agent, first create a Q table using the observation and action specifications
from the grid world environment, then create a Q-value function that uses the table as its model.

qTable = rlTable(getObservationInfo(env),getActionInfo(env));
qVf = rlQValueFunction(qTable,getObservationInfo(env),getActionInfo(env));

Next, create a Q-learning agent using this Q-value function and configure the epsilon-greedy
exploration. For more information on creating Q-learning agents, see rlQAgent and
rlQAgentOptions. In the agent options, set the critic learning rate to 0.2 and the discount factor to 1.


agentOpts = rlQAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon = 0.2;
agentOpts.CriticOptimizerOptions.LearnRate = 0.2;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-3;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 1e-3;
agentOpts.DiscountFactor = 1;
qAgent = rlQAgent(qVf,agentOpts);

Train Q-Learning Agent for 100 Episodes

To train the agent, first specify the training options. For more information, see rlTrainingOptions.

trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 200;
trainOpts.MaxEpisodes = 1e6;
trainOpts.Plots = "none";
trainOpts.Verbose = false;

trainOpts.StopTrainingCriteria = "EpisodeCount";
trainOpts.StopTrainingValue = 100;
trainOpts.ScoreAveragingWindowLength = 30;

Train the Q-learning agent using the train function. Training can take several minutes to complete.

trainingStats = train(qAgent,env,trainOpts);

Display index of last episode.

trainingStats.EpisodeIndex(end)

ans = 100

Train Q-Learning Agent for 200 More Episodes

Set the training to stop after episode 300.

trainingStats.TrainingOptions.StopTrainingValue = 300;

Resume the training using the training data that exists in trainingStats.

trainingStats = train(qAgent,env,trainingStats);

Display index of last episode.

trainingStats.EpisodeIndex(end)

ans = 300

Plot episode reward.

figure()
plot(trainingStats.EpisodeIndex,trainingStats.EpisodeReward)
title('Episode Reward')
xlabel('EpisodeIndex')
ylabel('EpisodeReward')


Display the final Q-value table.

qAgentFinalQ = getLearnableParameters(getCritic(qAgent));
qAgentFinalQ{1}

ans = 25x4 single matrix

4.8373 10.0000 -1.3036 0.8020
9.2058 4.3147 11.0000 5.8501
10.0000 3.3987 4.5830 -6.4751
6.3569 6.0000 8.9971 5.4393
5.0433 5.8399 7.0067 4.1439
5.1031 8.5228 10.9936 0.1200
9.9616 8.8647 12.0000 10.0026
11.0000 8.4131 8.6974 6.0001
10.0000 6.9997 5.8122 8.5523
7.1164 7.0019 8.0000 5.8196

Validate Q-Learning Results

To validate the training results, simulate the agent in the training environment.

Before running the simulation, visualize the environment and configure the visualization to maintain a
trace of the agent states.


plot(env)
env.ResetFcn = @() 2;
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;

Simulate the agent in the environment using the sim function.

sim(qAgent,env)

Input Arguments
agents — Agents
agent object | array of agent objects

Agents to train, specified as a reinforcement learning agent object, such as rlACAgent or rlDDPGAgent, or as an array of such objects.

If env is a multi-agent environment created with rlSimulinkEnv, specify agents as an array. The
order of the agents in the array must match the agent order used to create env. Multi-agent training
is not supported for MATLAB environments.

Note train updates the agents as training progresses. This is possible because each agent is a
handle object. To preserve the original agent parameters for later use, save the agent to a MAT-file (if
you copy the agent into a new variable, the new variable will also always point to the most recent
agent version with updated parameters). For more information about handle objects, see “Handle
Object Behavior”.
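
For example, a minimal sketch of preserving the initial parameters before training (the file name is illustrative):

% Save the untrained agent to disk before calling train.
save("initialAgent.mat","agent")
% Later, reload the saved copy if you need the original parameters.
% originalAgent = load("initialAgent.mat").agent;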

Note When training terminates, agents reflects the state of each agent at the end of the final
training episode. The rewards obtained by the final agents are not necessarily the highest achieved
during the training process, due to continuous exploration. To save agents during training, create an
rlTrainingOptions object specifying the SaveAgentCriteria and SaveAgentValue properties and
pass it to train as a trainOpts argument.

For more information about how to create and configure agents for reinforcement learning, see
“Reinforcement Learning Agents”.

env — Environment
reinforcement learning environment object

Environment in which the agents act, specified as one of the following kinds of reinforcement
learning environment object:

• A predefined MATLAB or Simulink environment created using rlPredefinedEnv. This kind of environment does not support training multiple agents at the same time.
• A custom MATLAB environment you create with functions such as rlFunctionEnv or
rlCreateEnvTemplate. This kind of environment does not support training multiple agents at
the same time.
• A custom Simulink environment you create using rlSimulinkEnv. This kind of environment
supports training multiple agents at the same time.

For more information about creating and configuring environments, see:

• “Create MATLAB Reinforcement Learning Environments”


• “Create Simulink Reinforcement Learning Environments”

When env is a Simulink environment, calling train compiles and simulates the model associated
with the environment.

trainOpts — Training parameters and options


rlTrainingOptions object | rlMultiAgentTrainingOptions object

Training parameters and options, specified as either an rlTrainingOptions or an rlMultiAgentTrainingOptions object. Use this argument to specify such parameters and options as:

• Criteria for ending training


• Criteria for saving candidate agents
• How to display training progress
• Options for parallel computing

For details, see rlTrainingOptions and rlMultiAgentTrainingOptions.

prevTrainStats — Training episode data


rlTrainingResult object | array of rlTrainingResult objects


Training episode data, specified as an:

• rlTrainingResult object, when training a single agent.


• Array of rlTrainingResult objects when training multiple agents.

Use this argument to resume training from the exact point at which it stopped. This starts the
training from the last values of the agent parameters and training results object obtained after the
previous train function call. prevTrainStats contains, as one of its properties, the
rlTrainingOptions object or the rlMultiAgentTrainingOptions object specifying the training
option set. Therefore, to restart the training with updated training options, first change the training
options in trainResults using dot notation. If the maximum number of episodes was already
reached in the previous training session, you must increase the maximum number of episodes.
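
For example, assuming a previous call returned trainResults, a minimal sketch of updating an option and resuming training:

% Increase the episode budget, then resume training from where it stopped.
trainResults.TrainingOptions.MaxEpisodes = 2000;
trainResults = train(agent,env,trainResults);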

For details about the rlTrainingResult object properties, see the trainStats output argument.

Output Arguments
trainStats — Training episode data
rlTrainingResult object | array of rlTrainingResult objects

Training episode data, returned as an:

• rlTrainingResult object, when training a single agent.


• Array of rlTrainingResult objects when training multiple agents.

The following properties pertain to the rlTrainingResult object:

EpisodeIndex — Episode numbers


[1;2;…;N]

Episode numbers, returned as the column vector [1;2;…;N], where N is the number of episodes in
the training run. This vector is useful if you want to plot the evolution of other quantities from
episode to episode.

EpisodeReward — Reward for each episode


column vector

Reward for each episode, returned in a column vector of length N. Each entry contains the reward for
the corresponding episode.

EpisodeSteps — Number of steps in each episode


column vector

Number of steps in each episode, returned in a column vector of length N. Each entry contains the
number of steps in the corresponding episode.

AverageReward — Average reward over the averaging window


column vector

Average reward over the averaging window specified in trainOpts, returned as a column vector of
length N. Each entry contains the average reward computed at the end of the corresponding episode.


TotalAgentSteps — Total number of steps


column vector

Total number of agent steps in training, returned as a column vector of length N. Each entry contains
the cumulative sum of the entries in EpisodeSteps up to that point.

EpisodeQ0 — Critic estimate of long-term reward for each episode


column vector

Critic estimate of long-term reward using the current agent and the environment initial conditions,
returned as a column vector of length N. Each entry is the critic estimate (Q0) for the agent of the
corresponding episode. This field is present only for agents that have critics, such as rlDDPGAgent
and rlDQNAgent.

SimulationInfo — Information collected during simulation


structure | vector of Simulink.SimulationOutput objects

Information collected during the simulations performed for training, returned as:

• For training in MATLAB environments, a structure containing the field SimulationError. This
field is a column vector with one entry per episode. When the StopOnError option of
rlTrainingOptions is "off", each entry contains any errors that occurred during the
corresponding episode.
• For training in Simulink environments, a vector of Simulink.SimulationOutput objects
containing simulation data recorded during the corresponding episode. Recorded data for an
episode includes any signals and states that the model is configured to log, simulation metadata,
and any errors that occurred during the corresponding episode.

TrainingOptions — Training options set


rlTrainingOptions object | rlMultiAgentTrainingOptions object

Training options set, returned as:

• For a single agent, an rlTrainingOptions object. For more information, see the
rlTrainingOptions reference page.
• For multiple agents, an rlMultiAgentTrainingOptions object. For more information, see the
rlMultiAgentTrainingOptions reference page.

Tips
• train updates the agents as training progresses. To preserve the original agent parameters for
later use, save the agents to a MAT-file.
• By default, calling train opens the Reinforcement Learning Episode Manager, which lets you
visualize the progress of the training. The Episode Manager plot shows the reward for each
episode, a running average reward value, and the critic estimate Q0 (for agents that have critics).
The Episode Manager also displays various episode and training statistics. To turn off the
Reinforcement Learning Episode Manager, set the Plots option of trainOpts to "none".
• If you use a predefined environment for which there is a visualization, you can use plot(env) to
visualize the environment. If you call plot(env) before training, then the visualization updates
during training to allow you to visualize the progress of each episode. (For custom environments,
you must implement your own plot method.)


• Training terminates when the conditions specified in trainOpts are satisfied. To terminate
training in progress, in the Reinforcement Learning Episode Manager, click Stop Training.
Because train updates the agent at each episode, you can resume training by calling
train(agent,env,trainOpts) again, without losing the trained parameters learned during the
first call to train.
• During training, you can save candidate agents that meet conditions you specify with trainOpts.
For instance, you can save any agent whose episode reward exceeds a certain value, even if the
overall condition for terminating training is not yet satisfied. train stores saved agents in a MAT-
file in the folder you specify with trainOpts. Saved agents can be useful, for instance, to allow
you to test candidate agents generated during a long-running training process. For details about
saving criteria and saving location, see rlTrainingOptions.

Algorithms
In general, train performs the following iterative steps:

1 Initialize agent.
2 For each episode:

a Reset the environment.


b Get the initial observation s0 from the environment.
c Compute the initial action a0 = μ(s0).
d Set the current action to the initial action (a←a0) and set the current observation to the initial
observation (s←s0).
e While the episode is not finished or terminated:

i Step the environment with action a to obtain the next observation s' and the reward r.
ii Learn from the experience set (s,a,r,s').
iii Compute the next action a' = μ(s').
iv Update the current action with the next action (a←a') and update the current
observation with the next observation (s←s').
v Break if the episode termination conditions defined in the environment are met.
3 If the training termination condition defined by trainOpts is met, terminate training.
Otherwise, begin the next episode.

The specifics of how train performs these computations depends on your configuration of the agent
and environment. For instance, resetting the environment at the start of each episode can include
randomizing initial state values, if you configure your environment to do so.
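
The following is a schematic sketch of this loop for a MATLAB environment whose reset and step functions are public; it is a simplification for illustration only, not the actual train implementation (the learning update happens inside the agent, and maxEpisodes is a placeholder).

% Schematic sketch only -- not the actual train implementation.
for episode = 1:maxEpisodes
    obs = reset(env);                            % reset environment, get s0
    act = getAction(agent,{obs}); act = act{1};  % a0 = mu(s0)
    isDone = false;
    while ~isDone
        [nextObs,reward,isDone] = step(env,act); % step the environment
        % the agent learns internally from the experience (s,a,r,s')
        act = getAction(agent,{nextObs}); act = act{1};
        obs = nextObs;
    end
end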

Version History
Introduced in R2019a

train now returns an object instead of a structure


Behavior change in future release

Starting in R2022a, train returns an object or an array of objects instead of a structure. The
properties of the object match the fields of the structure returned in previous versions. Therefore, the
code based on dot notation works in the same way.


For example, if you train an agent using the following command:

trainStats = train(agent,env,trainOptions);

When training terminates, either because a termination condition is reached or because you click
Stop Training in the Reinforcement Learning Episode Manager, trainStats is returned as an
rlTrainingResult object.

The rlTrainingResult object contains the same training statistics previously returned in a
structure along with data to correctly recreate the training scenario and update the episode manager.

You can use trainStats as the third argument for another train call, which (when executed with the
same agents and environment) will cause training to resume from the exact point at which it stopped.

Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

To train in parallel, set the UseParallel and ParallelizationOptions options in the option set
trainOpts. Parallel training is not supported for multi-agent environments. For more information,
see rlTrainingOptions.

See Also
rlTrainingOptions | sim | rlMultiAgentTrainingOptions

Topics
“Train Reinforcement Learning Agents”


validateEnvironment
Package: rl.env

Validate custom reinforcement learning environment

Syntax
validateEnvironment(env)

Description
validateEnvironment(env) validates a reinforcement learning environment. This function is
useful when:

• You are using a custom environment for which you supplied your own step and reset functions,
such as an environment created using rlCreateEnvTemplate.
• You are using an environment you created from a Simulink model using rlSimulinkEnv.

validateEnvironment resets the environment, generates an initial observation and action, and
simulates the environment for one or two steps (see “Algorithms” on page 2-251). If there are no
errors during these operations, validation is successful, and validateEnvironment returns no
result. If errors occur, these errors appear in the MATLAB command window. Use the errors to
determine what to change in your observation specification, action specification, custom functions, or
Simulink model.
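
For instance, a minimal sketch of validating a custom environment built from a template class (here a hypothetical class named MyEnvironment, created and edited using rlCreateEnvTemplate):

env = MyEnvironment;       % instantiate the custom environment class (hypothetical name)
validateEnvironment(env)   % any validation errors appear in the Command Window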

Examples

Validate Simulink Environment

This example shows how to validate a Simulink environment.

Create and validate an environment for the rlwatertank model, which represents a control
system containing a reinforcement learning agent. (For details about this model, see “Create Simulink
Environment and Train Agent”.)

open_system('rlwatertank')

Create observation and action specifications for the environment.

obsInfo = rlNumericSpec([3 1],...
    'LowerLimit',[-inf -inf 0  ]',...
    'UpperLimit',[ inf  inf inf]');
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and measured height';
numObservations = obsInfo.Dimension(1);

actInfo = rlNumericSpec([1 1]);
actInfo.Name = 'flow';
numActions = numel(actInfo);


Create an environment from the model.

env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent',obsInfo,actInfo);

Now, use validateEnvironment to check whether the model is configured correctly.

validateEnvironment(env)

Error using rl.env.SimulinkEnvWithAgent/validateEnvironment (line 187)
Simulink environment validation requires an agent in the MATLAB base workspace
or in a data dictionary linked to the model. Specify the agent in the Simulink model.

validateEnvironment attempts to compile the model, initialize the environment and the agent,
and simulate the model. In this case, the RL Agent block is configured to use an agent called agent,
but no such variable exists in the MATLAB® workspace. Thus, the function returns an error
indicating the problem.

Create an appropriate agent for this system using the commands detailed in the “Create Simulink
Environment and Train Agent” example. In this case, load the agent from the
rlWaterTankDDPGAgent.mat file.

load rlWaterTankDDPGAgent

Now, run validateEnvironment again.

validateEnvironment(env)

Input Arguments
env — Environment to validate
environment object

Environment to validate, specified as a reinforcement learning environment object, such as:

• A custom MATLAB environment you create with rlCreateEnvTemplate. In this case, validateEnvironment checks that the observations and actions generated during simulation of
the environment are consistent in size, data type, and value range with the observation
specification and action specification. It also checks that your custom step and reset functions
run without error. (When you create a custom environment using rlFunctionEnv, the software
runs validateEnvironment automatically.)
• A custom Simulink environment you create using rlSimulinkEnv. If you use a Simulink
environment, you must also have an agent defined and associated with the RL Agent block in the
model. For a Simulink model, validateEnvironment checks that the model compiles and runs
without error. The function does not dirty your model.

For more information about creating and configuring environments, see:

• “Create MATLAB Reinforcement Learning Environments”


• “Create Simulink Reinforcement Learning Environments”


Algorithms
validateEnvironment works by running a brief simulation of the environment and making sure
that the generated signals match the observation and action specifications you provided when you
created the environment.

MATLAB Environments

For MATLAB environments, validation includes the following steps.

1 Reset the environment using the reset function associated with the environment.
2 Obtain the first observation and check whether it is consistent with the dimension, data type, and
range of values in the observation specification.
3 Generate a test action based on the dimension, data type, and range of values in the action
specification.
4 Simulate the environment for one step using the generated action and the step function
associated with the environment.
5 Obtain the new observation signal and check whether it is consistent with the dimension, data
type, and range of values in the observation specification.

If any of these operations generates an error, validateEnvironment returns the error. If validateEnvironment returns no result, then validation is successful.

Simulink Environments

For Simulink environments, validation includes the following steps.

1 Reset the environment.


2 Simulate the model for two time steps.

If any of these operations generates an error, validateEnvironment returns the error. If validateEnvironment returns no result, then validation is successful.

validateEnvironment performs these steps without dirtying the model, and leaves all model
parameters in the state they were in when you called the function.

Version History
Introduced in R2019a

See Also
rlCreateEnvTemplate | rlSimulinkEnv | rlFunctionEnv

Topics
“Create Simulink Environment and Train Agent”
“Create Custom MATLAB Environment from Template”


write
Package: rl.logging

Transfer stored data from the internal logger memory to the logging target

Syntax
write(lgr)

Description
write(lgr) transfers all stored logging contexts into the logger's target (either a MAT file or a
trainingProgressMonitor object).

Examples

Log Data to Disk in a Custom Training Loop

This example shows how to log data to disk when training an agent using a custom training loop.

Create a FileLogger object using rlDataLogger.

flgr = rlDataLogger();

Set up the logger object. This operation initializes the object, performing setup tasks such as creating the directory where the data files are saved.

setup(flgr);

Within a custom training loop, you can now store data to the logger object memory and write data to
file.

For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.

for iter = 1:10

    % Store three random numbers in memory
    % as elements of the variable "Context1"
    for ct = 1:3
        store(flgr, "Context1", rand, iter);
    end

    % Store a random number in memory
    % as the variable "Context2"
    store(flgr, "Context2", rand, iter);

    % Write data to file every 4 iterations
    if mod(iter,4)==0
        write(flgr);
    end

end

Clean up the logger object. This operation performs cleanup tasks such as writing to file any data still in memory.

cleanup(flgr);

Input Arguments
lgr — Data logger object
FileLogger object | MonitorLogger object | ...

Data logger object, specified as either a FileLogger or a MonitorLogger object.

Version History
Introduced in R2022b

See Also
Functions
rlDataLogger | train | setup | store | cleanup

Objects
FileLogger | MonitorLogger

Topics
“Log Training Data To Disk”
“Function Handles”
“Handle Object Behavior”

3

Objects

quadraticLayer
Quadratic layer for actor or critic network

Description
A quadratic layer takes an input vector and outputs a vector of quadratic monomials constructed from
the input elements. This layer is useful when you need a layer whose output is a quadratic function of
its inputs, for example, to recreate the structure of quadratic value functions such as those used in
LQR controller design.

For example, consider an input vector U = [u1 u2 u3]. For this input, a quadratic layer gives the
output Y = [u1*u1 u1*u2 u2*u2 u1*u3 u2*u3 u3*u3]. For an example that uses a
QuadraticLayer, see “Train DDPG Agent to Control Double Integrator System”.

Note The QuadraticLayer layer does not support inputs coming directly or indirectly from a
featureInputLayer or sequenceInputLayer.

The parameters of a QuadraticLayer object are not learnable.

Creation

Syntax
qLayer = quadraticLayer
qLayer = quadraticLayer(Name,Value)

Description

qLayer = quadraticLayer creates a quadratic layer with default property values.

qLayer = quadraticLayer(Name,Value) sets properties on page 3-2 using name-value pairs.


For example, quadraticLayer('Name','quadlayer') creates a quadratic layer and assigns the
name 'quadlayer'.

Properties
Name — Name of layer
'quadratic' (default) | character vector

Name of layer, specified as a character vector. To include a layer in a layer graph, you must specify a
nonempty unique layer name. If you train a series network with this layer and Name is set to '', then
the software automatically assigns a name to the layer at training time.

Description — Description of layer


'quadratic layer' (default) | character vector


This property is read-only.

Description of layer, specified as a character vector. When you create the quadratic layer, you can use
this property to give it a description that helps you identify its purpose.

Examples

Create Quadratic Layer

Create a quadratic layer that converts an input vector U into a vector of quadratic monomials
constructed from binary combinations of the elements of U.

qLayer = quadraticLayer

qLayer =
QuadraticLayer with properties:

Name: 'quadratic'

Learnable Parameters
No properties.

State Parameters
No properties.

Show all properties

Confirm that the layer produces the expected output. For instance, for U = [u1 u2 u3], the
expected output is [u1*u1 u1*u2 u2*u2 u1*u3 u2*u3 u3*u3].

predict(qLayer,[1 2 3])

ans = 1×6

     1     2     4     3     6     9

You can incorporate qLayer into an actor network or critic network for reinforcement learning.
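
For instance, the following is a minimal sketch of a critic network that is quadratic in a two-dimensional observation. The layer names and the use of an image input layer (which avoids the unsupported featureInputLayer) are illustrative assumptions, not the layout used in the shipped examples.

% Sketch of a value network whose output is quadratic in a 2-by-1 observation.
criticNet = [
    imageInputLayer([2 1 1],'Normalization','none','Name','state')
    quadraticLayer('Name','quad')
    fullyConnectedLayer(1,'Name','value')
    ];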

Version History
Introduced in R2019a

See Also
scalingLayer | softplusLayer

Topics
“Train DDPG Agent to Control Double Integrator System”
“Create Policies and Value Functions”


FileLogger
Log reinforcement learning training data to MAT files

Description
Use a FileLogger object to log data to MAT files, within the train function or inside a custom
training loop. To log data when using the train function, specify appropriate callback functions in
FileLogger, as shown in the examples. These callbacks are executed at different stages of training,
for example, EpisodeFinishedFcn is executed after the completion of an episode. The output of a
callback function is a structure containing the data to log at that stage of training.

Note FileLogger is a handle object. If you assign an existing FileLogger object to a new
FileLogger object, both the new object and the original one refer to the same underlying object in
memory. To preserve the original object parameters for later use, save the object to a MAT-file. For
more information about handle objects, see “Handle Object Behavior”.

Creation
Create a FileLogger object using rlDataLogger without any input arguments.

Properties
LoggingOptions — Object containing a set of logging options
MATFileLoggingOptions object (default)

Object containing a set of logging options, returned as a MATFileLoggingOptions object. A MATFileLoggingOptions object has the following properties, which you can access using dot notation after creating the FileLogger object.

LoggingDirectory — Name of the logging directory


"logs" subdirectory of the current directory (default) | string or character array

Name or fully qualified path of the logging directory, specified as a string or a character array. This is
the name of the directory where the MAT files containing the logged data are saved. As a default, a
subdirectory called logs is created in the current folder during setup and files are saved there
during training.

FileNameRule — Rule to name the MAT files


"loggedData<id>" (default) | string | char array

Rule to name the MAT files, specified as a string or a character array. For example, the naming rule
"episode<id>" results in the file names episode001.mat, episode002.mat and so on.

Version — MAT file version

"-v7" (default) | string | char array

MAT file version, specified as a string or character array. The default is "-v7". For more information,
see “MAT-File Versions”.

UseCompression — Option to use compression


true (default) | false

Option to use compression when saving data to a MAT file, specified as a logical variable. The default
is true. For more information, see “MAT-File Versions”.

DataWriteFrequency — Frequency for writing data to disk


1 (default) | positive integer

Frequency for writing data to disk, specified as a positive integer. It is the number of episodes after
which data is saved to disk. The default is 1.

MaxEpisodes — Maximum number of episodes


500 (default) | positive integer

Maximum number of episodes, specified as a positive integer. When using train, the value is
automatically initialized. Set this value when using the logger object in a custom training loop. The
default is 500.

EpisodeFinishedFcn — Callback to log data after episode completion


[] (default) | function handle

Callback to log data after episode completion, specified as a function handle object. The specified
function must return a structure containing the data to log, such as experiences, simulation
information, or initial observations.
Example: @myEpisodeLoggingFcn

AgentStepFinishedFcn — Callback to log data after training step completion


[] (default) | function handle | cell array of function handles

Callback to log data after training step completion within an episode, specified as a function handle
object. The specified function must return a structure containing the data to log, such as the state of
the agent's exploration policy.

For multi-agent training, AgentStepFinishedFcn can be a cell array of function handles with as
many elements as the number of agent groups.

Note Logging data using the AgentStepFinishedFcn callback is not supported when training
agents in parallel with the train function.

Example: @myAgentStepLoggingFcn

AgentLearnFinishedFcn — Callback to log data after completion of the learn subroutine


[] (default) | function handle | cell array of function handles

Callback to log data after completion of the learn subroutine, specified as a function handle object.
The specified function must return a structure containing the data to log, such as the training losses
of the actor and critic networks, or, for a model-based agent, the environment model training losses.


For multi-agent training, AgentLearnFinishedFcn can be a cell array of function handles with as
many elements as the number of agent groups.
Example: @myLearnLoggingFcn

Object Functions
setup Set up reinforcement learning environment or initialize data logger object
cleanup Clean up reinforcement learning environment or data logger object

Examples

Log Data to Disk during Built-in Training

This example shows how to log data to disk when using train.

Create a FileLogger object using rlDataLogger.

logger = rlDataLogger();

Specify a directory to save logged data.

logger.LoggingOptions.LoggingDirectory = "myDataLog";

Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object. For a related example, see “Log
Training Data To Disk”.

logger.EpisodeFinishedFcn = @myEpisodeFinishedFcn;
logger.AgentStepFinishedFcn = @myAgentStepFinishedFcn;
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;

To train the agent, you can now call train, passing logger as an argument, as in the following
command.

trainResult = train(agent, env, trainOpts, Logger=logger);

While the training progresses, data will be logged to the specified directory, according to the rule
specified in the FileNameRule property of logger.LoggingOptions.

logger.LoggingOptions.FileNameRule

ans =
"loggedData<id>"

Example Logging Functions

Episode completion logging function.

function dataToLog = myEpisodeFinishedFcn(data)
    dataToLog.Experience = data.Experience;
end

Agent step completion logging function.


function dataToLog = myAgentStepFinishedFcn(data)
    dataToLog.State = getState(getExplorationPolicy(data.Agent));
end

Learn subroutine completion logging function.

function dataToLog = myAgentLearnFinishedFcn(data)
    dataToLog.ActorLoss = data.ActorLoss;
    dataToLog.CriticLoss = data.CriticLoss;
end

Log Data to Disk in a Custom Training Loop

This example shows how to log data to disk when training an agent using a custom training loop.

Create a FileLogger object using rlDataLogger.


flgr = rlDataLogger();

Set up the logger object. This operation initializes the object, performing setup tasks such as creating the directory where the data files are saved.

setup(flgr);

Within a custom training loop, you can now store data to the logger object memory and write data to
file.

For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10

    % Store three random numbers in memory
    % as elements of the variable "Context1"
    for ct = 1:3
        store(flgr, "Context1", rand, iter);
    end

    % Store a random number in memory
    % as the variable "Context2"
    store(flgr, "Context2", rand, iter);

    % Write data to file every 4 iterations
    if mod(iter,4)==0
        write(flgr);
    end

end

Clean up the logger object. This operation performs cleanup tasks such as writing to file any data still in memory.

cleanup(flgr);


Limitations
• Logging data using the AgentStepFinishedFcn callback is not supported when training agents
in parallel with the train function.

Version History
Introduced in R2022b

See Also
Functions
rlDataLogger | train | setup | store | write | cleanup

Objects
MonitorLogger

Topics
“Log Training Data To Disk”
“Function Handles”
“Handle Object Behavior”


MonitorLogger
Log reinforcement learning training data to monitor window

Description
Use a MonitorLogger object to log data to a monitor window, within the train function or inside a
custom training loop. To log data when using the train function, specify appropriate callback
functions in MonitorLogger, as shown in the examples. These callbacks are executed at different
stages of training, for example, EpisodeFinishedFcn is executed after the completion of an
episode. The output of a callback function is a structure containing the data to log at that stage of
training.

Note MonitorLogger is a handle object. If you assign an existing MonitorLogger object to a new
MonitorLogger object, both the new object and the original one refer to the same underlying object
in memory. To preserve the original object parameters for later use, save the object to a MAT-file. For
more information about handle objects, see “Handle Object Behavior”.

Creation
Create a MonitorLogger object using rlDataLogger specifying a trainingProgressMonitor
object as input argument.

Properties
LoggingOptions — Object containing a set of logging options
MonitorLoggingOptions object (default)

Object containing a set of logging options, returned as a MonitorLoggingOptions object. A MonitorLoggingOptions object has the following properties, which you can access using dot notation after creating the MonitorLogger object.

DataWriteFrequency — Frequency for writing data to the monitor window


1 (default) | positive integer

Frequency for writing data to the monitor window, specified as a positive integer. It is the number of
episodes after which data is transmitted to the trainingProgressMonitor object. The default is 1.
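
For example, assuming a monitor logger named logger, a minimal sketch of reducing how often the monitor window is updated:

monitor = trainingProgressMonitor();
logger = rlDataLogger(monitor);
logger.LoggingOptions.DataWriteFrequency = 10;  % update the monitor every 10 episodes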

MaxEpisodes — Maximum number of episodes


500 (default) | positive integer

Maximum number of episodes, specified as a positive integer. When using train, the value is
automatically initialized. Set this value when using the logger object in a custom training loop. The
default is 500.

EpisodeFinishedFcn — Callback to log data after episode completion


[] (default) | function handle


Callback to log data after episode completion, specified as a function handle object. The specified
function must return a structure containing the data to log, such as experiences, simulation
information, or initial observations.
Example: @myEpisodeLoggingFcn

AgentStepFinishedFcn — Callback to log data after training step completion


[] (default) | function handle | cell array of function handles

Callback to log data after training step completion within an episode, specified as a function handle
object. The specified function must return a structure containing the data to log, such as the state of
the agent's exploration policy.

For multi-agent training, AgentStepFinishedFcn can be a cell array of function handles with as
many elements as the number of agent groups.

Note Logging data using the AgentStepFinishedFcn callback is not supported when training
agents in parallel with the train function.

Example: @myAgentStepLoggingFcn

AgentLearnFinishedFcn — Callback to log data after completion of the learn subroutine


[] (default) | function handle | cell array of function handles

Callback to log data after completion of the learn subroutine, specified as a function handle object.
The specified function must return a structure containing the data to log, such as the training losses
of the actor and critic networks, or, for a model-based agent, the environment model training losses.

For multi-agent training, AgentLearnFinishedFcn can be a cell array of function handles with as
many elements as the number of agent groups.
Example: @myLearnLoggingFcn

Object Functions
setup Set up reinforcement learning environment or initialize data logger object
cleanup Clean up reinforcement learning environment or data logger object

Examples

Log and Visualize Data with a Training Progress Monitor Object

This example shows how to log and visualize data to the window of a trainingProgressMonitor
object when using train.

Create a trainingProgressMonitor object. Creating the object also opens a window associated
with the object.

monitor = trainingProgressMonitor();


Create a MonitorLogger object using rlDataLogger.

logger = rlDataLogger(monitor);

Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object.

logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;

To train the agent, you can now call train, passing logger as an argument, as in the following
command.

trainResult = train(agent, env, trainOpts, Logger=logger);

While the training progresses, data will be logged to the training monitor object, and visualized in the
associated window.

Note that only scalar data can be logged with a monitor logger object.

Example Logging Functions

Define a logging function that logs data periodically at the completion of the learning subroutine.

function dataToLog = myAgentLearnFinishedFcn(data)
    if mod(data.AgentLearnCount, 2) == 0
        dataToLog.ActorLoss = data.ActorLoss;
        dataToLog.CriticLoss = data.CriticLoss;
    else
        dataToLog = [];
    end
end

Limitations
• Only scalar data is supported when logging data with a MonitorLogger object. The structure
returned by the callback functions must contain fields with scalar data.
• Resuming of training from a previous training result is not supported when logging data with a
MonitorLogger object.

Version History
Introduced in R2022b

See Also
Functions
rlDataLogger | train | setup | store | write | cleanup

Objects
FileLogger | trainingProgressMonitor

Topics
“Log Training Data To Disk”
“Monitor Custom Training Loop Progress”
“Function Handles”
“Handle Object Behavior”


rlACAgent
Actor-critic reinforcement learning agent

Description
Actor-critic (AC) agents implement actor-critic algorithms such as A2C and A3C, which are model-
free, online, on-policy reinforcement learning methods. The actor-critic agent optimizes the policy
(actor) directly and uses a critic to estimate the return or future rewards. The action space can be
either discrete or continuous.

For more information, see “Actor-Critic Agents”. For more information on the different types of
reinforcement learning agents, see “Reinforcement Learning Agents”.

Creation

Syntax
agent = rlACAgent(observationInfo,actionInfo)
agent = rlACAgent(observationInfo,actionInfo,initOpts)

agent = rlACAgent(actor,critic)

agent = rlACAgent( ___ ,agentOptions)

Description

Create Agent from Observation and Action Specifications

agent = rlACAgent(observationInfo,actionInfo) creates an actor-critic agent for an


environment with the given observation and action specifications, using default initialization options.
The actor and critic in the agent use default deep neural networks built from the observation
specification observationInfo and the action specification actionInfo. The ObservationInfo
and ActionInfo properties of agent are set to the observationInfo and actionInfo input
arguments, respectively.

agent = rlACAgent(observationInfo,actionInfo,initOpts) creates an actor-critic agent


for an environment with the given observation and action specifications. The agent uses default
networks in which each hidden fully connected layer has the number of units specified in the
initOpts object. Actor-critic agents do not support recurrent neural networks. For more information
on the initialization options, see rlAgentInitializationOptions.

Create Agent from Actor and Critic

agent = rlACAgent(actor,critic) creates an actor-critic agent with the specified actor and
critic, using the default options for the agent.


Specify Agent Options

agent = rlACAgent( ___ ,agentOptions) creates an actor-critic agent and sets the
AgentOptions property to the agentOptions input argument. Use this syntax after any of the input
arguments in the previous syntaxes.

Input Arguments

initOpts — Agent initialization options


rlAgentInitializationOptions object

Agent initialization options, specified as an rlAgentInitializationOptions object. Actor-critic agents do not support recurrent neural networks.

actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object

Actor that implements the policy, specified as an rlDiscreteCategoricalActor or rlContinuousGaussianActor function approximator object. For more information on creating
actor approximators, see “Create Policies and Value Functions”.

critic — Critic
rlValueFunction object

Critic that estimates the discounted long-term reward, specified as an rlValueFunction object. For
more information on creating critic approximators, see “Create Policies and Value Functions”.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of specification objects defining properties such as dimensions, data type, and names of the observation signals.

If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.

You can extract observationInfo from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specification


specification object

Action specifications, specified as a reinforcement learning specification object defining properties such as dimensions, data type, and names of the action signals.

For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.

For a continuous action space, you must specify actionInfo as an rlNumericSpec object.

If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.


You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

AgentOptions — Agent options


rlACAgentOptions object

Agent options, specified as an rlACAgentOptions object.

UseExplorationPolicy — Option to use exploration policy


true (default) | false

Option to use exploration policy when selecting actions, specified as a one of the following logical
values.

• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.

Note This option affects only simulation and deployment; it does not affect training.
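
For example, a minimal sketch of making the simulated policy deterministic for an existing agent:

agent.UseExplorationPolicy = false;   % greedy (maximum-likelihood) actions in sim and generatePolicyFunction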

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object


Examples

Create Discrete Actor-Critic Agent from Observation and Action Specifications

Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).

env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

Obtain observation and action specifications

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.

rng(0)

Create an actor-critic agent from the environment observation and action specifications.

agent = rlACAgent(obsInfo,actInfo);

To check your agent, use getAction to return the action from random observations.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[-2]}

You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.

Create Continuous Actor-Critic Agent Using Initialization Options

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.

% load predefined environment


env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");

% obtain observation and action specifications


obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);


Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256). Actor-critic agents do not
support recurrent networks, so setting the UseRNN option to true generates an error when the agent
is created.

initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);

The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.

rng(0)

Create an actor-critic agent from the environment observation and action specifications.

agent = rlACAgent(obsInfo,actInfo,initOpts);

Extract the deep neural networks from both the agent actor and critic.

actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));

Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons.

criticNet.Layers

ans =
11x1 Layer array with layers:

1 'concat' Concatenation Concatenation of 2 inputs along dimension 1


2 'relu_body' ReLU ReLU
3 'fc_body' Fully Connected 128 fully connected layer
4 'body_output' ReLU ReLU
5 'input_1' Image Input 50x50x1 images
6 'conv_1' 2-D Convolution 64 3x3x1 convolutions with stride [1 1] and padding
7 'relu_input_1' ReLU ReLU
8 'fc_1' Fully Connected 128 fully connected layer
9 'input_2' Feature Input 1 features
10 'fc_2' Fully Connected 128 fully connected layer
11 'output' Fully Connected 1 fully connected layer

Plot the actor and critic networks.

plot(layerGraph(actorNet))


plot(layerGraph(criticNet))


To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[0.9228]}

You can now test and train the agent within the environment.

Create Discrete Actor-Critic Agent from Actor and Critic

Create an environment with a discrete action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DQN Agent to
Balance Cart-Pole System”. This environment has a four-dimensional observation vector (cart position
and velocity, pole angle, and pole angle derivative), and a scalar action with two possible elements (a
force of either -10 or +10 N applied on the cart).
env = rlPredefinedEnv("CartPole-Discrete");

Obtain observation and action specifications.


obsInfo = getObservationInfo(env)

obsInfo =
rlNumericSpec with properties:


LowerLimit: -Inf
UpperLimit: Inf
Name: "CartPole States"
Description: "x, dx, theta, dtheta"
Dimension: [4 1]
DataType: "double"

actInfo = getActionInfo(env)

actInfo =
rlFiniteSetSpec with properties:

Elements: [-10 10]


Name: "CartPole Action"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"

The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.

rng(0)

For actor-critic agents, the critic estimates a value function; therefore, it must take the observation
signal as input and return a scalar value.

To approximate the value function within the critic, use a deep neural network. Define the network as
an array of layer objects. Get the dimensions of the observation space from the environment
specification objects.

cnet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(1)];

Convert the network to a dlnetwork object, and display the number of weights.

cnet = dlnetwork(cnet);
summary(cnet)

Initialized: true

Number of learnables: 301

Inputs:
1 'input' 4 features

Create the critic. Actor-critic agents use an rlValueFunction object to implement the critic.

critic = rlValueFunction(cnet,obsInfo);

Check your critic with a random observation input.

getValue(critic,{rand(obsInfo.Dimension)})


ans = single
-0.1411

Create a deep neural network to be used as approximation model within the actor. For actor-critic
agents, the actor executes a stochastic policy, which for discrete action spaces is implemented by a
discrete categorical actor. In this case the network must take the observation signal as input and
return a probability for each action. Therefore the output layer must have as many elements as the
number of possible actions.
anet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];

Convert the network to a dlnetwork object, and display the number of weights.
anet = dlnetwork(anet);
summary(anet)

Initialized: true

Number of learnables: 352

Inputs:
1 'input' 4 features

Create the actor. Actor-critic agents use an rlDiscreteCategoricalActor object to implement


the actor for discrete action spaces.
actor = rlDiscreteCategoricalActor(anet,obsInfo,actInfo);

Check your actor with a random observation input.


getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-10]}

Create the AC agent using the actor and the critic.


agent = rlACAgent(actor,critic)

agent =
rlACAgent with properties:

AgentOptions: [1x1 rl.option.rlACAgentOptions]


UseExplorationPolicy: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 1

Specify some options for the agent, including training options for the actor and critic.
agent.AgentOptions.NumStepsToLookAhead=32;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=8e-3;


agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold=1;

Check your agent with a random observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-10]}

You can now test and train the agent within the environment.

Create Continuous Actor-Critic Agent from Actor and Critic

Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Obtain observation and action specifications.

obsInfo = getObservationInfo(env)

obsInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"

actInfo = getActionInfo(env)

actInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"

In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Store these limits so you can easily access them later.


% Make sure action space upper and lower limits are finite
actInfo.LowerLimit=-2;
actInfo.UpperLimit=2;

The actor and critic networks are initialized randomly. You can ensure reproducibility by fixing the
seed of the random generator.

rng(0)

For actor-critic agents, the critic estimates a value function; therefore, it must take the observation
signal as input and return a scalar value. To approximate the value function within the critic, use a
deep neural network.

Define the network as an array of layer objects, and get the dimensions of the observation space from
the environment specification object.

cNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(1)];

Convert the network to a dlnetwork object and display the number of weights.

cNet = dlnetwork(cNet);
summary(cNet)

Initialized: true

Number of learnables: 201

Inputs:
1 'input' 2 features

Create the critic using cNet. Actor-critic agents use an rlValueFunction object to implement the
critic.

critic = rlValueFunction(cNet,obsInfo);

Check your critic with a random input observation.

getValue(critic,{rand(obsInfo.Dimension)})

ans = single
-0.0969

To approximate the policy within the actor, use a deep neural network. For actor-critic agents, the
actor executes a stochastic policy, which for continuous action spaces is implemented by a continuous
Gaussian actor. In this case the network must take the observation signal as input and return both a
mean value and a standard deviation value for each action. Therefore, it must have two output layers
(one for the mean values and the other for the standard deviation values), each having as many elements
as the dimension of the action space.

Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.


Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input and output
layers, so you can later explicitly associate them with the appropriate channel.

% Input path
inPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netObsIn")
fullyConnectedLayer(prod(actInfo.Dimension),Name="infc")
];

% Mean value path


meanPath = [
tanhLayer(Name="tanhMean");
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension));
scalingLayer( ...
Name="netMout", ...
Scale=actInfo.UpperLimit) % scale to range
];

% Standard deviation path


sdevPath = [
tanhLayer(Name="tanhStdv");
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension));
softplusLayer(Name="netSDout") % nonnegative
];

% Add layers to network object


aNet = layerGraph;
aNet = addLayers(aNet,inPath);
aNet = addLayers(aNet,meanPath);
aNet = addLayers(aNet,sdevPath);

% Connect layers
aNet = connectLayers(aNet,"infc","tanhMean/in");
aNet = connectLayers(aNet,"infc","tanhStdv/in");

% Plot network
plot(aNet)


Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).

aNet = dlnetwork(aNet);
summary(aNet)

Initialized: true

Number of learnables: 305

Inputs:
1 'netObsIn' 2 features

Create the actor. Actor-critic agents use an rlContinuousGaussianActor object to implement the
actor for continuous action spaces.

actor = rlContinuousGaussianActor(aNet, obsInfo, actInfo, ...


ActionMeanOutputNames="netMout",...
ActionStandardDeviationOutputNames="netSDout",...
ObservationInputNames="netObsIn");

Check your actor with a random input observation.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-1.2332]}


Create the AC agent using the actor and the critic.

agent = rlACAgent(actor,critic);

Specify agent options, including training options for its actor and critic.

agent.AgentOptions.NumStepsToLookAhead = 32;
agent.AgentOptions.DiscountFactor=0.99;

agent.AgentOptions.CriticOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;

agent.AgentOptions.ActorOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold=1;

Check your agent using a random input observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-1.5401]}

You can now test and train the agent within the environment.

Create a Discrete Actor-Critic Agent with Recurrent Neural Networks

For this example load the predefined environment used for the “Train DQN Agent to Balance Cart-
Pole System” example. This environment has a four-dimensional observation vector (cart position and
velocity, pole angle, and pole angle derivative), and a scalar action with two possible elements (a
force of either -10 or +10 N applied on the cart).

env = rlPredefinedEnv("CartPole-Discrete");

Get observation and action information.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.

rng(0)

For actor-critic agents, the critic estimates a value function; therefore, it must take the observation
signal as input and return a scalar value.

To approximate the value function within the critic, use a recurrent deep neural network. Define the
network as an array of layer objects, and get the dimensions of the observation space from the
environment specification object. To create a recurrent network, use a sequenceInputLayer as the
input layer and include an lstmLayer as one of the other network layers.

cNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(10)
reluLayer
fullyConnectedLayer(1)];

Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).

cNet = dlnetwork(cNet);
summary(cNet)

Initialized: true

Number of learnables: 611

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions

Create the critic using cNet. Actor-critic agents use an rlValueFunction object to implement the
critic.

critic = rlValueFunction(cNet,obsInfo);

Check the critic with a random input observation.

getValue(critic,{rand(obsInfo.Dimension)})

ans = single
-0.0344

Since the critic has a recurrent network, the actor must use a recurrent network too. For actor-critic
agents, the actor executes a stochastic policy, which for discrete action spaces is implemented
by a discrete categorical actor. In this case the network must take the observation signal as input and
return a probability for each action. Therefore the output layer must have as many elements as the
number of possible actions.

aNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(20)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];

Convert the network to a dlnetwork object and display the number of weights.

aNet = dlnetwork(aNet);
summary(aNet)

Initialized: true

Number of learnables: 2k

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions

Create the actor using aNet. Actor-critic agents use an rlDiscreteCategoricalActor object to
implement the actor for discrete action spaces.

actor = rlDiscreteCategoricalActor(aNet,obsInfo,actInfo);

Check the actor with a random input observation.


getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[10]}

Set some training options for the critic.

criticOpts = rlOptimizerOptions( ...


LearnRate=8e-3,GradientThreshold=1);

Set some training options for the actor.

actorOpts = rlOptimizerOptions( ...


LearnRate=8e-3,GradientThreshold=1);

Specify agent options, and create an AC agent using the actor, the critic, and the agent options
object. Since the agent uses recurrent neural networks, NumStepsToLookAhead is treated as the
training trajectory length.

agentOpts = rlACAgentOptions( ...


NumStepsToLookAhead=32, ...
DiscountFactor=0.99, ...
CriticOptimizerOptions=criticOpts, ...
ActorOptimizerOptions=actorOpts);
agent = rlACAgent(actor,critic,agentOpts);

To check your agent, return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[10]}

You can now test and train the agent within the environment.

Configure Options for A3C Training

To train an agent using the asynchronous advantage actor-critic (A3C) method, you must set the
agent and parallel training options appropriately.

When creating the AC agent, set the NumStepsToLookAhead value to be greater than 1. Common
values are 64 and 128.

agentOpts = rlACAgentOptions(NumStepsToLookAhead=64);

Use agentOpts when creating your agent. Alternatively, create your agent first and then modify its
options, including the actor and critic options later using dot notation.

Configure the training algorithm to use asynchronous parallel training.

trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.Mode = "async";


Configure the workers to return gradient data to the host. Also, set the number of steps before the
workers send data back to the host to match the number of steps to look ahead.

trainOpts.ParallelizationOptions.DataToSendFromWorkers = ...
"gradients";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = ...
agentOpts.NumStepsToLookAhead;

Use trainOpts when training your agent.

For an example on asynchronous advantage actor-critic agent training, see “Train AC Agent to
Balance Cart-Pole System Using Parallel Computing”.

Tips
• For continuous action spaces, the rlACAgent object does not enforce the constraints set by the
action specification, so you must enforce action space constraints within the environment.
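For example, one common approach is to saturate the received action inside the environment step or
dynamics function. The following minimal sketch (the variable names are illustrative and not part of the
rlACAgent API) clips an action to the limits declared in an action specification:

actInfo = rlNumericSpec([1 1],LowerLimit=-2,UpperLimit=2);   % assumed action limits
rawAction = 3.5;                                             % e.g. an out-of-range actor output
clippedAction = min(max(rawAction,actInfo.LowerLimit),actInfo.UpperLimit)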

Version History
Introduced in R2019a

See Also
rlAgentInitializationOptions | rlACAgentOptions | rlValueFunction |
rlDiscreteCategoricalActor | rlContinuousGaussianActor | Deep Network Designer

Topics
“Actor-Critic Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”


rlACAgentOptions
Options for AC agent

Description
Use an rlACAgentOptions object to specify options for creating actor-critic (AC) agents. To create
an actor-critic agent, use rlACAgent.

For more information, see “Actor-Critic Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation
Syntax
opt = rlACAgentOptions
opt = rlACAgentOptions(Name,Value)

Description

opt = rlACAgentOptions creates a default option set for an AC agent. You can modify the object
properties using dot notation.

opt = rlACAgentOptions(Name,Value) sets option properties on page 3-30 using name-value


pairs. For example, rlACAgentOptions('DiscountFactor',0.95) creates an option set with a
discount factor of 0.95. You can specify multiple name-value pairs. Enclose each property name in
quotes.

Properties
NumStepsToLookAhead — Number of steps ahead
32 (default) | positive integer

Number of steps the agent interacts with the environment before learning from its experience,
specified as a positive integer. When the agent uses a recurrent neural network,
NumStepsToLookAhead is treated as the training trajectory length.

EntropyLossWeight — Entropy loss weight


0 (default) | scalar value between 0 and 1

Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.

When gradients are computed during training, an additional gradient component is computed for
minimizing this loss function.
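For example, this sketch creates an option set with a small entropy bonus to encourage exploration
(the value 0.01 is only illustrative):

opt = rlACAgentOptions(EntropyLossWeight=0.01);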


ActorOptimizerOptions — Actor optimizer options


rlOptimizerOptions object

Actor optimizer options, specified as an rlOptimizerOptions object. It allows you to specify
training parameters for the actor approximator, such as the learning rate and gradient threshold, as
well as the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and
rlOptimizer.

CriticOptimizerOptions — Critic optimizer options


rlOptimizerOptions object

Critic optimizer options, specified as an rlOptimizerOptions object. It allows you to specify
training parameters for the critic approximator, such as the learning rate and gradient threshold, as
well as the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and
rlOptimizer.
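For example, this sketch creates a default option set and then adjusts the critic optimizer settings
using dot notation (the values are illustrative):

opt = rlACAgentOptions;
opt.CriticOptimizerOptions.LearnRate = 1e-3;
opt.CriticOptimizerOptions.GradientThreshold = 1;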

SampleTime — Sample time of agent


1 (default) | positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

DiscountFactor — Discount factor


0.99 (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.

Object Functions
rlACAgent Actor-critic reinforcement learning agent

Examples

Create AC Agent Options Object

Create an AC agent options object, specifying the discount factor.

opt = rlACAgentOptions('DiscountFactor',0.95)

opt =
rlACAgentOptions with properties:

NumStepsToLookAhead: 32
EntropyLossWeight: 0


ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]


CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 1
DiscountFactor: 0.9500
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent sample time to 0.5.

opt.SampleTime = 0.5;

Version History
Introduced in R2019a

Simulation and deployment: UseDeterministicExploitation will be removed


Warns starting in R2022a

The property UseDeterministicExploitation of the rlACAgentOptions object will be removed


in a future release. Use the UseExplorationPolicy property of rlACAgent instead.

Previously, you set UseDeterministicExploitation as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.AgentOptions.UseDeterministicExploitation = true;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.AgentOptions.UseDeterministicExploitation = false;

Starting in R2022a, set UseExplorationPolicy as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.UseExplorationPolicy = false;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.UseExplorationPolicy = true;

Similarly to UseDeterministicExploitation, UseExplorationPolicy affects only simulation


and deployment; it does not affect training.

Default value for NumStepsToLookAhead changed to 32


Behavior change in future release

A value of 32 for this property should work better than 1 for most environments. If you have MATLAB
R2020b or a later version and you want to reproduce how rlACAgent behaved on versions prior to
R2020b, set this value to 1.
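For example:

opt = rlACAgentOptions(NumStepsToLookAhead=1);   % reproduce pre-R2020b behavior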


See Also
Topics
“Actor-Critic Agents”


rlAdditiveNoisePolicy
Policy object to generate continuous noisy actions for custom training loops

Description
This object implements an additive noise policy, which returns continuous deterministic actions with
added noise, given an input observation. You can create an rlAdditiveNoisePolicy object from an
rlContinuousDeterministicActor or extract it from an rlDDPGAgent or rlTD3Agent. You can
then train the policy object using a custom training loop. If UseNoisyAction is set to false, the policy
does not explore. This object is not compatible with generatePolicyBlock and
generatePolicyFunction. For more information on policies and value functions, see “Create
Policies and Value Functions”.

Creation
Syntax
policy = rlAdditiveNoisePolicy(actor)
policy = rlAdditiveNoisePolicy(actor,NoiseType=noiseType)

Description

policy = rlAdditiveNoisePolicy(actor) creates the additive noise policy object policy from
the continuous deterministic actor actor. It also sets the Actor property of policy to the input
argument actor.

policy = rlAdditiveNoisePolicy(actor,NoiseType=noiseType) specifies the type of noise


distribution for the policy. noiseType can be either "gaussian" (Gaussian noise) or "ou"
(Ornstein-Uhlenbeck noise). This syntax also sets the NoiseType property of policy to the input
argument noiseType.

Properties
Actor — Continuous deterministic actor
rlContinuousDeterministicActor object

Continuous deterministic actor, specified as an rlContinuousDeterministicActor object.

NoiseType — Noise type


"gaussian" (default) | "ou"

Noise type, specified as either "gaussian" (default, Gaussian noise) or "ou" (Ornstein-Uhlenbeck
noise). For more information on noise models, see “Noise Models” on page 3-365.
Example: "ou"

NoiseOptions — Noise model options


GaussianActionNoise object (default) | OrnsteinUhlenbeckActionNoise object


Noise model options, specified as a GaussianActionNoise object or an


OrnsteinUhlenbeckActionNoise object. For more information on noise models, see “Noise
Models” on page 3-365.

EnableNoiseDecay — Option to enable noise decay


true (default) | false

Option to enable noise decay, specified as a logical value: either true (default, enabling noise decay)
or false (disabling noise decay).
Example: false

UseNoisyAction — Option to enable noisy action


true (default) | false

Option to enable noisy actions, specified as a logical value: either true (default, adding noise to
actions, which helps exploration) or false (no noise is added to the actions). When noise is not
added to the actions the policy is deterministic and therefore it does not explore.
Example: false
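For example, the following sketch (assuming policy is an existing rlAdditiveNoisePolicy with a single
observation channel) disables the exploration noise, so that getAction returns the underlying
deterministic action:

policy.UseNoisyAction = false;
act = getAction(policy,{rand(policy.ObservationInfo.Dimension)});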

ObservationInfo — Observation specifications


rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


of such objects. These objects define properties such as the dimensions, data types, and names of the
observation signals.

ActionInfo — Action specifications


rlNumericSpec object

Action specifications, specified as an rlNumericSpec object. This object defines the properties of the
environment action channel, such as its dimensions, data type, and name. Note that the name of the
action channel specified in actionInfo (if any) is not used.

Note Only one action channel is allowed.

SampleTime — Sample time of policy


positive scalar | -1 (default)

Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.

Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2


Object Functions
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object

Examples

Create Additive Noise Policy from Continuous Deterministic Actor

Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.

obsInfo = rlNumericSpec([4 1]);


actInfo = rlNumericSpec([2 1]);

Alternatively, use getObservationInfo and getActionInfo to extract the specification objects


from an environment.

Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.

To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation and action spaces from
the environment specification objects.

layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];

Convert the network to a dlnetwork object and display the number of weights.

model = dlnetwork(layers);
summary(model)

Initialized: true

Number of learnables: 114

Inputs:
1 'input' 4 features

Create the actor using model, and the observation and action specifications.

actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)

actor =
rlContinuousDeterministicActor with properties:


ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlNumericSpec]
UseDevice: "cpu"

Check the actor with a random observation input.

act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}

ans = 2x1 single column vector

0.4013
0.0578

Create a policy object from actor.

policy = rlAdditiveNoisePolicy(actor)

policy =
rlAdditiveNoisePolicy with properties:

Actor: [1x1 rl.function.rlContinuousDeterministicActor]


NoiseType: "gaussian"
NoiseOptions: [1x1 rl.option.GaussianActionNoise]
EnableNoiseDecay: 1
UseNoisyAction: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlNumericSpec]
SampleTime: -1

You can access the policy options using dot notation. For example, change the upper and lower limits
of the distribution.

policy.NoiseOptions.LowerLimit = -3;
policy.NoiseOptions.UpperLimit = 3;

Check the policy with a random observation input.

act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}

ans = 2×1

0.1878
-0.1645

You can now train the policy with a custom training loop and then deploy it to your application.

Create Additive Noise Policy Specifying Noise Model

Create observation and action specification objects. For this example, define the observation and
action spaces as continuous three- and one-dimensional spaces, respectively.


obsInfo = rlNumericSpec([3 1]);


actInfo = rlNumericSpec([1 1]);

Alternatively, use getObservationInfo and getActionInfo to extract the specification objects


from an environment.

Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.

To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation and action spaces from
the environment specification objects.
layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(9)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];

Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)

Initialized: true

Number of learnables: 46

Inputs:
1 'input' 3 features

Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)

actor =
rlContinuousDeterministicActor with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlNumericSpec]
UseDevice: "cpu"

Check the actor with a random observation input.


act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}

ans = single
-0.2535

Create a policy object from actor, specifying an Ornstein-Uhlenbeck probability distribution for the
noise.
policy = rlAdditiveNoisePolicy(actor,NoiseType="ou")

policy =
rlAdditiveNoisePolicy with properties:


Actor: [1x1 rl.function.rlContinuousDeterministicActor]


NoiseType: "ou"
NoiseOptions: [1x1 rl.option.OrnsteinUhlenbeckActionNoise]
EnableNoiseDecay: 1
UseNoisyAction: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlNumericSpec]
SampleTime: -1

You can access the policy options using dot notation. For example, change the standard deviation of
the distribution.

policy.NoiseOptions.StandardDeviation = 0.6;

Check the policy with a random observation input.

act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}

ans = -0.1625

You can now train the policy with a custom training loop and then deploy it to your application.

Version History
Introduced in R2022a

See Also
Functions
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlStochasticActorPolicy | rlTD3Agent | rlDDPGAgent

Blocks
RL Agent

Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”


rlAgentInitializationOptions
Options for initializing reinforcement learning agents

Description
Use the rlAgentInitializationOptions object to specify initialization options for an agent. To
create an agent, use the specific agent creation function, such as rlACAgent.

Creation

Syntax
initOpts = rlAgentInitializationOptions
initOpts = rlAgentInitializationOptions(Name,Value)

Description

initOpts = rlAgentInitializationOptions returns a default options object for initializing a


reinforcement learning agent that supports default networks. Use the initialization options to specify
agent initialization parameters, such as the number of units for each hidden layer of the agent
networks and whether to use a recurrent neural network.

initOpts = rlAgentInitializationOptions(Name,Value) creates an initialization options


object and sets its properties on page 3-40 by using one or more name-value pair arguments.

Properties
NumHiddenUnit — Number of units in each hidden fully connected layer
256 (default) | positive integer

Number of units in each hidden fully connected layer of the agent networks, except for the fully
connected layer just before the network output, specified as a positive integer. The value you set also
applies to any LSTM layers.
Example: 'NumHiddenUnit',64

UseRNN — Flag to use recurrent neural network


false (default) | true

Flag to use recurrent neural network, specified as a logical.

If you set UseRNN to true, during agent creation the software inserts a recurrent LSTM layer with
the output mode set to sequence in the output path of the agent networks. Policy gradient and actor-
critic agents do not support recurrent neural networks. For more information on LSTM, see “Long
Short-Term Memory Networks”.
Example: 'UseRNN',true


Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
rlTRPOAgent Trust region policy optimization reinforcement learning agent

Examples

Create Agent Initialization Options Object

Create an agent initialization options object, specifying the number of hidden neurons and use of a
recurrent neural network.

initOpts = rlAgentInitializationOptions('NumHiddenUnit',64,'UseRNN',true)

initOpts =
rlAgentInitializationOptions with properties:

NumHiddenUnit: 64
UseRNN: 1

You can modify the options using dot notation. For example, set the number of hidden units to 128.

initOpts.NumHiddenUnit = 128

initOpts =
rlAgentInitializationOptions with properties:

NumHiddenUnit: 128
UseRNN: 1
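As a further sketch (the observation and action specifications here are only illustrative), you can then
pass the options object to an agent creation function that supports default networks, such as
rlPPOAgent:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
agent = rlPPOAgent(obsInfo,actInfo,initOpts);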

Version History
Introduced in R2020b

See Also
getActionInfo | getObservationInfo

Topics
“Reinforcement Learning Agents”


rlContinuousDeterministicActor
Deterministic actor with a continuous action space for reinforcement learning agents

Description
This object implements a function approximator to be used as a deterministic actor within a
reinforcement learning agent with a continuous action space. A continuous deterministic actor takes
an environment state as input and returns as output the action that maximizes the expected
discounted cumulative long-term reward, thereby implementing a deterministic policy. After you
create an rlContinuousDeterministicActor object, use it to create a suitable agent, such as
rlDDPGAgent. For more information on creating representations, see “Create Policies and Value
Functions”.

Creation

Syntax
actor = rlContinuousDeterministicActor(net,observationInfo,actionInfo)
actor = rlContinuousDeterministicActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames)

actor = rlContinuousDeterministicActor({basisFcn,W0},observationInfo,
actionInfo)

actor = rlContinuousDeterministicActor( ___ ,UseDevice=useDevice)

Description

actor = rlContinuousDeterministicActor(net,observationInfo,actionInfo) creates a


continuous deterministic actor object using the deep neural network net as underlying approximator.
For this actor, actionInfo must specify a continuous action space. The network input layers are
automatically associated with the environment observation channels according to the dimension
specifications in observationInfo. The network must have a single output layer with the same data
type and dimensions as the action specified in actionInfo. This function sets the
ObservationInfo and ActionInfo properties of actor to the observationInfo and
actionInfo input arguments, respectively.

Note actor does not enforce constraints set by the action specification; therefore, when using this
actor, you must enforce action space constraints within the environment.

actor = rlContinuousDeterministicActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames) specifies the names of the network input
layers to be associated with the environment observation channels. The function assigns, in
sequential order, each environment observation channel specified in observationInfo to the layer
specified by the corresponding name in the string array netObsNames. Therefore, the network input
layers, ordered as the names in netObsNames, must have the same data type and dimensions as the
observation specifications, as ordered in observationInfo.

actor = rlContinuousDeterministicActor({basisFcn,W0},observationInfo,
actionInfo) creates a continuous deterministic actor object using a custom basis function as
underlying approximator. The first input argument is a two-element cell array whose first element is
the handle basisFcn to a custom basis function and whose second element is the initial weight
vector W0. This function sets the ObservationInfo and ActionInfo properties of actor to the
observationInfo and actionInfo input arguments, respectively.

actor = rlContinuousDeterministicActor( ___ ,UseDevice=useDevice) specifies the


device used to perform computational operations on the actor object, and sets the UseDevice
property of actor to the useDevice input argument. You can use this syntax with any of the
previous input-argument combinations.

Input Arguments

net — Deep neural network


array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object |
dlNetwork object (preferred)

Deep neural network used as the underlying approximator within the actor, specified as one of the
following:

• Array of Layer objects


• layerGraph object
• DAGNetwork object
• SeriesNetwork object
• dlnetwork object

Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.

The network must have the environment observation channels as inputs and a single output layer
representing the action.

rlContinuousDeterministicActor objects support recurrent deep neural networks. For an


example, see “Create Deterministic Actor from Recurrent Neural Network” on page 3-50.

The learnable parameters of the actor are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.

netObsNames — Network input layer names corresponding to the environment observation channels
string array | cell array of character vectors


Network input layer names corresponding to the environment observation channels. When you use
the name-value argument ObservationInputNames with netObsNames, the function assigns, in
sequential order, each environment observation channel specified in observationInfo to the
network input layer specified by the corresponding name in the string array netObsNames.
Therefore, the network input layers, ordered as the names in netObsNames, must have the same
data type and dimensions as the observation specifications, as ordered in observationInfo.

Note Of the information specified in observationInfo, the function only uses the data type and
dimension of each channel, but not its (optional) name or description.

Example: {"NetInput1_airspeed","NetInput2_altitude"}

basisFcn — Custom basis function


function handle

Custom basis function, specified as a function handle to a user-defined MATLAB function. The
user-defined function can be either an anonymous function or a function on the MATLAB path. The action
to be taken based on the current observation, which is the output of the actor, is the vector a =
W'*B, where W is a weight matrix containing the learnable parameters and B is the column vector
returned by the custom basis function.

Your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo.
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]

W0 — Initial value of basis function weights


matrix

Initial value of the basis function weights W, specified as a matrix having as many rows as the length
of the vector returned by the basis function and as many columns as the dimension of the action
space.

Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


of such objects. These objects define properties such as the dimensions, data types, and names of the
observation signals.

rlContinuousDeterministicActor sets the ObservationInfo property of actor to the input


observationInfo.

You can extract ObservationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually.


ActionInfo — Action specifications


rlNumericSpec object

Action specifications for a continuous action space, specified as an rlNumericSpec object defining
properties such as dimensions, data type and name of the action signals.

rlContinuousDeterministicActor sets the ActionInfo property of actor to the input
actionInfo.

You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually.

For custom basis function representations, the action signal must be a scalar or a column vector.

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox™ software and a CUDA® enabled
NVIDIA® GPU. For more information on supported GPUs see “GPU Computing Requirements”
(Parallel Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations on a CPU.

To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"

Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning
agent
getAction Obtain action from agent, actor, or policy object given environment
observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object


setLearnableParameters Set learnable parameter values of agent, function approximator, or policy


object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic

Examples

Create Continuous Deterministic Actor from Deep Neural Network

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.

obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.

actInfo = rlNumericSpec([2 1]);

To approximate the policy within the actor, use a deep neural network. The input of the network must
accept a four-element vector (the observation vector just defined by obsInfo), and its output must
be the action, a two-element vector, as defined by actInfo.

Create a neural network as an array of layer objects.

net = [featureInputLayer(4)
fullyConnectedLayer(2)];

Convert the network to a dlnetwork object and display the number of learnable parameters.

net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 10

Inputs:
1 'input' 4 features

Create the actor object with rlContinuousDeterministicActor, using the network and the
observation and action specification objects as input arguments. The network input layer is
automatically associated with the environment observation channel according to the dimension
specifications in obsInfo.

actor = rlContinuousDeterministicActor( ...


net, ...
obsInfo, ...
actInfo)

actor =
rlContinuousDeterministicActor with properties:


ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlNumericSpec]
UseDevice: "cpu"

To check your actor, use getAction to return the action from a random observation, using the
current network weights.
act = getAction(actor, ...
{rand(obsInfo.Dimension)});
act{1}

ans = 2x1 single column vector

-0.5054
1.5390

You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).

Create Continuous Deterministic Actor Specifying Network Input Layer Name

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
actInfo = rlNumericSpec([2 1]);

To approximate the policy within the actor, use a deep neural network. The input of the network must
accept a four-element vector (the observation vector just defined by obsInfo), and its output must
be the action, a two-element vector, as defined by actInfo.

Create the network as an array of layer objects. Name the network input layer netObsIn so you can
later explicitly associate it to the observation input channel.
net = [
featureInputLayer(4,Name="netObsIn")
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(2)];

Convert the network to a dlnetwork object, and display the number of learnable parameters.
net = dlnetwork(net);
summary(net)


Initialized: true

Number of learnables: 114

Inputs:
1 'netObsIn' 4 features

Create the actor object with rlContinuousDeterministicActor, using the network, the
observation and action specification objects, and the name of the network input layer to be associated
with the environment observation channel.

actor = rlContinuousDeterministicActor(net, ...
    obsInfo,actInfo, ...
    ObservationInputNames="netObsIn")

actor =
rlContinuousDeterministicActor with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlNumericSpec]
UseDevice: "cpu"

To check your actor, use getAction to return the action from a random observation, using the
current network weights.

act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}

ans = 2x1 single column vector

0.4013
0.0578

You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).

Create Continuous Deterministic Actor from Custom Basis Function

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels, the first is a two-by-two continuous matrix and the second is a scalar that
can assume only two values, 0 and 1.

obsInfo = [rlNumericSpec([2 2])


rlFiniteSetSpec([0 1])];

Create a continuous action space specification object (or alternatively use getActionInfo to extract
the specification object from an environment). For this example, define the action space as a
continuous three-dimensional space, so that a single action is a column vector containing three
doubles.

actInfo = rlNumericSpec([3 1]);


Create a custom basis function with two input arguments in which each output element is a function
of the observations defined by obsInfo.

myBasisFcn = @(obsA,obsB) [obsA(1,1)+obsB(1)^2;


obsA(2,1)-obsB(1)^2;
obsA(1,2)^2+obsB(1);
obsA(2,2)^2-obsB(1)];

The output of the actor is the vector W'*myBasisFcn(obsA,obsB), which is the action taken as a
result of the given observation. The weight matrix W contains the learnable parameters and must have
as many rows as the length of the basis function output and as many columns as the dimension of the
action space.

Define an initial parameter matrix.

W0 = rand(4,3);

Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight matrix. The second and third arguments are, respectively, the
observation and action specification objects.

actor = rlContinuousDeterministicActor({myBasisFcn,W0},obsInfo,actInfo)

actor =
rlContinuousDeterministicActor with properties:

ObservationInfo: [2x1 rl.util.RLDataSpec]


ActionInfo: [1x1 rl.util.rlNumericSpec]
UseDevice: "cpu"

To check your actor, use the getAction function to return the action from a given observation, using
the current parameter matrix.

a = getAction(actor,{rand(2,2),0})

a = 1x1 cell array


{3x1 double}

a{1}

ans = 3×1

1.3192
0.8420
1.5053

Note that the actor does not enforce the set constraint for the discrete set elements.

a = getAction(actor,{rand(2,2),-1});
a{1}

ans = 3×1

2.7890
1.8375
3.0855

You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).

Create Deterministic Actor from Recurrent Neural Network

Create observation and action information. You can also obtain these specifications from an
environment. For this example, define the observation space as a continuous four-dimensional space,
so that a single observation is a column vector containing four doubles, and the action space as a
continuous two-dimensional space, so that a single action is a column vector containing two doubles.

obsInfo = rlNumericSpec([4 1]);


actInfo = rlNumericSpec([2 1]);

To approximate the policy within the actor, use a recurrent deep neural network. You can obtain the
dimension of the observation and action spaces from the environment specification objects.

Create a neural network as an array of layer objects. Since this network is recurrent, use a
sequenceInputLayer as the input layer and at least one lstmLayer.

net = [sequenceInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(10)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(20)
fullyConnectedLayer(actInfo.Dimension(1))
tanhLayer];

Convert the network to a dlnetwork object and display the number of learnable parameters.

net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 880

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions

Create a deterministic actor representation for the network.

actor = rlContinuousDeterministicActor( ...


net, ...
obsInfo, ...
actInfo);

To check your actor, use getAction to return the action from a random observation, given the
current network weights.

a = getAction(actor, ...
{rand(obsInfo.Dimension)});
a{1}


ans = 2x1 single column vector

-0.0742
0.0158

You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).

Version History
Introduced in R2022a

See Also
Functions
rlDiscreteCategoricalActor | rlContinuousGaussianActor | getActionInfo |
getObservationInfo

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlContinuousDeterministicRewardFunction
Deterministic reward function approximator object for neural network-based environment

Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify the reward function approximator using an
rlContinuousDeterministicRewardFunction object. Do so when you do not know a ground-
truth reward signal for your environment but you expect the reward signal to be deterministic.

The reward function approximator object uses a deep neural network as internal approximation
model to predict the reward signal for the environment given one of the following input combinations.

• Observations, actions, and next observations


• Observations and actions
• Actions and next observations
• Next observations

To specify a stochastic reward function, use an rlContinuousGaussianRewardFunction object.

Creation

Syntax
rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,observationInfo,
actionInfo,Name=Value)

Description

rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,observationInfo,
actionInfo,Name=Value) creates the deterministic reward function approximator object
rwdFcnAppx using the deep neural network net and sets the ObservationInfo and ActionInfo
properties.

When creating a reward function you must specify the names of the deep neural network inputs using
one of the following combinations of name-value pair arguments.

• ObservationInputNames, ActionInputNames, and NextObservationInputNames


• ObservationInputNames and ActionInputNames
• ActionInputNames and NextObservationInputNames
• NextObservationInputNames

You can also specify the UseDevice property using an optional name-value pair argument. For
example, to use a GPU for prediction, specify UseDevice="gpu".
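For instance, a minimal sketch of the second input combination (observations and actions) follows. Here, rwdNet is a placeholder dlnetwork whose input layers are assumed to be named "obs" and "act" and whose dimensions match obsInfo and actInfo.

% Hedged sketch: the layer names below are assumptions for illustration only.
rwdFcnAppx = rlContinuousDeterministicRewardFunction(rwdNet,obsInfo,actInfo, ...
    ObservationInputNames="obs", ...
    ActionInputNames="act");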


Input Arguments

net — Deep neural network


dlnetwork object

Deep neural network with a scalar output value, specified as a dlnetwork object.

The input layer names for this network must match the input names specified using the
ObservationInputNames, ActionInputNames, and NextObservationInputNames. The
dimensions of the input layers must match the dimensions of the corresponding observation and
action specifications in ObservationInfo and ActionInfo, respectively.

Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the


argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.
Example: ObservationInputNames="velocity"

ObservationInputNames — Observation input layer names


string | string array

Observation input layer names, specified as a string or string array. Specify


ObservationInputNames when you expect the reward signal to depend on the current environment
observation.

The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.

ActionInputNames — Action input layer names


string | string array

Action input layer names, specified as a string or string array. Specify ActionInputNames when you
expect the reward signal to depend on the current action value.

The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.

NextObservationInputNames — Next observation input layer names


string | string array

Next observation input layer names, specified as a string or string array. Specify
NextObservationInputNames when you expect the reward signal to depend on the next
environment observation.

The number of next observation input names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

This property is read-only.


Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the
observation signals.

You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specifications


specification object

This property is read-only.

Action specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the action
signals.

You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Training or simulating a network on a GPU involves device-specific numerical round-off errors. These
errors can produce different results compared to performing the same operations using a CPU.
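For instance, a minimal sketch of placing the approximator on a GPU follows, assuming a supported GPU is available and that net, obsInfo, and actInfo are defined as in the example later on this page.

% Hedged sketch: select the first local GPU, then create the approximator on it.
if canUseGPU
    gpuDevice(1);
    rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,obsInfo,actInfo, ...
        ActionInputNames="action", ...
        NextObservationInputNames="nextState", ...
        UseDevice="gpu");
end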

Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models

Examples

Create Deterministic Reward Function and Predict Reward

Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.

env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);


To approximate the reward function, create a deep neural network. For this example, the network has
two input channels, one for the current action and one for the next observations. The single output
channel contains a scalar, which represents the value of the predicted reward.

Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specifications, and specify a name for the input layers, so you can
later explicitly associate them with the appropriate environment channel.

actionPath = featureInputLayer( ...
    actInfo.Dimension(1), ...
    Name="action");

nextStatePath = featureInputLayer( ...
    obsInfo.Dimension(1), ...
    Name="nextState");

commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC2")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu3")
fullyConnectedLayer(1,Name="reward")];

net = layerGraph(nextStatePath);
net = addLayers(net,actionPath);
net = addLayers(net,commonPath);

net = connectLayers(net,"nextState","concat/in1");
net = connectLayers(net,"action","concat/in2");

plot(net)


Create a dlnetwork object and display the number of weights.


net = dlnetwork(net);
summary(net);

Initialized: true

Number of learnables: 8.7k

Inputs:
1 'nextState' 4 features
2 'action' 1 features

Create a deterministic reward function approximator object.

rwdFcnAppx = rlContinuousDeterministicRewardFunction( ...
    net,obsInfo,actInfo, ...
    ActionInputNames="action", ...
    NextObservationInputNames="nextState");

Using this reward function object, you can predict the next reward value based on the current action
and next observation. For example, predict the reward for a random action and next observation.
Since, for this example, only the action and the next observation influence the reward, use an empty
cell array for the current observation.
act = rand(actInfo.Dimension);
nxtobs = rand(obsInfo.Dimension);
reward = predict(rwdFcnAppx,{}, {act}, {nxtobs})


reward = single
0.1034

To predict the reward, you can also use evaluate.

reward_ev = evaluate(rwdFcnAppx, {act,nxtobs} )

reward_ev = 1x1 cell array
    {[0.1034]}

Version History
Introduced in R2022a

See Also
Objects
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction | rlContinuousGaussianRewardFunction |
rlNeuralNetworkEnvironment | rlIsDoneFunction | evaluate | gradient | accelerate

Topics
“Model-Based Policy Optimization Agents”


rlContinuousDeterministicTransitionFunction
Deterministic transition function approximator object for neural network-based environment

Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify deterministic transition function approximators using
rlContinuousDeterministicTransitionFunction objects.

A transition function approximator object uses a deep neural network to predict the next observations
based on the current observations and actions.

To specify stochastic transition function approximators, use


rlContinuousGaussianTransitionFunction objects.

Creation

Syntax
tsnFcnAppx = rlContinuousDeterministicTransitionFunction(net,observationInfo,
actionInfo,Name=Value)

Description

tsnFcnAppx = rlContinuousDeterministicTransitionFunction(net,observationInfo,
actionInfo,Name=Value) creates a deterministic transition function approximator object using
the deep neural network net and sets the ObservationInfo and ActionInfo properties.

When creating a deterministic transition function approximator you must specify the names of the
deep neural network inputs and outputs using the ObservationInputNames, ActionInputNames,
and NextObservationOutputNames name-value pair arguments.

You can also specify the PredictDiff and UseDevice properties using optional name-value pair
arguments. For example, to use a GPU for prediction, specify UseDevice="gpu".

Input Arguments

net — Deep neural network


dlnetwork object

Deep neural network, specified as a dlnetwork object.

The input layer names for this network must match the input names specified using
ObservationInputNames and ActionInputNames. The dimensions of the input layers must match
the dimensions of the corresponding observation and action specifications in ObservationInfo and
ActionInfo, respectively.


The output layer names for this network must match the output names specified using
NextObservationOutputNames. The dimensions of the output layers must match the dimensions of
the corresponding observation specifications in ObservationInfo.

Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the


argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.
Example: ObservationInputNames="velocity"

ObservationInputNames — Observation input layer names


string | string array

Observation input layer names, specified as a string or string array.

The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.

ActionInputNames — Action input layer names


string | string array

Action input layer names, specified as a string or string array.

The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.

NextObservationOutputNames — Next observation output layer names


string | string array

Next observation output layer names, specified as a string or string array.

The number of next observation output names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

This property is read-only.

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the
observation signals.

You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specifications


specification object

This property is read-only.


Action specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the action
signals.

You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.

PredictDiff — Option to predict the difference between the current observation and the
next observation
false (default) | true

Option to predict the difference between the current observation and the next observation, specified
as one of the following logical values.

• false — Select this option if net outputs the value of the next observation.
• true — Select this option if net outputs the difference between the next observation and the
current observation. In this case, the predict function computes the next observation by adding
the current observation to the output of net.
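For instance, a minimal sketch of creating an approximator that predicts the observation change follows. Here, diffNet is a placeholder dlnetwork whose output layer, assumed to be named "obsDelta", returns the difference between the next and current observations.

% Hedged sketch: with PredictDiff=true, predict adds the current observation
% to the network output to form the next observation.
tsnFcnAppx = rlContinuousDeterministicTransitionFunction(diffNet,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="obsDelta", ...
    PredictDiff=true);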

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.

Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models

Examples

Create Deterministic Transition Function and Predict Next Observation

Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a deep neural network. The network has two input channels, one for the current observations
and one for the current actions. The single output channel is for the predicted next observation.


statePath = featureInputLayer(obsInfo.Dimension(1),...
Normalization="none",Name="state");
actionPath = featureInputLayer(actInfo.Dimension(1),...
Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64, Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObservation")];

tsnNet = layerGraph(statePath);
tsnNet = addLayers(tsnNet,actionPath);
tsnNet = addLayers(tsnNet,commonPath);

tsnNet = connectLayers(tsnNet,"state","concat/in1");
tsnNet = connectLayers(tsnNet,"action","concat/in2");

plot(tsnNet)

Create a dlnetwork object.


tsnNet = dlnetwork(tsnNet);

Create a deterministic transition function object.


tsnFcnAppx = rlContinuousDeterministicTransitionFunction( ...
    tsnNet,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");

Using this transition function object, you can predict the next observation based on the current
observation and action. For example, predict the next observation for a random observation and
action.

obs = rand(obsInfo.Dimension);
act = rand(actInfo.Dimension);
nextObsP = predict(tsnFcnAppx,{obs},{act})

nextObsP = 1x1 cell array
    {4x1 single}

nextObsP{1}

ans = 4x1 single column vector

-0.1172
0.1168
0.0493
-0.0155

You can also obtain the same result using evaluate.

nextObsE = evaluate(tsnFcnAppx,{obs,act})

nextObsE = 1x1 cell array
    {4x1 single}

nextObsE{1}

ans = 4x1 single column vector

-0.1172
0.1168
0.0493
-0.0155
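As a next step, you can combine this transition approximator with reward and is-done approximators to build a neural network environment model. The following is a minimal sketch, assuming rwdFcnAppx and isDoneFcnAppx are placeholder approximators created as shown on the rlContinuousDeterministicRewardFunction and rlIsDoneFunction pages, and assuming the rlNeuralNetworkEnvironment constructor accepts the specifications followed by the transition, reward, and is-done approximators (see its reference page for the exact syntax).

% Hedged sketch: rwdFcnAppx and isDoneFcnAppx are placeholders built elsewhere.
env = rlNeuralNetworkEnvironment(obsInfo,actInfo, ...
    tsnFcnAppx,rwdFcnAppx,isDoneFcnAppx);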

Version History
Introduced in R2022a

See Also
Objects
rlNeuralNetworkEnvironment | rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | evaluate | gradient | accelerate


Topics
“Model-Based Policy Optimization Agents”


rlContinuousGaussianActor
Stochastic Gaussian actor with a continuous action space for reinforcement learning agents

Description
This object implements a function approximator to be used as a stochastic actor within a
reinforcement learning agent with a continuous action space. A continuous Gaussian actor takes an
environment observation as input and returns as output a random action sampled from a Gaussian
probability distribution, thereby implementing a stochastic policy. After you create an
rlContinuousGaussianActor object, use it to create a suitable agent, such as an rlACAgent or
rlPGAgent agent. For more information on creating actors and critics, see “Create Policies and Value
Functions”.

Creation
Syntax
actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=netStdvActName)
actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=
netStdActName,ObservationInputNames=netObsNames)

actor = rlContinuousGaussianActor( ___ ,UseDevice=useDevice)

Description

actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=netStdvActName) creates a
Gaussian stochastic actor with a continuous action space using the deep neural network net as
function approximator. Here, net must have two differently named output layers, each with as many
elements as the number of dimensions of the action space, as specified in actionInfo. The two
output layers calculate the mean and standard deviation of each component of the action. The actor
uses these layers, according to the names specified in the strings netMeanActName and
netStdvActName, to represent the Gaussian probability distribution from which the action is
sampled. The function sets the ObservationInfo and ActionInfo properties of actor to the input
arguments observationInfo and actionInfo, respectively.

Note actor does not enforce constraints set by the action specification, therefore, when using this
actor, you must enforce action space constraints within the environment.

actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=
netStdvActName,ObservationInputNames=netObsNames) specifies the names of the network
input layers to be associated with the environment observation channels. The function assigns, in
sequential order, each environment observation channel specified in observationInfo to the layer
specified by the corresponding name in the string array netObsNames. Therefore, the network input
layers, ordered as the names in netObsNames, must have the same data type and dimensions as the
observation specifications, as ordered in observationInfo.

actor = rlContinuousGaussianActor( ___ ,UseDevice=useDevice) specifies the device


used to perform computational operations on the actor object, and sets the UseDevice property of
actor to the useDevice input argument. You can use this syntax with any of the previous input-
argument combinations.

Input Arguments

net — Deep neural network


array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object |
dlNetwork object (preferred)

Deep neural network used as the underlying approximator within the actor. The network must have
two differently named output layers each with as many elements as the number of dimensions of the
action space, as specified in actionInfo. The two output layers calculate the mean and standard
deviation of each component of the action. The actor uses these layers, according to the names
specified in the strings netMeanActName and netStdActName, to represent the Gaussian
probability distribution from which the action is sampled.

Note Standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore, the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, and the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.

You can specify the network as one of the following:

• Array of Layer objects


• layerGraph object
• DAGNetwork object
• SeriesNetwork object
• dlnetwork object

Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
neural network object from the Deep Learning Toolbox. The resulting dlnet is the dlnetwork object
that you use for your critic or actor. This practice allows a greater level of insight and control for
cases in which the conversion is not straightforward and might require additional specifications.

rlContinuousGaussianActor objects support recurrent deep neural networks.


The learnable parameters of the actor are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.

netMeanActName — Names of the network output layers corresponding to the mean values
of the action channel
string | character vector

Names of the network output layers corresponding to the mean values of the action channel,
specified as a string or character vector. The actor uses this name to select the network output layer
that returns the mean values of each elements of the action channel. Therefore, this network output
layer must be named as indicated in netMeanActName. Furthermore, it must be a scaling layer that
scales the returned mean values to the desired action range.

Note Of the information specified in actionInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.

Example: "myNetOut_Force_Mean_Values"

netStdvActName — Names of the network output layers corresponding to the standard deviations of the action channel
string | character vector

Names of the network output layers corresponding to the standard deviations of the action channel,
specified as a string or character vector. The actor uses this name to select the network output layer
that returns the standard deviations of each elements of the action channel. Therefore, this network
output layer must be named as indicated in netStdvActName. Furthermore, it must be a softplus or
ReLU layer, to enforce nonnegativity of the returned standard deviations.

Note Of the information specified in actionInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.

Example: "myNetOut_Force_Standard_Deviations"

netObsNames — Network input layer names corresponding to the environment observation channels
string array | cell array of character vectors

Network input layers names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. When you use this argument after
'ObservationInputNames', the function assigns, in sequential order, each environment
observation channel specified in observationInfo to each network input layer specified by the
corresponding name in the string array netObsNames. Therefore, the network input layers, ordered
as the names in netObsNames, must have the same data type and dimensions as the observation
specifications, as ordered in observationInfo.

Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.


Example: {"NetInput1_airspeed","NetInput2_altitude"}

Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


of such objects. These objects define properties such as the dimensions, data types, and names of the
observation signals.

rlContinuousGaussianActor sets the ObservationInfo property of actor to the input


observationInfo.

You can extract ObservationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually.

ActionInfo — Action specifications


rlNumericSpec object

Action specifications, specified as an rlNumericSpec object. This object defines the properties of the
environment action channel, such as its dimensions, data type, and name. Note that the function does
not use the name of the action channel specified in actionInfo.

Note Only one action channel is allowed.

rlContinuousGaussianActor sets the ActionInfo property of actor to the input actionInfo.

You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.

To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.


Example: 'UseDevice',"gpu"

Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment
observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic

Examples

Create Continuous Gaussian Actor from Deep Neural Network

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous five-dimensional space, so that a single observation is a column vector containing five
doubles.
obsInfo = rlNumericSpec([5 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
three-dimensional space, so that a single action is a column vector containing three doubles, each
between -10 and 10.
actInfo = rlNumericSpec([3 1], ...
LowerLimit=-10, ...
UpperLimit=10);

To approximate the policy within the actor, use a deep neural network.

For a continuous Gaussian actor, the network must take the observation signal as input and return
both a mean value and a standard deviation value for each action. Therefore it must have two output
layers (one for the mean values, the other for the standard deviation values), each having as many
elements as the dimension of the action space. You can obtain the dimensions of the observation and
action spaces from the environment specification objects (for example, prod(obsInfo.Dimension)
returns the total number of observation elements regardless of whether the observation space is a
column vector, row vector, or matrix).

Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.

Create each network path as an array of layer objects. Specify a name for the input and output layers,
so you can later explicitly associate them with the correct channels.

% Input path layers
inPath = [
    featureInputLayer( ...
        prod(obsInfo.Dimension), ...
        Name="netOin")
    fullyConnectedLayer( ...
        prod(actInfo.Dimension), ...
        Name="infc")
    ];

% Path layers for mean value
% Using scalingLayer to scale range from (-1,1) to (-10,10)
meanPath = [
    tanhLayer(Name="tanhMean");
    fullyConnectedLayer(prod(actInfo.Dimension));
    scalingLayer(Name="scale", ...
        Scale=actInfo.UpperLimit)
    ];

% Path layers for standard deviations
% Using softplus layer to make them non negative
sdevPath = [
    tanhLayer(Name="tanhStdv");
    fullyConnectedLayer(prod(actInfo.Dimension));
    softplusLayer(Name="splus")
    ];

% Add layers to network object
net = layerGraph(inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);

% Connect layers
net = connectLayers(net,"infc","tanhMean/in");
net = connectLayers(net,"infc","tanhStdv/in");

% Plot the network
plot(net)


Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 42

Inputs:
1 'netOin' 5 features

Create the actor with rlContinuousGaussianActor, using the network, the observation and action
specification objects, and the names of the network input and output layers.
actor = rlContinuousGaussianActor(net, obsInfo, actInfo, ...
ActionMeanOutputNames="scale",...
ActionStandardDeviationOutputNames="splus",...
ObservationInputNames="netOin");

To check your actor, use getAction to return an action from a random observation vector, using the
current network weights. Each of the three elements of the action vector is a random sample from the
Gaussian distribution with mean and standard deviation calculated, as a function of the current
observation, by the neural network.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}


ans = 3x1 single column vector

-12.0285
1.7628
10.8733

To return the Gaussian distribution of the action, given an observation, use evaluate.

dist = evaluate(actor,{rand(obsInfo.Dimension)});

Display the vector of mean values.

dist{1}

ans = 3x1 single column vector

-5.6127
3.9449
9.6213

Display the vector of standard deviations.

dist{2}

ans = 3x1 single column vector

0.8516
0.8366
0.7004

You can now use the actor to create a suitable agent (such as rlACAgent, rlPGAgent, rlSACAgent,
rlPPOAgent, or rlTRPOAgent).
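For instance, a minimal sketch of pairing the actor with a state-value critic to build an actor-critic agent follows. Here, criticNet is a placeholder dlnetwork (not created in this example) with a single input layer matching obsInfo and a scalar output.

% Hedged sketch: criticNet is a placeholder value-function network.
critic = rlValueFunction(criticNet,obsInfo);
agent = rlACAgent(actor,critic);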

Version History
Introduced in R2022a

See Also
Functions
rlContinuousDeterministicActor | rlDiscreteCategoricalActor | getActionInfo |
getObservationInfo

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlContinuousGaussianRewardFunction
Stochastic Gaussian reward function approximator object for neural network-based environment

Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify the reward function approximator using an
rlContinuousGaussianRewardFunction object. Do so when you do not know a ground-
truth reward signal for your environment and you expect the reward signal to be stochastic.

The reward function object uses a deep neural network as internal approximation model to predict
the reward signal for the environment given one of the following input combinations.

• Observations, actions, and next observations


• Observations and actions
• Actions and next observations
• Next observations

To specify a deterministic reward function approximator, use an


rlContinuousDeterministicRewardFunction object.

Creation

Syntax
rwdFcnAppx = rlContinuousGaussianRewardFunction(net,observationInfo,
actionInfo,Name=Value)

Description

rwdFcnAppx = rlContinuousGaussianRewardFunction(net,observationInfo,
actionInfo,Name=Value) creates a stochastic reward function using the deep neural network net
and sets the ObservationInfo and ActionInfo properties.

When creating a reward function you must specify the names of the deep neural network inputs using
one of the following combinations of name-value pair arguments.

• ObservationInputNames, ActionInputNames, and NextObservationInputNames


• ObservationInputNames and ActionInputNames
• ActionInputNames and NextObservationInputNames
• NextObservationInputNames

You must also specify the names of the deep neural network outputs using the
RewardMeanOutputName and RewardStandardDeviationOutputName name-value pair
arguments.


You can also specify the UseDevice property using an optional name-value pair argument. For
example, to use a GPU for prediction, specify UseDevice="gpu".

Input Arguments

net — Deep neural network


dlnetwork object

Deep neural network with a scalar output value, specified as a dlnetwork object.

The input layer names for this network must match the input names specified using the
ObservationInputNames, ActionInputNames, and NextObservationInputNames. The
dimensions of the input layers must match the dimensions of the corresponding observation and
action specifications in ObservationInfo and ActionInfo, respectively.

Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the


argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.
Example: ObservationInputNames="velocity"

ObservationInputNames — Observation input layer names


string | string array

Observation input layer names, specified as a string or string array. Specify


ObservationInputNames when you expect the reward signal to depend on the current environment
observation.

The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.

ActionInputNames — Action input layer names


string | string array

Action input layer names, specified as a string or string array. Specify ActionInputNames when you
expect the reward signal to depend on the current action value.

The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.

NextObservationInputNames — Next observation input layer names


string | string array

Next observation input layer names, specified as a string or string array. Specify
NextObservationInputNames when you expect the reward signal to depend on the next
environment observation.

The number of next observation input names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.

RewardMeanOutputName — Reward mean output layer name


string

Reward mean output layer name, specified as a string.


RewardStandardDeviationOutputName — Reward standard deviation output layer name


string

Reward standard deviation output layer name, specified as a string.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

This property is read-only.

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the
observation signals.

You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specifications


specification object | array of specification objects

This property is read-only.

Action specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the action
signals.

You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.

Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models


Examples

Create Stochastic Reward Function and Predict Reward

Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.

env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a deep neural network. The network has three input channels: one for the current observations,
one for the current action, and one for the next observations. The two output channels return the
mean and standard deviation of the predicted reward.

statePath = featureInputLayer(obsInfo.Dimension(1),Name="obs");
actionPath = featureInputLayer(actInfo.Dimension(1),Name="action");
nextStatePath = featureInputLayer(obsInfo.Dimension(1),Name="nextObs");
commonPath = [concatenationLayer(1,3,Name="concat")
fullyConnectedLayer(32,Name="fc")
reluLayer(Name="relu1")
fullyConnectedLayer(32,Name="fc2")];

meanPath = [reluLayer(Name="rewardMeanRelu")
fullyConnectedLayer(1,Name="rewardMean")];
stdPath = [reluLayer(Name="rewardStdRelu")
fullyConnectedLayer(1,Name="rewardStdFc")
softplusLayer(Name="rewardStd")];

rwdNet = layerGraph(statePath);
rwdNet = addLayers(rwdNet,actionPath);
rwdNet = addLayers(rwdNet,nextStatePath);
rwdNet = addLayers(rwdNet,commonPath);
rwdNet = addLayers(rwdNet,meanPath);
rwdNet = addLayers(rwdNet,stdPath);

rwdNet = connectLayers(rwdNet,"nextObs","concat/in1");
rwdNet = connectLayers(rwdNet,"action","concat/in2");
rwdNet = connectLayers(rwdNet,"obs",'concat/in3');
rwdNet = connectLayers(rwdNet,"fc2","rewardMeanRelu");
rwdNet = connectLayers(rwdNet,"fc2","rewardStdRelu");

plot(rwdNet)


Create a dlnetwork object.

rwdNet = dlnetwork(rwdNet);

Create a stochastic reward function object.

rwdFncAppx = rlContinuousGaussianRewardFunction(...
rwdNet,obsInfo,actInfo,...
ObservationInputNames="obs",...
ActionInputNames="action", ...
NextObservationInputNames="nextObs", ...
RewardMeanOutputNames="rewardMean", ...
RewardStandardDeviationOutputNames="rewardStd");

Using this reward function object, you can predict the next reward value based on the current
observation, action, and next observation. For example, predict the reward for random inputs. The
reward value is sampled from a Gaussian distribution with the mean and standard deviation output by
the reward network.

obs = rand(obsInfo.Dimension);
act = rand(actInfo.Dimension);
nextObs = rand(obsInfo.Dimension(1),1);
predRwd = predict(rwdFncAppx,{obs},{act},{nextObs})

predRwd = single
-0.1308


You can obtain the mean value and standard deviation of the Gaussian distribution for the predicted
reward using evaluate.

predRwdDist = evaluate(rwdFncAppx,{obs,act,nextObs})

predRwdDist=1×2 cell array
    {[-0.0995]} {[0.6195]}

Version History
Introduced in R2022a

See Also
Objects
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlNeuralNetworkEnvironment |
rlIsDoneFunction | evaluate | gradient | accelerate

Topics
“Model-Based Policy Optimization Agents”


rlContinuousGaussianTransitionFunction
Stochastic Gaussian transition function approximator object for neural network-based environment

Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify stochastic transition function approximators using
rlContinuousGaussianTransitionFunction objects.

A transition function approximator object uses a deep neural network as an internal approximation
model to predict the next observations based on the current observations and actions.

To specify deterministic transition function approximators, use
rlContinuousDeterministicTransitionFunction objects.

Creation

Syntax
tsnFcnAppx = rlContinuousGaussianTransitionFunction(net,observationInfo,
actionInfo,Name=Value)

Description

tsnFcnAppx = rlContinuousGaussianTransitionFunction(net,observationInfo,
actionInfo,Name=Value) creates the stochastic transition function approximator object
tsnFcnAppx using the deep neural network net and sets the ObservationInfo and ActionInfo
properties.

When creating a stochastic transition function approximator, you must specify the names of the deep
neural network inputs and outputs using the ObservationInputNames, ActionInputNames,
NextObservationMeanOutputNames, and NextObservationStandardDeviationOutputNames
name-value pair arguments.

You can also specify the PredictDiff and UseDevice properties using optional name-value pair
arguments. For example, to use a GPU for prediction, specify UseDevice="gpu".

Input Arguments

net — Deep neural network


dlnetwork object

Deep neural network, specified as a dlnetwork object.

The input layer names for this network must match the input names specified using
ObservationInputNames and ActionInputNames. The dimensions of the input layers must match
the dimensions of the corresponding observation and action specifications in ObservationInfo and
ActionInfo, respectively.


The output layer names for this network must match the output names specified using
NextObservationMeanOutputNames and NextObservationStandardDeviationOutputNames. The
dimensions of the output layers must match the dimensions of the corresponding observation
specifications in ObservationInfo.
Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the


argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.
Example: ObservationInputNames="velocity"

ObservationInputNames — Observation input layer names


string | string array

Observation input layer names, specified as a string or string array.

The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.

ActionInputNames — Action input layer names


string | string array

Action input layer names, specified as a string or string array.

The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.

NextObservationMeanOutputNames — Next observation mean output layer names


string | string array

Next observation mean output layer names, specified as a string or string array.

The number of next observation mean output names must match the length of ObservationInfo
and the order of the names must match the order of the specifications in ObservationInfo.

NextObservationStandardDeviationOutputNames — Next observation standard deviation output layer names
string | string array

Next observation standard deviation output layer names, specified as a string or string array.

The number of next observation standard deviation output names must match the length of
ObservationInfo and the order of the names must match the order of the specifications in
ObservationInfo.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

This property is read-only.

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the
observation signals.


You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specifications


specification object

This property is read-only.

Action specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the action
signals.

You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.

PredictDiff — Option to predict the difference between the current observation and the
next observation
false (default) | true

Option to predict the difference between the current observation and the next observation, specified
as one of the following logical values.

• false — Select this option if net outputs the value of the next observation.
• true — Select this option if net outputs the difference between the next observation and the
current observation. In this case, the predict function computes the next observation by adding
the current observation to the output of net.

Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models

Examples


Create Stochastic Transition Function and Predict Next Observation

Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.

env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Define the layers for the deep neural network. The network has two input channels, one for the
current observations and one for the current actions. The output of the network is the predicted
Gaussian distribution for each next observation. The two output channels correspond to the means
and standard deviations of these distributions.

statePath = featureInputLayer(obsInfo.Dimension(1),Name="obs");
actionPath = featureInputLayer(actInfo.Dimension(1),Name="act");

commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(32,Name="fc")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(32,Name="fc2")];

meanPath = [reluLayer(Name="nextObsMeanRelu")
fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObsMean")];
stdPath = [reluLayer(Name="nextObsStdRelu")
fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObsStdReluFull")
softplusLayer(Name="nextObsStd")];

tsnNet = layerGraph(statePath);
tsnNet = addLayers(tsnNet,actionPath);
tsnNet = addLayers(tsnNet,commonPath);
tsnNet = addLayers(tsnNet,meanPath);
tsnNet = addLayers(tsnNet,stdPath);

tsnNet = connectLayers(tsnNet,"obs","concat/in1");
tsnNet = connectLayers(tsnNet,"act","concat/in2");
tsnNet = connectLayers(tsnNet,"fc2","nextObsMeanRelu");
tsnNet = connectLayers(tsnNet,"fc2","nextObsStdRelu");

plot(tsnNet)


Create a dlnetwork object.


tsnNet = dlnetwork(tsnNet);

Create a stochastic transition function object.


tsnFcnAppx = rlContinuousGaussianTransitionFunction(tsnNet,obsInfo,actInfo, ...
ObservationInputNames="obs",...
ActionInputNames="act",...
NextObservationMeanOutputNames="nextObsMean",...
NextObservationStandardDeviationOutputNames="nextObsStd");

Using this transition function object, you can predict the next observation based on the current
observation and action. For example, predict the next observation for a random observation and
action. The next observation values are sampled from Gaussian distributions with the means and
standard deviations output by the transition network.
observation = rand(obsInfo.Dimension);
action = rand(actInfo.Dimension);
nextObs = predict(tsnFcnAppx,{observation},{action})

nextObs = 1x1 cell array
    {4x1 single}

nextObs{1}

ans = 4x1 single column vector

1.2414
0.7307
-0.5588
-0.9567

You can also obtain the mean value and standard deviation of the Gaussian distribution of the
predicted next observation using evaluate.

nextObsDist = evaluate(tsnFcnAppx,{observation,action})

nextObsDist=1×2 cell array
    {4x1 single} {4x1 single}

Version History
Introduced in R2022a

See Also
Objects
rlContinuousDeterministicTransitionFunction | rlNeuralNetworkEnvironment

Topics
“Model-Based Policy Optimization Agents”


rlDDPGAgent
Deep deterministic policy gradient (DDPG) reinforcement learning agent

Description
The deep deterministic policy gradient (DDPG) algorithm is an actor-critic, model-free, online, off-
policy reinforcement learning method which computes an optimal policy that maximizes the long-
term reward. The action space can only be continuous.

For more information, see “Deep Deterministic Policy Gradient (DDPG) Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.

Creation
Syntax
agent = rlDDPGAgent(observationInfo,actionInfo)
agent = rlDDPGAgent(observationInfo,actionInfo,initOpts)

agent = rlDDPGAgent(actor,critic,agentOptions)

agent = rlDDPGAgent( ___ ,agentOptions)

Description
Create Agent from Observation and Action Specifications

agent = rlDDPGAgent(observationInfo,actionInfo) creates a deep deterministic policy


gradient agent for an environment with the given observation and action specifications, using default
initialization options. The actor and critic in the agent use default deep neural networks built from
the observation specification observationInfo and the action specification actionInfo. The
ObservationInfo and ActionInfo properties of agent are set to the observationInfo and
actionInfo input arguments, respectively.

agent = rlDDPGAgent(observationInfo,actionInfo,initOpts) creates a deep


deterministic policy gradient agent for an environment with the given observation and action
specifications. The agent uses default networks configured using options specified in the initOpts
object. For more information on the initialization options, see rlAgentInitializationOptions.
Create Agent from Actor and Critic

agent = rlDDPGAgent(actor,critic,agentOptions) creates a DDPG agent with the specified


actor and critic, using default DDPG agent options.
Specify Agent Options

agent = rlDDPGAgent( ___ ,agentOptions) creates a DDPG agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
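For instance, a minimal sketch of this last syntax follows, assuming obsInfo and actInfo have been obtained using getObservationInfo and getActionInfo, as in the examples below. The specific option values are illustrative only.

% Hedged sketch: create a DDPG agent with nondefault sample time and discount factor.
agentOpts = rlDDPGAgentOptions(SampleTime=0.05,DiscountFactor=0.99);
agent = rlDDPGAgent(obsInfo,actInfo,agentOpts);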


Input Arguments

initOpts — Agent initialization options


rlAgentInitializationOptions object

Agent initialization options, specified as an rlAgentInitializationOptions object.

actor — Actor
rlContinuousDeterministicActor object

Actor, specified as an rlContinuousDeterministicActor. For more information on creating


actors, see “Create Policies and Value Functions”.

critic — Critic
rlQValueFunction object

Critic, specified as an rlQValueFunction object. For more information on creating critics, see
“Create Policies and Value Functions”.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data type, and names of the observation
signals.

If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.

You can extract observationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specification


specification object

Action specifications, specified as a reinforcement learning specification object defining properties


such as dimensions, data type, and names of the action signals.

Since a DDPG agent operates in a continuous action space, you must specify actionInfo as an
rlNumericSpec object.

If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlNumericSpec.

AgentOptions — Agent options


rlDDPGAgentOptions object

Agent options, specified as an rlDDPGAgentOptions object.


If you create a DDPG agent with default actor and critic that use recurrent neural networks, the
default value of AgentOptions.SequenceLength is 32.

ExperienceBuffer — Experience buffer


rlReplayMemory object

Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer. Here:

• S is the current observation of the environment.


• A is the action taken by the agent.
• R is the reward for taking action A.
• S' is the next observation after taking action A.
• D is the is-done signal after taking action A.

UseExplorationPolicy — Option to use exploration policy


false (default) | true

Option to use exploration policy when selecting actions, specified as one of the following logical
values.

• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
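
For example, assuming a previously created DDPG agent named agent, you can switch to the exploration policy using dot notation.

agent.UseExplorationPolicy = true;   % getAction now adds exploration noise to the actions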

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object


Examples

Create DDPG Agent from Observation and Action Specifications

Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force, applied to the mass,
ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Obtain observation and action specifications.


obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)

Create a DDPG agent from the environment observation and action specifications.
agent = rlDDPGAgent(obsInfo,actInfo)

agent =
rlDDPGAgent with properties:

ExperienceBuffer: [1x1 rl.replay.rlReplayMemory]


AgentOptions: [1x1 rl.option.rlDDPGAgentOptions]
UseExplorationPolicy: 0
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlNumericSpec]
SampleTime: 1

To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension)})

ans = 1x1 cell array


{[0.0182]}

You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
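
For instance, the following minimal sketch extracts the default actor network and displays its layers.

actorNet = getModel(getActor(agent));
actorNet.Layers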

Create DDPG Agent Using Initialization Options

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a


50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.

env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");

Obtain the observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).

initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);

The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.

rng(0)

Create a DDPG agent from the environment observation and action specifications.

agent = rlDDPGAgent(obsInfo,actInfo,initOpts);

Extract the deep neural networks from both the agent actor and critic.

actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));

Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons.

criticNet.Layers

ans =
13x1 Layer array with layers:

1 'concat' Concatenation Concatenation of 3 inputs along dimension 1


2 'relu_body' ReLU ReLU
3 'fc_body' Fully Connected 128 fully connected layer
4 'body_output' ReLU ReLU
5 'input_1' Image Input 50x50x1 images
6 'conv_1' 2-D Convolution 64 3x3x1 convolutions with stride [1 1] and padding
7 'relu_input_1' ReLU ReLU
8 'fc_1' Fully Connected 128 fully connected layer
9 'input_2' Feature Input 1 features
10 'fc_2' Fully Connected 128 fully connected layer
11 'input_3' Feature Input 1 features
12 'fc_3' Fully Connected 128 fully connected layer
13 'output' Fully Connected 1 fully connected layer

Plot the actor and critic networks.

plot(layerGraph(actorNet))


plot(layerGraph(criticNet))


To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[-0.0364]}

You can now test and train the agent within the environment.

Create DDPG Agent from Actor and Critic

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force ranging continuously from
-2 to 2 Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Obtain the environment observation and action specification objects.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);


The actor and critic networks are initialized randomly. Ensure reproducibility by fixing the seed of the
random generator.

rng(0)

For DDPG agents, the critic estimates a Q-value function, therefore it must take both the observation
and action signals as inputs and return a scalar value.

To approximate the Q-value function within the critic, use a deep neural network. Define each
network path as an array of layer objects. Get the dimensions of the observation and action spaces
from the environment specification objects, and specify a name for the input layers, so you can later
explicitly associate them with the appropriate environment channel.

% Define observation and action paths


obsPath = featureInputLayer(prod(obsInfo.Dimension),Name="netOin");
actPath = featureInputLayer(prod(actInfo.Dimension),Name="netAin");

% Define common path: concatenate along first dimension


commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(1)];

% Add paths to layerGraph network


cNet = layerGraph(obsPath);
cNet = addLayers(cNet, actPath);
cNet = addLayers(cNet, commonPath);

% Connect paths
cNet = connectLayers(cNet,"netOin","concat/in1");
cNet = connectLayers(cNet,"netAin","concat/in2");

% Plot the network


plot(cNet)


% Convert to dlnetwork object


cNet = dlnetwork(cNet);

% Display the number of weights


summary(cNet)

Initialized: true

Number of learnables: 251

Inputs:
1 'netOin' 2 features
2 'netAin' 1 features

Create the critic using cNet, and the names of the input layers. DDPG agents use an
rlQValueFunction object to implement the critic.

critic = rlQValueFunction(cNet,obsInfo,actInfo,...
ObservationInputNames="netOin", ...
ActionInputNames="netAin");

Check the critic with random observation and action inputs.

getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
-0.4260


To approximate the policy within the actor, use a neural network. For DDPG agents, the actor
executes a deterministic policy, which is implemented by a continuous deterministic actor. In this case
the network must take the observation signal as input and return an action. Therefore the output
layer must have as many elements as the dimension of the action space.
% create a network to be used as underlying actor approximator
aNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(numel(actInfo))];

% Convert to dlnetwork object


aNet = dlnetwork(aNet);

% Display the number of weights


summary(aNet)

Initialized: true

Number of learnables: 201

Inputs:
1 'input' 2 features

Create the actor using aNet. DDPG agents use an rlContinuousDeterministicActor object to
implement the actor.
actor = rlContinuousDeterministicActor(aNet,obsInfo,actInfo);

Check the actor with a random observation input.


getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-0.2327]}

Create the DDPG agent using the actor and critic.


agent = rlDDPGAgent(actor,critic);

Specify agent options, including training options for the actor and critic.
agent.AgentOptions.SampleTime=env.Ts;
agent.AgentOptions.TargetSmoothFactor=1e-3;
agent.AgentOptions.ExperienceBufferLength=1e6;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.MiniBatchSize=32;

agent.AgentOptions.CriticOptimizerOptions.LearnRate=5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;

agent.AgentOptions.ActorOptimizerOptions.LearnRate=1e-4;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold=1;

Check the agent with a random observation input.


getAction(agent,{rand(obsInfo.Dimension)})


ans = 1x1 cell array


{[-0.1666]}

You can now train the agent within the environment.

Create DDPG Agent with Recurrent Neural Networks

For this example, load the environment used in the example “Train DDPG Agent to Control Double
Integrator System”. The observation from the environment is a vector containing the position and
velocity of a mass. The action is a scalar representing a force ranging continuously from -2 to 2
Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Get the observation and action specification objects.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

For DDPG agents, the critic estimates a Q-value function, therefore it must take both the observation
and action signals as inputs and return a scalar value.

To approximate the Q-value function within the critic, use a recurrent neural network. Define each
network path as an array of layer objects. Get the dimensions of the observation spaces from the
environment specification object, and specify a name for the input layers, so you can later explicitly
associate them with the correct environment channel. To create a recurrent neural network, use
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.

% Define observation and action paths


obsPath = sequenceInputLayer(prod(obsInfo.Dimension),Name="netOin");
actPath = sequenceInputLayer(prod(actInfo.Dimension),Name="netAin");

% Define common path: concatenate along first dimension


commonPath = [concatenationLayer(1,2,Name="cat")
lstmLayer(50)
reluLayer
fullyConnectedLayer(1)];

% Add paths to layerGraph network


cNet = layerGraph(obsPath);
cNet = addLayers(cNet, actPath);
cNet = addLayers(cNet, commonPath);

% Connect paths
cNet = connectLayers(cNet,"netOin","cat/in1");
cNet = connectLayers(cNet,"netAin","cat/in2");

% Plot the network


plot(cNet)


% Convert to dlnetwork object


cNet = dlnetwork(cNet);

% Display the number of weights


summary(cNet)

Initialized: true

Number of learnables: 10.8k

Inputs:
1 'netOin' Sequence input with 2 dimensions
2 'netAin' Sequence input with 1 dimensions

Create the critic using cNet, specifying the names of the input layers. DDPG agents use an
rlQValueFunction object to implement the critic.
critic = rlQValueFunction(cNet,obsInfo,actInfo,...
ObservationInputNames="netOin",ActionInputNames="netAin");

Check the critic with random observation and action inputs.


getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
-0.0074

Since the critic has a recurrent network, the actor must have a recurrent network too. For DDPG
agents, the actor executes a deterministic policy, which is implemented by a continuous deterministic


actor. In this case the network must take the observation signal as input and return an action.
Therefore the output layer must have as many elements as the dimension of the action space.

Define the network as an array of layer objects, and get the dimensions of the observation and action
spaces from the environment specification objects.
aNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(10)
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension)) ];

Convert to dlnetwork and display the number of weights.


aNet = dlnetwork(aNet);
summary(aNet)

Initialized: true

Number of learnables: 531

Inputs:
1 'sequenceinput' Sequence input with 2 dimensions

Create the actor using aNet. DDPG agents use an rlContinuousDeterministicActor object to
implement the actor.
actor = rlContinuousDeterministicActor(aNet,obsInfo,actInfo);

Check the actor with random observation input.


getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0.0246]}

Specify some training options for the critic.


criticOpts = rlOptimizerOptions( ...
'LearnRate',5e-3,'GradientThreshold',1);

Specify some training options for the actor.


actorOpts = rlOptimizerOptions( ...
'LearnRate',1e-4,'GradientThreshold',1);

Specify agent options. To use a DDPG agent with recurrent neural networks, you must specify a
SequenceLength greater than 1.
agentOpts = rlDDPGAgentOptions(...
'SampleTime',env.Ts,...
'TargetSmoothFactor',1e-3,...
'ExperienceBufferLength',1e6,...
'DiscountFactor',0.99,...
'SequenceLength',20,...
'MiniBatchSize',32, ...
'CriticOptimizerOptions',criticOpts, ...
'ActorOptimizerOptions',actorOpts);


Create the DDPG agent using the actor and critic.

agent = rlDDPGAgent(actor,critic,agentOpts);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0.0158]}

You can now test and train the agent within the environment.

Version History
Introduced in R2019a

See Also
rlAgentInitializationOptions | rlDDPGAgentOptions | rlQValueFunction |
rlContinuousDeterministicActor | Deep Network Designer

Topics
“Deep Deterministic Policy Gradient (DDPG) Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”


rlDDPGAgentOptions
Options for DDPG agent

Description
Use an rlDDPGAgentOptions object to specify options for deep deterministic policy gradient
(DDPG) agents. To create a DDPG agent, use rlDDPGAgent.

For more information, see “Deep Deterministic Policy Gradient (DDPG) Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
opt = rlDDPGAgentOptions
opt = rlDDPGAgentOptions(Name,Value)

Description

opt = rlDDPGAgentOptions creates an options object for use as an argument when creating a
DDPG agent using all default options. You can modify the object properties using dot notation.

opt = rlDDPGAgentOptions(Name,Value) sets option properties on page 3-98 using name-


value pairs. For example, rlDDPGAgentOptions('DiscountFactor',0.95) creates an option set
with a discount factor of 0.95. You can specify multiple name-value pairs. Enclose each property
name in quotes.
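
For example, the following call (with illustrative values) sets both the discount factor and the mini-batch size at creation.

opt = rlDDPGAgentOptions('DiscountFactor',0.95,'MiniBatchSize',128);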

Properties
NoiseOptions — Noise model options
OrnsteinUhlenbeckActionNoise object

Noise model options, specified as an OrnsteinUhlenbeckActionNoise object. For more


information on the noise model, see “Noise Model” on page 3-101.

For an agent with multiple actions, if the actions have different ranges and units, it is likely that each
action requires different noise model parameters. If the actions have similar ranges and units, you
can set the noise parameters for all actions to the same value.

For example, for an agent with two actions, set the standard deviation of each action to a different
value while using the same decay rate for both standard deviations.

opt = rlDDPGAgentOptions;
opt.NoiseOptions.StandardDeviation = [0.1 0.2];
opt.NoiseOptions.StandardDeviationDecayRate = 1e-4;


ActorOptimizerOptions — Actor optimizer options


rlOptimizerOptions object

Actor optimizer options, specified as an rlOptimizerOptions object. It allows you to specify
training parameters of the actor approximator, such as the learning rate and gradient threshold, as
well as the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions
and rlOptimizer.

CriticOptimizerOptions — Critic optimizer options


rlOptimizerOptions object

Critic optimizer options, specified as an rlOptimizerOptions object. It allows you to specify
training parameters of the critic approximator, such as the learning rate and gradient threshold, as
well as the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions
and rlOptimizer.

TargetSmoothFactor — Smoothing factor for target actor and critic updates


1e-3 (default) | positive scalar less than or equal to 1

Smoothing factor for target actor and critic updates, specified as a positive scalar less than or equal
to 1. For more information, see “Target Update Methods”.

TargetUpdateFrequency — Number of steps between target actor and critic updates


1 (default) | positive integer

Number of steps between target actor and critic updates, specified as a positive integer. For more
information, see “Target Update Methods”.
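
For example, the following sketch (with illustrative values) configures a periodic smoothing update, in which the targets are updated every 4 agent steps with a smoothing factor of 5e-3.

opt = rlDDPGAgentOptions;
opt.TargetSmoothFactor = 5e-3;
opt.TargetUpdateFrequency = 4;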

ResetExperienceBufferBeforeTraining — Option for clearing the experience buffer


true (default) | false

Option for clearing the experience buffer before training, specified as a logical value.

SequenceLength — Maximum batch-training trajectory length when using RNN


1 (default) | positive integer

Maximum batch-training trajectory length when using a recurrent neural network, specified as a
positive integer. This value must be greater than 1 when using a recurrent neural network and 1
otherwise.

MiniBatchSize — Size of random experience mini-batch


64 (default) | positive integer

Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the critic properties. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.

NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer

Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see [1], Chapter 7.


Note that if parallel training is enabled (that is, if you pass to train an rlTrainingOptions object
in which the UseParallel property is set to true), then NumStepsToLookAhead must be set to 1;
otherwise an error is generated. This guarantees that experiences are stored contiguously.

ExperienceBufferLength — Experience buffer size


10000 (default) | positive integer

Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.

SampleTime — Sample time of agent


1 (default) | positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

DiscountFactor — Discount factor


0.99 (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.

Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent

Examples

Create DDPG Agent Options Object

This example shows how to create a DDPG agent option object.

Create an rlDDPGAgentOptions object that specifies the mini-batch size.

opt = rlDDPGAgentOptions('MiniBatchSize',48)

opt =
rlDDPGAgentOptions with properties:

NoiseOptions: [1x1 rl.option.OrnsteinUhlenbeckActionNoise]


ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
TargetSmoothFactor: 1.0000e-03


TargetUpdateFrequency: 1
ResetExperienceBufferBeforeTraining: 1
SequenceLength: 1
MiniBatchSize: 48
NumStepsToLookAhead: 1
ExperienceBufferLength: 10000
SampleTime: 1
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent sample time to 0.5.

opt.SampleTime = 0.5;

Algorithms
Noise Model

DDPG agents use an Ornstein-Uhlenbeck action noise model for exploration.


Ornstein-Uhlenbeck Action Noise

An OrnsteinUhlenbeckActionNoise object has the following numeric value properties.

Property                     Description                                                   Default Value
InitialAction                Initial value of action                                       0
Mean                         Noise mean value                                              0
MeanAttractionConstant       Constant specifying how quickly the noise model               0.15
                             output is attracted to the mean
StandardDeviationDecayRate   Decay rate of the standard deviation                          0
StandardDeviation            Initial value of noise standard deviation                     0.3
StandardDeviationMin         Minimum standard deviation                                    0

At each sample time step k, the noise value v(k) is updated using the following formula, where Ts is
the agent sample time, and the initial value v(1) is defined by the InitialAction parameter.

v(k+1) = v(k) + MeanAttractionConstant.*(Mean - v(k)).*Ts
       + StandardDeviation(k).*randn(size(Mean)).*sqrt(Ts)
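
As an illustration only (not code used by the agent), the following minimal sketch simulates this update for a scalar noise signal, using the default noise parameter values and a hypothetical sample time.

Ts = 0.05;                          % hypothetical agent sample time
Mean = 0;
MeanAttractionConstant = 0.15;
StandardDeviation = 0.3;            % default decay rate is 0, so it stays constant
v = zeros(1,100);                   % v(1) equals InitialAction (default 0)
for k = 1:99
    v(k+1) = v(k) + MeanAttractionConstant*(Mean - v(k))*Ts ...
        + StandardDeviation*randn*sqrt(Ts);
end
plot(v)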

At each sample time step, the standard deviation decays as shown in the following code.

decayedStandardDeviation = StandardDeviation(k).*(1 - StandardDeviationDecayRate);


StandardDeviation(k+1) = max(decayedStandardDeviation,StandardDeviationMin);

You can calculate how many samples it will take for the standard deviation to be halved using this
simple formula.

halflife = log(0.5)/log(1-StandardDeviationDecayRate);
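
For example, a decay rate of 1e-4 halves the standard deviation after roughly 6931 samples.

halflife = log(0.5)/log(1-1e-4)   % approximately 6.93e3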


For continuous action signals, it is important to set the noise standard deviation appropriately to
encourage exploration. It is common to set StandardDeviation*sqrt(Ts) to a value between 1%
and 10% of your action range.

If your agent converges on local optima too quickly, promote agent exploration by increasing the
amount of noise; that is, by increasing the standard deviation. Also, to increase exploration, you can
reduce the StandardDeviationDecayRate.
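
For example, the following sketch (with hypothetical values) sets the standard deviation to about 5% of a [-2,2] action range for an agent with a 0.05-second sample time.

Ts = 0.05;
actionRange = 4;                          % upper limit minus lower limit
sigma = 0.05*actionRange/sqrt(Ts);        % so that sigma*sqrt(Ts) is 5% of the range
opt = rlDDPGAgentOptions('SampleTime',Ts);
opt.NoiseOptions.StandardDeviation = sigma;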

Version History
Introduced in R2019a

Property names defining noise probability distribution in the


OrnsteinUhlenbeckActionNoise object have changed
Behavior changed in R2021a

The properties defining the probability distribution of the Ornstein-Uhlenbeck (OU) noise model have
been renamed. DDPG agents use OU noise for exploration.

• The Variance property has been renamed StandardDeviation.


• The VarianceDecayRate property has been renamed StandardDeviationDecayRate.
• The VarianceMin property has been renamed StandardDeviationMin.

The default values of these properties remain the same. When an


OrnsteinUhlenbeckActionNoise noise object saved from a previous MATLAB release is loaded,
the values of Variance, VarianceDecayRate, and VarianceMin are copied in the
StandardDeviation, StandardDeviationDecayRate, and StandardDeviationMin,
respectively.

The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the OU noise model, use the new property
names instead.
Update Code

This table shows how to update your code to use the new property names for rlDDPGAgentOptions
object ddpgopt.

Not Recommended:
ddpgopt.NoiseOptions.Variance = 0.5;
ddpgopt.NoiseOptions.VarianceDecayRate = 0.1;
ddpgopt.NoiseOptions.VarianceMin = 0;

Recommended:
ddpgopt.NoiseOptions.StandardDeviation = 0.5;
ddpgopt.NoiseOptions.StandardDeviationDecayRate = 0.1;
ddpgopt.NoiseOptions.StandardDeviationMin = 0;

Target update method settings for DDPG agents have changed


Behavior changed in R2020a

Target update method settings for DDPG agents have changed. The following changes require
updates to your code:


• The TargetUpdateMethod option has been removed. Now, DDPG agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.

To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.

Update Method                               TargetUpdateFrequency   TargetSmoothFactor
Smoothing                                   1                       Less than 1
Periodic                                    Greater than 1          1
Periodic smoothing (new method in R2020a)   Greater than 1          Less than 1

The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
Update Code

This table shows some typical uses of rlDDPGAgentOptions and how to update your code to use the
new option configuration.

Not Recommended:
opt = rlDDPGAgentOptions('TargetUpdateMethod',"smoothing");
Recommended:
opt = rlDDPGAgentOptions;

Not Recommended:
opt = rlDDPGAgentOptions('TargetUpdateMethod',"periodic");
Recommended:
opt = rlDDPGAgentOptions;
opt.TargetUpdateFrequency = 4;
opt.TargetSmoothFactor = 1;

Not Recommended:
opt = rlDDPGAgentOptions;
opt.TargetUpdateMethod = "periodic";
opt.TargetUpdateFrequency = 5;
Recommended:
opt = rlDDPGAgentOptions;
opt.TargetUpdateFrequency = 5;
opt.TargetSmoothFactor = 1;

References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.

See Also
Topics
“Deep Deterministic Policy Gradient (DDPG) Agents”


rlDeterministicActorPolicy
Policy object to generate continuous deterministic actions for custom training loops and application
deployment

Description
This object implements a deterministic policy, which returns continuous deterministic actions given
an input observation. You can create an rlDeterministicActorPolicy object from an
rlContinuousDeterministicActor or extract it from an rlDDPGAgent or rlTD3Agent. You can
then train the policy object using a custom training loop or deploy it for your application using
generatePolicyBlock or generatePolicyFunction. This policy is always deterministic and
does not perform any exploration. For more information on policies and value functions, see “Create
Policies and Value Functions”.
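
For example, the following minimal sketch (assuming an existing DDPG agent named agent and its observation specification obsInfo) builds the policy object from the agent actor and evaluates it.

policy = rlDeterministicActorPolicy(getActor(agent));
act = getAction(policy,{rand(obsInfo.Dimension)});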

Creation

Syntax
policy = rlDeterministicActorPolicy(actor)

Description

policy = rlDeterministicActorPolicy(actor) creates the deterministic actor policy object


policy from the continuous deterministic actor actor. It also sets the Actor property of policy to
the input argument actor.

Properties
Actor — Continuous deterministic actor
rlContinuousDeterministicActor object

Continuous deterministic actor, specified as an rlContinuousDeterministicActor object.

ObservationInfo — Observation specifications


rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


of such objects. These objects define properties such as the dimensions, data types, and names of the
observation signals.

ActionInfo — Action specifications


rlNumericSpec object

Action specifications, specified as an rlNumericSpec object. This object defines the properties of the
environment action channel, such as its dimensions, data type, and name. Note that the name of the
action channel specified in actionInfo (if any) is not used.


Note Only one action channel is allowed.

SampleTime — Sample time of policy


positive scalar | -1 (default)

Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.

Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2

Object Functions
generatePolicyBlock Generate Simulink block that evaluates policy of an agent or policy object
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object

Examples

Create Deterministic Actor Policy from Continuous Deterministic Actor

Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.

obsInfo = rlNumericSpec([4 1]);


actInfo = rlNumericSpec([2 1]);

Alternatively, use getObservationInfo and getActionInfo to extract the specification objects


from an environment

Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.

To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation and action spaces from
the environment specification objects.

layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(16)


reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];

Convert the network to a dlnetwork object and display the number of weights.

model = dlnetwork(layers);
summary(model)

Initialized: true

Number of learnables: 114

Inputs:
1 'input' 4 features

Create the actor using model, and the observation and action specifications.

actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)

actor =
rlContinuousDeterministicActor with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlNumericSpec]
UseDevice: "cpu"

Check the actor with a random observation input.

act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}

ans = 2x1 single column vector

0.4013
0.0578

Create a policy object from actor.

policy = rlDeterministicActorPolicy(actor)

policy =
rlDeterministicActorPolicy with properties:

Actor: [1x1 rl.function.rlContinuousDeterministicActor]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlNumericSpec]
SampleTime: -1

Check the policy with a random observation input.

act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}

ans = 2×1


0.4313
-0.3002

You can now train the policy with a custom training loop and then deploy it to your application.
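
For example, a minimal deployment sketch generates a standalone policy evaluation function from the policy object.

generatePolicyFunction(policy);   % creates a policy evaluation function and its data file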

Version History
Introduced in R2022a

See Also
Functions
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlAdditiveNoisePolicy |
rlStochasticActorPolicy | rlTD3Agent | rlDDPGAgent | generatePolicyBlock |
generatePolicyFunction

Blocks
RL Agent

Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”


rlDeterministicActorRepresentation
(Not recommended) Deterministic actor representation for reinforcement learning agents

Note rlDeterministicActorRepresentation is not recommended. Use


rlContinuousDeterministicActor instead. For more information, see
“rlDeterministicActorRepresentation is not recommended”.

Description
This object implements a function approximator to be used as a deterministic actor within a
reinforcement learning agent with a continuous action space. A deterministic actor takes
observations as inputs and returns as outputs the action that maximizes the expected cumulative
long-term reward, thereby implementing a deterministic policy. After you create an
rlDeterministicActorRepresentation object, use it to create a suitable agent, such as an
rlDDPGAgent agent. For more information on creating representations, see “Create Policies and
Value Functions”.

Creation
Syntax
actor = rlDeterministicActorRepresentation(net,observationInfo,
actionInfo,'Observation',obsName,'Action',actName)
actor = rlDeterministicActorRepresentation({basisFcn,W0},observationInfo,
actionInfo)
actor = rlDeterministicActorRepresentation( ___ ,options)

Description

actor = rlDeterministicActorRepresentation(net,observationInfo,
actionInfo,'Observation',obsName,'Action',actName) creates a deterministic actor using
the deep neural network net as approximator. This syntax sets the ObservationInfo and ActionInfo
properties of actor to the inputs observationInfo and actionInfo, containing the specifications
for observations and actions, respectively. actionInfo must specify a continuous action space,
discrete action spaces are not supported. obsName must contain the names of the input layers of net
that are associated with the observation specifications. The action names actName must be the
names of the output layers of net that are associated with the action specifications.

actor = rlDeterministicActorRepresentation({basisFcn,W0},observationInfo,
actionInfo) creates a deterministic actor using a custom basis function as underlying
approximator. The first input argument is a two-element cell array in which the first element contains the
handle basisFcn to a custom basis function, and the second element contains the initial weight
matrix W0. This syntax sets the ObservationInfo and ActionInfo properties of actor respectively to
the inputs observationInfo and actionInfo.

actor = rlDeterministicActorRepresentation( ___ ,options) creates a deterministic


actor using the additional options set options, which is an rlRepresentationOptions object.


This syntax sets the Options property of actor to the options input argument. You can use this
syntax with any of the previous input-argument combinations.

Input Arguments

net — Deep neural network


array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object |
dlNetwork object

Deep neural network used as the underlying approximator within the actor, specified as one of the
following:

• Array of Layer objects


• layerGraph object
• DAGNetwork object
• SeriesNetwork object
• dlnetwork object

The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names listed in obsName.

The network output layer must have the same data type and dimension as the signal defined in
ActionInfo. Its name must be the action name specified in actName.

rlDeterministicActorRepresentation objects support recurrent deep neural networks.

For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.

obsName — Observation names


string | character vector | cell array of character vectors

Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the input layers in net.
Example: {'my_obs'}

actName — Action name


string | character vector | single-element cell array containing a character vector

Action name, specified as a single-element cell array that contains a character vector. It must be the
name of the output layer of net.
Example: {'my_act'}

basisFcn — Custom basis function


function handle

Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The action
to be taken based on the current observation, which is the output of the actor, is the vector a =
W'*B, where W is a weight matrix containing the learnable parameters and B is the column vector
returned by the custom basis function.


When creating a deterministic actor representation, your basis function must have the following
signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here obs1 to obsN are observations in the same order and with the same data type and dimensions
as the signals defined in observationInfo.
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]

W0 — Initial value of the basis function weights


column vector

Initial value of the basis function weights, W, specified as a matrix having as many rows as the length
of the vector returned by the basis function and as many columns as the dimension of the action
space.

Properties
Options — Representation options
rlRepresentationOptions object

Representation options, specified as an rlRepresentationOptions object. Available options


include the optimizer used for training and the learning rate.

ObservationInfo — Observation specifications


rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


of such objects. These objects define properties such as the dimensions, data types, and names of the
observation signals.

rlDeterministicActorRepresentation sets the ObservationInfo property of actor to the


input observationInfo.

You can extract ObservationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually.

ActionInfo — Action specifications


rlNumericSpec object

Action specifications for a continuous action space, specified as an rlNumericSpec object defining
properties such as dimensions, data type and name of the action signals. The deterministic actor
representation does not support discrete actions.

rlDeterministicActorRepresentation sets the ActionInfo property of actor to the input
actionInfo.

You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually.

For custom basis function representations, the action signal must be a scalar, a column vector, or a
discrete action.


Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment observations

Examples

Create Deterministic Actor from Deep Neural Network

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
actInfo = rlNumericSpec([2 1]);

Create a deep neural network approximator for the actor. The input of the network (here called
myobs) must accept a four-element vector (the observation vector just defined by obsInfo), and its
output must be the action (here called myact) and be a two-element vector, as defined by actInfo.
net = [featureInputLayer(4,'Normalization','none','Name','myobs')
fullyConnectedLayer(2,'Name','myact')];

Create the actor with rlDeterministicActorRepresentation, using the network, the observation
and action specification objects, as well as the names of the network input and output layers.
actor = rlDeterministicActorRepresentation(net,obsInfo,actInfo, ...
'Observation',{'myobs'},'Action',{'myact'})

actor =
rlDeterministicActorRepresentation with properties:

ActionInfo: [1x1 rl.util.rlNumericSpec]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To check your actor, use getAction to return the action from a random observation, using the
current network weights.
act = getAction(actor,{rand(4,1)}); act{1}

ans = 2x1 single column vector

-0.5054
1.5390

You can now use the actor to create a suitable agent (such as an rlACAgent, rlPGAgent, or
rlDDPGAgent agent).


Create Deterministic Actor from Custom Basis Function

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles.

obsInfo = rlNumericSpec([3 1]);

The deterministic actor does not support discrete action spaces. Therefore, create a continuous
action space specification object (or alternatively use getActionInfo to extract the specification
object from an environment). For this example, define the action space as a continuous two-
dimensional space, so that a single action is a column vector containing two doubles.

actInfo = rlNumericSpec([2 1]);

Create a custom basis function. Each element is a function of the observations defined by obsInfo.

myBasisFcn = @(myobs) [myobs(2)^2; myobs(1); 2*myobs(2)+myobs(1); -myobs(3)]

myBasisFcn = function_handle with value:


@(myobs)[myobs(2)^2;myobs(1);2*myobs(2)+myobs(1);-myobs(3)]

The output of the actor is the vector W'*myBasisFcn(myobs), which is the action taken as a result
of the given observation. The weight matrix W contains the learnable parameters and must have as
many rows as the length of the basis function output and as many columns as the dimension of the
action space.

Define an initial parameter matrix.

W0 = rand(4,2);

Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight matrix. The second and third arguments are, respectively, the
observation and action specification objects.

actor = rlDeterministicActorRepresentation({myBasisFcn,W0},obsInfo,actInfo)

actor =
rlDeterministicActorRepresentation with properties:

ActionInfo: [1x1 rl.util.rlNumericSpec]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To check your actor, use the getAction function to return the action from a given observation, using
the current parameter matrix.

a = getAction(actor,{[1 2 3]'});
a{1}

ans =
2x1 dlarray


2.0595
2.3788

You can now use the actor (along with a critic) to create a suitable continuous action space agent.

Create Deterministic Actor from Recurrent Neural Network

Create observation and action information. You can also obtain these specifications from an
environment.

obsinfo = rlNumericSpec([4 1]);


actinfo = rlNumericSpec([2 1]);
numObs = obsinfo.Dimension(1);
numAct = actinfo.Dimension(1);

Create a recurrent deep neural network for the actor. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.

net = [sequenceInputLayer(numObs,'Normalization','none','Name','state')
fullyConnectedLayer(10,'Name','fc1')
reluLayer('Name','relu1')
lstmLayer(8,'OutputMode','sequence','Name','ActorLSTM')
fullyConnectedLayer(20,'Name','CriticStateFC2')
fullyConnectedLayer(numAct,'Name','action')
tanhLayer('Name','tanh1')];

Create a deterministic actor representation for the network.

actorOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(net,obsinfo,actinfo,...
'Observation',{'state'},'Action',{'tanh1'});

Version History
Introduced in R2020a

rlDeterministicActorRepresentation is not recommended


Not recommended starting in R2022a

rlDeterministicActorRepresentation is not recommended. Use


rlContinuousDeterministicActor instead.

The following table shows some typical uses of rlDeterministicActorRepresentation, and how
to update your code with rlContinuousDeterministicActor instead. The first table entry uses a
neural network, the second one uses a basis function.


rlDeterministicActorRepresentation (not recommended):
myActor = rlDeterministicActorRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames),
with actInfo defining a continuous action space and net having observations as inputs and a single
output layer with as many elements as the number of dimensions of the continuous action space.

rlContinuousDeterministicActor (recommended):
myActor = rlContinuousDeterministicActor(net,obsInfo,actInfo,'ObservationInputNames',obsNames).
Use this syntax to create a deterministic actor object with a continuous action space.

rlDeterministicActorRepresentation (not recommended):
rep = rlDeterministicActorRepresentation({basisFcn,W0},obsInfo,actInfo), where the basis function
has observations as inputs and actions as outputs, W0 is a matrix with as many columns as the number
of possible actions, and actInfo defines a continuous action space.

rlContinuousDeterministicActor (recommended):
rep = rlContinuousDeterministicActor({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a
deterministic actor object with a continuous action space.

See Also
Functions
rlContinuousDeterministicActor | rlRepresentationOptions | getActionInfo |
getObservationInfo

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlDiscreteCategoricalActor
Stochastic categorical actor with a discrete action space for reinforcement learning agents

Description
This object implements a function approximator to be used as a stochastic actor within a
reinforcement learning agent with a discrete action space. A discrete categorical actor takes the
environment observations as input and returns as output a random action sampled from a categorical
(also known as Multinoulli) probability distribution over the possible discrete actions, thereby
implementing a stochastic policy. After you create an rlDiscreteCategoricalActor object, use it
to create a suitable agent, such as rlACAgent or rlPGAgent. For more information on creating
representations, see “Create Policies and Value Functions”.

Creation
Syntax
actor = rlDiscreteCategoricalActor(net,observationInfo,actionInfo)
actor = rlDiscreteCategoricalActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames)

actor = rlDiscreteCategoricalActor({basisFcn,W0},observationInfo,actionInfo)

actor = rlDiscreteCategoricalActor( ___ ,UseDevice=useDevice)

Description

actor = rlDiscreteCategoricalActor(net,observationInfo,actionInfo) creates a


stochastic actor with a discrete action space, using the deep neural network net as function
approximator. For this actor, actionInfo must specify a discrete action space. The network input
layers are automatically associated with the environment observation channels according to the
dimension specifications in observationInfo. The network must have a single output layer with as
many elements as the number of possible discrete actions, as specified in actionInfo. This function
sets the ObservationInfo and ActionInfo properties of actor to the inputs observationInfo
and actionInfo, respectively.

Note actor does not enforce constraints set by the action specification; therefore, when using this
actor, you must enforce action space constraints within the environment.

actor = rlDiscreteCategoricalActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames) specifies the names of the network input
layers to be associated with the environment observation channels. The function assigns, in
sequential order, each environment observation channel specified in observationInfo to the layer
specified by the corresponding name in the string array netObsNames. Therefore, the network input
layers, ordered as the names in netObsNames, must have the same data type and dimensions as the
observation specifications, as ordered in observationInfo.


actor = rlDiscreteCategoricalActor({basisFcn,W0},observationInfo,actionInfo)
creates a discrete space stochastic actor using a custom basis function as underlying approximator.
The first input argument is a two-element cell array whose first element is the handle basisFcn to a
custom basis function and whose second element is the initial weight matrix W0. This function sets
the ObservationInfo and ActionInfo properties of actor to the inputs observationInfo and
actionInfo, respectively.

actor = rlDiscreteCategoricalActor( ___ ,UseDevice=useDevice) specifies the device


used to perform computational operations on the actor object, and sets the UseDevice property of
actor to the useDevice input argument. You can use this syntax with any of the previous input-
argument combinations.

Input Arguments

net — Deep neural network


array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object |
dlNetwork object (preferred)

Deep neural network used as the underlying approximator within the actor, specified as one of the
following:

• Array of Layer objects


• layerGraph object
• DAGNetwork object
• SeriesNetwork object
• dlnetwork object

Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.

The network must have the environment observation channels as inputs and a single output layer
with as many elements as the number of possible discrete actions. Since the output of the network
must represent the probability of executing each possible action, the software automatically adds a
softmaxLayer as a final output layer if you do not specify it explicitly. When computing the action,
the actor then randomly samples the distribution to return an action.

rlDiscreteCategoricalActor objects support recurrent deep neural networks. For an example,


see “Create Discrete Categorical Actor from Deep Recurrent Neural Network” on page 3-122.

The learnable parameters of the actor are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.


netObsNames — Network input layers names corresponding to the environment observation


channels
string array | cell array of character vectors

Network input layers names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. When you use this argument after
'ObservationInputNames', the function assigns, in sequential order, each environment
observation channel specified in observationInfo to each network input layer specified by the
corresponding name in the string array netObsNames. Therefore, the network input layers, ordered
as the names in netObsNames, must have the same data type and dimensions as the observation
specifications, as ordered in observationInfo.

Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.

Example: {"NetInput1_airspeed","NetInput2_altitude"}
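
For example, the following hypothetical call associates two named network input layers with the two observation channels defined in obsInfo.

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo, ...
    ObservationInputNames=["NetInput1_airspeed","NetInput2_altitude"]);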

basisFcn — Custom basis function


function handle

Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The number
of the action to be taken based on the current observation, which is the output of the actor, is
randomly sampled from a categorical distribution with probabilities p = softmax(W'*B), where W
is a weight matrix containing the learnable parameters and B is the column vector returned by the
custom basis function. Each element of p represents the probability of executing the corresponding
action from the observed state.

Your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo.
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]

W0 — Initial value of the basis function weights


matrix

Initial value of the basis function weights W, specified as a matrix having as many rows as the length
of the vector returned by the basis function and as many columns as the dimension of the action
space.
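
For example, the following minimal sketch (with hypothetical observation and action spaces) creates an actor from a custom basis function and checks it with a random observation.

obsInfo = rlNumericSpec([3 1]);
actInfo = rlFiniteSetSpec([-1 1]);
myBasisFcn = @(obs) [obs(1); obs(2); obs(3); obs(1)*obs(2)];
W0 = rand(4,2);                           % 4 basis elements, 2 possible actions
actor = rlDiscreteCategoricalActor({myBasisFcn,W0},obsInfo,actInfo);
act = getAction(actor,{rand(obsInfo.Dimension)});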

Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


of such objects. These objects define properties such as the dimensions, data types, and names of the
observation signals.


rlDiscreteCategoricalActor sets the ObservationInfo property of actor to the input


observationInfo.

You can extract ObservationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually.

ActionInfo — Action specifications


rlFiniteSetSpec object

Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the function
does not use the name of the action channel specified in actionInfo.

Note Only one action channel is allowed.

rlDiscreteCategoricalActor sets the ActionInfo property of critic to the input


actionInfo.

You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations on a CPU.

To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"

Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment
observations


evaluate Evaluate function approximator object given observation (or observation-


action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic

Examples

Create Discrete Categorical Actor from Deep Neural Network

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as consisting of
three values, -10, 0, and 10.
actInfo = rlFiniteSetSpec([-10 0 10]);

To approximate the policy within the actor, use a deep neural network.

The input of the network must accept a four-element vector (the observation vector just defined by
obsInfo), and its output must be a three-element vector. Each element of the output vector must be
between 0 and 1 since it represents the probability of executing each of the three possible actions (as
defined by actInfo). Using softmax as the output layer enforces this requirement (the software
automatically adds a softmaxLayer as a final output layer if you do not specify it explicitly). When
computing the action, the actor then randomly samples the distribution to return an action.

Create a neural network as an array of layer objects.


net = [ featureInputLayer(4)
fullyConnectedLayer(3) ];

Convert the network to a dlnetwork object and display the number of learnable parameters.
net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 15

Inputs:
1 'input' 4 features


Create the actor with rlDiscreteCategoricalActor, using the network and the observation and
action specification objects. When the network has multiple input layers, they are automatically
associated with the environment observation channels according to the dimension specifications in
obsInfo.

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

To check your actor, use getAction to return an action from a random observation vector, given the
current network weights.

act = getAction(actor,{rand(obsInfo.Dimension)});
act

act = 1x1 cell array


{[0]}

You can now use the actor to create a suitable agent, such as rlACAgent, or rlPGAgent.

Create Discrete Categorical Actor from Deep Neural Network Specifying Input Layer Name

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.

obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as consisting of
three values, -10, 0, and 10.

actInfo = rlFiniteSetSpec([-10 0 10]);

To approximate the policy within the actor, use a neural network.

The input of the network (here called netOin) must accept a four-element vector (the observation
vector just defined by obsInfo), and its output (here called actionProb) must be a three-element
vector. Each element of the output vector must be between 0 and 1 since it represents the probability
of executing each of the three possible actions (as defined by actInfo). Using softmax as the output
layer enforces this requirement (however, the software automatically adds a softmaxLayer as a final
output layer if you do not specify it explicitly). When computing the action, the actor then randomly
samples the distribution to return an action.

Create a network as an array of layer objects. Specify a name for the input layer, so you can later
explicitly associate it with the observation channel.

net = [ featureInputLayer(4,Name="netOin")
fullyConnectedLayer(3)
softmaxLayer(Name="actionProb") ];

Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).


net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 15

Inputs:
1 'netOin' 4 features

Create the actor with rlDiscreteCategoricalActor, using the network, the observation and
action specification objects, and the name of the network input layer.

actor = rlDiscreteCategoricalActor(net, ...


obsInfo,actInfo,...
Observation="netOin");

To validate your actor, use getAction to return an action from a random observation, given the
current network weights.

act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}

ans = 0

To return the probability distribution of the possible actions as a function of a random observation,
and given the current network weights, use evaluate.

prb = evaluate(actor,{rand(obsInfo.Dimension)})

prb = 1x1 cell array


{3x1 single}

prb{1}

ans = 3x1 single column vector

0.5606
0.2619
0.1776

You can now use the actor to create a suitable agent, such as rlACAgent, or rlPGAgent.

Create Discrete Categorical Actor from Custom Basis Function

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels, the first being a two-dimensional vector in a continuous space, the second
being a two-dimensional vector that can assume only three values, -[1 1], [0 0], and [1 1]. Therefore, a
single observation consists of two two-dimensional vectors, one continuous and the other discrete.

obsInfo = [rlNumericSpec([2 1]) rlFiniteSetSpec({-[1 1],[0 0],[1 1]})];


Create a discrete action space specification object (or alternatively use getActionInfo to extract
the specification object from an environment with a discrete action space). For this example, define
the action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).

actInfo = rlFiniteSetSpec([7 5 3]);

Create a custom basis function. Each element is a function of the observation defined by obsInfo.

myBasisFcn = @(obsC,obsD) [obsC(1)^2-obsD(2)^2;


obsC(2)^2-obsD(1)^2;
exp(obsC(2))+abs(obsD(1));
exp(obsC(1))+abs(obsD(2))];

The output of the actor is an action, among the ones defined in actInfo, randomly sampled
according to the probabilities softmax(W'*myBasisFcn(obsC,obsD)). W is a weight
matrix containing the learnable parameters, which must have as many rows as the length of the
basis function output and as many columns as the number of possible actions.

Define an initial parameter matrix.

W0 = rand(4,3);

Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.

actor = rlDiscreteCategoricalActor({myBasisFcn,W0},obsInfo,actInfo);

To check your actor, use the getAction function to return one of the three possible actions,
depending on a given random observation and on the current parameter matrix.

getAction(actor,{rand(2,1),[1 1]})

ans = 1x1 cell array


{[3]}

Note that the discrete set constraint is not enforced.

getAction(actor,{rand(2,1),[0.5 -0.7]})

ans = 1x1 cell array


{[3]}

You can now use the actor (along with a critic) to create a suitable discrete action space agent (such
as rlACAgent, rlPGAgent, or rlPPOAgent).

Create Discrete Categorical Actor from Deep Recurrent Neural Network

This example shows you how to create a stochastic actor with a discrete action space using a
recurrent neural network. You can also use a recurrent neural network for a continuous stochastic
actor.


For this example, use the same environment used in “Train PG Agent to Balance Cart-Pole System”.
Load the environment and obtain the observation and action specifications.
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the policy within the actor, use a recurrent deep neural network.

Create a neural network as an array of layer objects. To create a recurrent network, use a
sequenceInputLayer as the input layer (with size equal to the number of dimensions of the
observation channel) and include at least one lstmLayer.

Specify a name for the input layer, so you can later explicitly associate it with the observation
channel.
net = [
sequenceInputLayer( ...
prod(obsInfo.Dimension), ...
Name="netOin")
fullyConnectedLayer(8)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer( ...
numel(actInfo.Elements)) ];

Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)

Initialized: true

Number of learnables: 602

Inputs:
1 'netOin' Sequence input with 4 dimensions

Create a discrete categorical actor using the network, the environment specifications, and the name
of the network input layer to be associated with the observation channel.
actor = rlDiscreteCategoricalActor(net, ...
obsInfo,actInfo,...
Observation="netOin");

To check your actor, use getAction to return one of the two possible actions, depending on a given
random observation and on the current network weights.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}

ans = -10

To return the probability of each of the two possible actions, use evaluate. Note that the type of the
returned numbers is single, not double.
prob = evaluate(actor,{rand(obsInfo.Dimension)});
prob{1}


ans = 2x1 single column vector

0.4704
0.5296

You can use getState and setState to extract and set the current state of the recurrent neural
network in the actor.

getState(actor)

ans=2×1 cell array


{8x1 single}
{8x1 single}

actor = setState(actor, ...


{-0.01*single(rand(8,1)), ...
0.01*single(rand(8,1))});

To evaluate the actor using sequential observations, use the sequence length (time) dimension. For
example, obtain actions for 5 independent sequences, each consisting of 9 sequential observations.

[action,state] = getAction(actor, ...


{rand([obsInfo.Dimension 5 9])});

Display the action corresponding to the seventh element of the observation sequence in the fourth
sequence.

action = action{1};
action(1,1,4,7)

ans = 10

Display the updated state of the recurrent neural network.

state

state=2×1 cell array


{8x5 single}
{8x5 single}

For more information on input and output format for recurrent neural networks, see the Algorithms
section of lstmLayer.

You can now use the actor (along with a critic) to create a suitable discrete action space agent (such
as rlACAgent, rlPGAgent, or rlPPOAgent).

Version History
Introduced in R2022a


See Also
Functions
rlContinuousDeterministicActor | rlContinuousGaussianActor | getActionInfo |
getObservationInfo

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlDQNAgent
Deep Q-network (DQN) reinforcement learning agent

Description
The deep Q-network (DQN) algorithm is a model-free, online, off-policy reinforcement learning
method. A DQN agent is a value-based reinforcement learning agent that trains a critic to estimate
the return or future rewards. DQN is a variant of Q-learning, and it operates only within discrete
action spaces.

For more information, see “Deep Q-Network (DQN) Agents”. For more information on the different types
of reinforcement learning agents, see “Reinforcement Learning Agents”.

Creation
Syntax
agent = rlDQNAgent(observationInfo,actionInfo)
agent = rlDQNAgent(observationInfo,actionInfo,initOpts)

agent = rlDQNAgent(critic)

agent = rlDQNAgent(critic,agentOptions)

Description
Create Agent from Observation and Action Specifications

agent = rlDQNAgent(observationInfo,actionInfo) creates a DQN agent for an environment


with the given observation and action specifications, using default initialization options. The critic in
the agent uses a default vector (that is, multi-output) Q-value deep neural network built from the
observation specification observationInfo and the action specification actionInfo. The
ObservationInfo and ActionInfo properties of agent are set to the observationInfo and
actionInfo input arguments, respectively.

agent = rlDQNAgent(observationInfo,actionInfo,initOpts) creates a DQN agent for an


environment with the given observation and action specifications. The agent uses a default network
configured using options specified in the initOpts object. For more information on the initialization
options, see rlAgentInitializationOptions.
Create Agent from Critic

agent = rlDQNAgent(critic) creates a DQN agent with the specified critic network using a
default option set for a DQN agent.
Specify Agent Options

agent = rlDQNAgent(critic,agentOptions) creates a DQN agent with the specified critic


network and sets the AgentOptions property to the agentOptions input argument. Use this syntax
after any of the input argument combinations in the previous syntaxes.


Input Arguments

initOpts — Agent initialization options


rlAgentInitializationOptions object

Agent initialization options, specified as an rlAgentInitializationOptions object.

critic — Critic
rlQValueFunction object | rlVectorQValueFunction object

Critic, specified as an rlQValueFunction or as the generally more efficient


rlVectorQValueFunction object. For more information on creating critics, see “Create Policies
and Value Functions”.

Your critic can use a recurrent neural network as its function approximator. However, only
rlVectorQValueFunction supports recurrent neural networks. For an example, see “Create DQN
Agent with Recurrent Neural Network” on page 3-136.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data type, and names of the observation
signals.

If you create the agent by specifying a critic object, the value of ObservationInfo matches the
value specified in critic.

You can extract observationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specification


specification object

Action specifications, specified as a reinforcement learning specification object defining properties


such as dimensions, data type, and names of the action signals.

Since a DQN agent operates in a discrete action space, you must specify actionInfo as an
rlFiniteSetSpec object.

If you create the agent by specifying a critic object, the value of ActionInfo matches the value
specified in critic.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec.

AgentOptions — Agent options


rlDQNAgentOptions object

Agent options, specified as an rlDQNAgentOptions object.


If you create a DQN agent with a default critic that uses a recurrent neural network, the default value
of AgentOptions.SequenceLength is 32.

ExperienceBuffer — Experience buffer


rlReplayMemory object

Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer. Here:

• S is the current observation of the environment.


• A is the action taken by the agent.
• R is the reward for taking action A.
• S' is the next observation after taking action A.
• D is the is-done signal after taking action A.

UseExplorationPolicy — Option to use exploration policy


false (default) | true

Option to use exploration policy when selecting actions, specified as one of the following logical
values.

• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object


Examples

Create DQN Agent from Observation and Action Specifications

Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).

% load predefined environment


env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

% obtain observation and action specifications


obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

The agent creation function initializes the critic network randomly. You can ensure
reproducibility by fixing the seed of the random generator.

rng(0)

Create a deep Q-network agent from the environment observation and action specifications.

agent = rlDQNAgent(obsInfo,actInfo);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[1]}

You can now test and train the agent within the environment.

Create DQN Agent Using Initialization Options

Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).

% load predefined environment


env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

% obtain observation and action specifications


obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).


initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);

The agent creation function initializes the critic network randomly. Ensure reproducibility
by fixing the seed of the random generator.

rng(0)

Create a DQN agent from the environment observation and action specifications.

agent = rlDQNAgent(obsInfo,actInfo,initOpts);

Extract the deep neural network from the critic.

criticNet = getModel(getCritic(agent));

The default DQN agent uses a multi-output Q-value critic approximator. A multi-output approximator
has observations as inputs and state-action values as outputs. Each output element represents the
expected cumulative long-term reward for taking the corresponding discrete action from the state
indicated by the observation inputs.

Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons.

criticNet.Layers

ans =
11x1 Layer array with layers:

1 'concat' Concatenation Concatenation of 2 inputs along dimension 1


2 'relu_body' ReLU ReLU
3 'fc_body' Fully Connected 128 fully connected layer
4 'body_output' ReLU ReLU
5 'input_1' Image Input 50x50x1 images
6 'conv_1' 2-D Convolution 64 3x3x1 convolutions with stride [1 1] and padding
7 'relu_input_1' ReLU ReLU
8 'fc_1' Fully Connected 128 fully connected layer
9 'input_2' Feature Input 1 features
10 'fc_2' Fully Connected 128 fully connected layer
11 'output' Fully Connected 5 fully connected layer

Plot the critic network.

plot(layerGraph(criticNet))


To check your agent, use getAction to return the action from random observations.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[0]}

You can now test and train the agent within the environment.

Create a DQN Agent Using a Multi-Output Critic

Create an environment interface and obtain its observation and action specifications. For this
example, load the predefined environment used for the “Train DQN Agent to Balance Cart-Pole
System” example. This environment has a continuous four-dimensional observation space (the
positions and velocities of both cart and pole) and a discrete one-dimensional action space consisting
of the application of two possible forces, -10 N or 10 N.

Create the predefined environment.

env = rlPredefinedEnv("CartPole-Discrete");

Get the observation and action specification objects.


obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the Q-value function within the critic, use a deep neural network. For DQN agents
with a discrete action space, you have the option to create a multi-output Q-value function critic,
which is generally more efficient than a comparable single-output critic.

A network for this critic must take only the observation as input and return a vector of values for
each action. Therefore, it must have an input layer with as many elements as the dimension of the
observation space and an output layer having as many elements as the number of possible discrete
actions. Each output element represents the expected cumulative long-term reward following from
the observation given as input, when the corresponding action is taken.

Define the network as an array of layer objects, and get the dimensions of the observation space (that
is, prod(obsInfo.Dimension)) and the number of possible actions (that is,
numel(actInfo.Elements)) directly from the environment specification objects.

dnn = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];

Convert the network to a dlnetwork object and display the number of weights.

dnn = dlnetwork(dnn);
summary(dnn)

Initialized: true

Number of learnables: 770

Inputs:
1 'input' 4 features

Create the critic using rlVectorQValueFunction, the network dnn as well as the observation and
action specifications.

critic = rlVectorQValueFunction(dnn,obsInfo,actInfo);

Check that the critic works with a random observation input.

getValue(critic,{rand(obsInfo.Dimension)})

ans = 2x1 single column vector

-0.0361
0.0913

Create the DQN agent using the critic.

agent = rlDQNAgent(critic)

agent =
rlDQNAgent with properties:


ExperienceBuffer: [1x1 rl.replay.rlReplayMemory]


AgentOptions: [1x1 rl.option.rlDQNAgentOptions]
UseExplorationPolicy: 0
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 1

Specify agent options, including training options for the critic.

agent.AgentOptions.UseDoubleDQN=false;
agent.AgentOptions.TargetUpdateMethod="periodic";
agent.AgentOptions.TargetUpdateFrequency=4;
agent.AgentOptions.ExperienceBufferLength=100000;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.MiniBatchSize=256;

agent.AgentOptions.CriticOptimizerOptions.LearnRate=1e-2;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[10]}

You can now test and train the agent within the environment.

Create a DQN Agent Using a Single-Output Critic

Create an environment interface and obtain its observation and action specifications. For this
example, load the predefined environment used for the “Train DQN Agent to Balance Cart-Pole
System” example. This environment has a continuous four-dimensional observation space (the
positions and velocities of both cart and pole) and a discrete one-dimensional action space consisting
of the application of two possible forces, -10 N or 10 N.

Create the predefined environment.

env = rlPredefinedEnv("CartPole-Discrete");

Get the observation and action specification objects.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a deep neural network to be used as the approximation model within the critic. For DQN agents,
you have the option to create a multi-output Q-value function critic, which is generally more efficient
than a comparable single-output critic. However, for this example, create a single-output Q-value
function critic instead.


The network for this critic must have two input layers, one for the observation and the other for the
action, and return a scalar value representing the expected cumulative long-term reward following
from the given observation and action.

Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel.

% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netOin")
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24,Name="fcObsPath")];

% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="netAin")
fullyConnectedLayer(24,Name="fcActPath")];

% Common path (concatenate inputs along dim #1)


commonPath = [
concatenationLayer(1,2,Name="cat")
reluLayer
fullyConnectedLayer(1,Name="out")];

% Add paths to network


net = layerGraph;
net = addLayers(net,obsPath);
net = addLayers(net,actPath);
net = addLayers(net,commonPath);

% Connect layers
net = connectLayers(net,'fcObsPath','cat/in1');
net = connectLayers(net,'fcActPath','cat/in2');

% Plot network
plot(net)


% Convert to dlnetwork object


net = dlnetwork(net);

% Display the number of weights


summary(net)

Initialized: true

Number of learnables: 817

Inputs:
1 'netOin' 4 features
2 'netAin' 1 features

Create the critic using rlQValueFunction. Specify the names of the layers to be associated with the
observation and action channels.
critic = rlQValueFunction(net, ...
obsInfo, ...
actInfo, ...
ObservationInputNames="netOin", ...
ActionInputNames="netAin");

Check the critic with a random observation and action input.


getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
-0.0232


Create the DQN agent using the critic.

agent = rlDQNAgent(critic)

agent =
rlDQNAgent with properties:

ExperienceBuffer: [1x1 rl.replay.rlReplayMemory]


AgentOptions: [1x1 rl.option.rlDQNAgentOptions]
UseExplorationPolicy: 0
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 1

Specify agent options, including training options for the critic.

agent.AgentOptions.UseDoubleDQN=false;
agent.AgentOptions.TargetUpdateMethod="periodic";
agent.AgentOptions.TargetUpdateFrequency=4;
agent.AgentOptions.ExperienceBufferLength=100000;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.MiniBatchSize=256;

agent.AgentOptions.CriticOptimizerOptions.LearnRate=1e-2;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[10]}

You can now test and train the agent within the environment.

Create DQN Agent with Recurrent Neural Network

For this example, load the predefined environment used for the “Train DQN Agent to Balance Cart-Pole
System” example. This environment has a continuous four-dimensional observation space (the
positions and velocities of both cart and pole) and a discrete one-dimensional action space consisting
of the application of two possible forces, -10 N or 10 N.

env = rlPredefinedEnv('CartPole-Discrete');

Get the observation and action specification objects.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the Q-value function within the critic, use a recurrent deep neural network. For DQN
agents, only the vector function approximator, rlVectorQValueFunction, supports recurrent
neural network models. For vector Q-value function critics, the number of elements of the output
layer has to be equal to the number of possible actions: numel(actInfo.Elements).


Define the network as an array of layer objects. Get the dimensions of the observation space from the
environment specification object (prod(obsInfo.Dimension)). To create a recurrent neural
network, use a sequenceInputLayer as the input layer and include an lstmLayer as one of the
other network layers.
net = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
lstmLayer(20,OutputMode="sequence");
fullyConnectedLayer(20)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];

Convert to a dlnetwork object and display the number of weights.


net = dlnetwork(net);
summary(net);

Initialized: true

Number of learnables: 6.3k

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions

Create a critic using the recurrent neural network.


critic = rlVectorQValueFunction(net,obsInfo,actInfo);

Check your critic with a random input observation.


getValue(critic,{rand(obsInfo.Dimension)})

ans = 2x1 single column vector

0.0136
0.0067

Define some training options for the critic.


criticOptions = rlOptimizerOptions( ...
LearnRate=1e-3, ...
GradientThreshold=1);

Specify options for creating the DQN agent. To use a recurrent neural network, you must specify a
SequenceLength greater than 1.
agentOptions = rlDQNAgentOptions(...
UseDoubleDQN=false, ...
TargetSmoothFactor=5e-3, ...
ExperienceBufferLength=1e6, ...
SequenceLength=32, ...
CriticOptimizerOptions=criticOptions);

agentOptions.EpsilonGreedyExploration.EpsilonDecay = 1e-4;

Create the agent using the critic and the agent options.


agent = rlDQNAgent(critic,agentOptions);

Check your agent using getAction to return the action from a random observation.

getAction(agent,rand(obsInfo.Dimension))

ans = 1x1 cell array


{[-10]}

You can now test and train the agent against the environment.

Version History
Introduced in R2019a

See Also
rlAgentInitializationOptions | rlDQNAgentOptions | rlVectorQValueFunction |
rlQValueFunction | Deep Network Designer

Topics
“Deep Q-Network (DQN) Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”


rlDQNAgentOptions
Options for DQN agent

Description
Use an rlDQNAgentOptions object to specify options for deep Q-network (DQN) agents. To create a
DQN agent, use rlDQNAgent.

For more information, see “Deep Q-Network (DQN) Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
opt = rlDQNAgentOptions
opt = rlDQNAgentOptions(Name,Value)

Description

opt = rlDQNAgentOptions creates an options object for use as an argument when creating a DQN
agent using all default settings. You can modify the object properties using dot notation.

opt = rlDQNAgentOptions(Name,Value) sets option properties on page 3-139 using name-value


pairs. For example, rlDQNAgentOptions('DiscountFactor',0.95) creates an option set with a
discount factor of 0.95. You can specify multiple name-value pairs. Enclose each property name in
quotes.

Properties
UseDoubleDQN — Flag for using double DQN
true (default) | false

Flag for using double DQN for value function target updates, specified as a logical value. For most
applications, set UseDoubleDQN to true. For more information, see “Deep Q-Network (DQN) Agents”.

EpsilonGreedyExploration — Options for epsilon-greedy exploration


EpsilonGreedyExploration object

Options for epsilon-greedy exploration, specified as an EpsilonGreedyExploration object with the


following properties.


• Epsilon — Probability threshold to either randomly select an action or select the action that
  maximizes the state-action value function. A larger value of Epsilon means that the agent
  randomly explores the action space at a higher rate. Default value: 1
• EpsilonMin — Minimum value of Epsilon. Default value: 0.01
• EpsilonDecay — Decay rate. Default value: 0.0050

At the end of each training time step, if Epsilon is greater than EpsilonMin, then it is updated
using the following formula.

Epsilon = Epsilon*(1-EpsilonDecay)

If your agent converges on local optima too quickly, you can promote agent exploration by increasing
Epsilon.

To specify exploration options, use dot notation after creating the rlDQNAgentOptions object opt.
For example, set the epsilon value to 0.9.
opt.EpsilonGreedyExploration.Epsilon = 0.9;

CriticOptimizerOptions — Critic optimizer options


rlOptimizerOptions object

Critic optimizer options, specified as an rlOptimizerOptions object. These options let you specify
training parameters of the critic approximator, such as the learning rate and gradient threshold, as
well as the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions
and rlOptimizer.
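For example (the values are illustrative only), you can create an option set with custom critic
optimizer settings as follows.

criticOpts = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
opt = rlDQNAgentOptions(CriticOptimizerOptions=criticOpts);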

TargetSmoothFactor — Smoothing factor for target critic updates


1e-3 (default) | positive scalar less than or equal to 1

Smoothing factor for target critic updates, specified as a positive scalar less than or equal to 1. For
more information, see “Target Update Methods”.

TargetUpdateFrequency — Number of steps between target critic updates


1 (default) | positive integer

Number of steps between target critic updates, specified as a positive integer. For more information,
see “Target Update Methods”.
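For example, to configure periodic target updates every 4 steps, set both properties using dot
notation.

opt = rlDQNAgentOptions;
opt.TargetUpdateFrequency = 4;   % update the target critic every 4 steps
opt.TargetSmoothFactor = 1;      % copy the critic parameters directly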

ResetExperienceBufferBeforeTraining — Option for clearing the experience buffer


true (default) | false

Option for clearing the experience buffer before training, specified as a logical value.

SequenceLength — Maximum batch-training trajectory length when using RNN


1 (default) | positive integer


Maximum batch-training trajectory length when using a recurrent neural network for the critic,
specified as a positive integer. This value must be greater than 1 when using a recurrent neural
network for the critic and 1 otherwise.

MiniBatchSize — Size of random experience mini-batch


64 (default) | positive integer

Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the critic properties. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.

When using a recurrent neural network for the critic, MiniBatchSize is the number of experience
trajectories in a batch, where each trajectory has length equal to SequenceLength.
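For example (the values are illustrative only), when the critic uses a recurrent neural network, the
following options sample 32 trajectories of 16 steps each at every learning update.

opt = rlDQNAgentOptions(SequenceLength=16,MiniBatchSize=32);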

NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer

Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see chapter 7 of [1].

N-step Q learning is not supported when using a recurrent neural network for the critic. In this case,
NumStepsToLookAhead must be 1.

ExperienceBufferLength — Experience buffer size


10000 (default) | positive integer

Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.

SampleTime — Sample time of agent


1 (default) | positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

DiscountFactor — Discount factor


0.99 (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.

Object Functions
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent


Examples

Create DQN Agent Options Object

This example shows how to create a DQN agent options object.

Create an rlDQNAgentOptions object that specifies the agent mini-batch size.

opt = rlDQNAgentOptions('MiniBatchSize',48)

opt =
rlDQNAgentOptions with properties:

UseDoubleDQN: 1
EpsilonGreedyExploration: [1x1 rl.option.EpsilonGreedyExploration]
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
TargetSmoothFactor: 1.0000e-03
TargetUpdateFrequency: 1
ResetExperienceBufferBeforeTraining: 1
SequenceLength: 1
MiniBatchSize: 48
NumStepsToLookAhead: 1
ExperienceBufferLength: 10000
SampleTime: 1
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent sample time to 0.5.

opt.SampleTime = 0.5;

Version History
Introduced in R2019a

Target update method settings for DQN agents have changed


Behavior changed in R2020a

Target update method settings for DQN agents have changed. The following changes require updates
to your code:

• The TargetUpdateMethod option has been removed. Now, DQN agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.

To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.

Update Method                               TargetUpdateFrequency   TargetSmoothFactor
Smoothing                                   1                       Less than 1
Periodic                                    Greater than 1          1
Periodic smoothing (new method in R2020a)   Greater than 1          Less than 1

The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
Update Code

This table shows some typical uses of rlDQNAgentOptions and how to update your code to use the
new option configuration.

Not Recommended:
opt = rlDQNAgentOptions('TargetUpdateMethod',"smoothing");
Recommended:
opt = rlDQNAgentOptions;

Not Recommended:
opt = rlDQNAgentOptions('TargetUpdateMethod',"periodic");
Recommended:
opt = rlDQNAgentOptions;
opt.TargetUpdateFrequency = 4;
opt.TargetSmoothFactor = 1;

Not Recommended:
opt = rlDQNAgentOptions;
opt.TargetUpdateMethod = "periodic";
opt.TargetUpdateFrequency = 5;
Recommended:
opt = rlDQNAgentOptions;
opt.TargetUpdateFrequency = 5;
opt.TargetSmoothFactor = 1;

References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.

See Also
Topics
“Deep Q-Network (DQN) Agents”


rlEpsilonGreedyPolicy
Policy object to generate discrete epsilon-greedy actions for custom training loops

Description
This object implements an epsilon-greedy policy, which returns either the action that maximizes a
discrete action-space Q-value function, with probability 1-Epsilon, or a random action otherwise,
given an input observation. You can create an rlEpsilonGreedyPolicy object from an
rlQValueFunction or rlVectorQValueFunction object, or extract it from an rlQAgent,
rlDQNAgent or rlSARSAAgent. You can then train the policy object using a custom training loop or
deploy it for your application. If UseEpsilonGreedyAction is set to 0, the policy is deterministic
and therefore does not explore. This object is not compatible with generatePolicyBlock
and generatePolicyFunction. For more information on policies and value functions, see “Create
Policies and Value Functions”.
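As a minimal sketch, assuming agent is an existing rlDQNAgent, you can build an epsilon-greedy
policy around its critic as follows.

critic = getCritic(agent);               % agent is assumed to be an rlDQNAgent
policy = rlEpsilonGreedyPolicy(critic);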

Creation

Syntax
policy = rlEpsilonGreedyPolicy(qValueFunction)

Description

policy = rlEpsilonGreedyPolicy(qValueFunction) creates the epsilon-greedy policy object


policy from the discrete action-space Q-value function qValueFunction. It also sets the
QValueFunction property of policy to the input argument qValueFunction.

Properties
QValueFunction — Discrete action-space Q-value function
rlQValueFunction object | rlVectorQValueFunction object

Discrete action-space Q-value function approximator, specified as an rlQValueFunction or


rlVectorQValueFunction object.

ExplorationOptions — Exploration options


EpsilonGreedyExploration object

Exploration options, specified as an EpsilonGreedyExploration object. For more information see


the EpsilonGreedyExploration property in rlQAgentOptions.

UseEpsilonGreedyAction — Option to enable epsilon-greedy actions


true (default) | false

Option to enable epsilon-greedy actions, specified as a logical value: either true (default, enabling
epsilon-greedy actions, which helps exploration) or false (epsilon-greedy actions not enabled). When
epsilon-greedy actions are disabled, the policy is deterministic and therefore does not explore.


Example: false

EnableEpsilonDecay — Option to enable epsilon decay

true (default) | false

Option to enable epsilon decay, specified as a logical value: either true (default, enabling epsilon
decay) or false (disabling epsilon decay).
Example: false

ObservationInfo — Observation specifications


rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


of such objects. These objects define properties such as the dimensions, data types, and names of the
observation signals.

ActionInfo — Action specifications


rlFiniteSetSpec object

Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the name of
the action channel specified in actionInfo (if any) is not used.

Note Only one action channel is allowed.

SampleTime — Sample time of policy


positive scalar | -1 (default)

Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.

Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2

Object Functions
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object

Examples


Create Epsilon-Greedy Policy Object from Vector Q-Value Function

Create observation and action specification objects. For this example, define the observation space as
a continuous four-dimensional space, so that a single observation is a column vector containing four
doubles, and the action space as a finite set consisting of two possible row vectors, [1 0] and [0 1].
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec({[1 0],[0 1]});

Alternatively, use getObservationInfo and getActionInfo to extract the specification objects


from an environment.

Create a vector Q-value function approximator to use as critic. A vector Q-value function must accept
an observation as input and return a single vector with as many elements as the number of possible
discrete actions.

To approximate the vector Q-value function within the critic, use a neural network. Define a single
path from the network input to its output as an array of layer objects.
layers = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(10)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];

Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)

Initialized: true

Number of learnables: 72

Inputs:
1 'input' 4 features

Create a vector Q-value function using model, and the observation and action specifications.
qValueFcn = rlVectorQValueFunction(model,obsInfo,actInfo)

qValueFcn =
rlVectorQValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
UseDevice: "cpu"

Check the critic with a random observation input.


getValue(qValueFcn,{rand(obsInfo.Dimension)})

ans = 2x1 single column vector

0.6486
-0.3103


Create a policy object from qValueFcn.

policy = rlEpsilonGreedyPolicy(qValueFcn)

policy =
rlEpsilonGreedyPolicy with properties:

QValueFunction: [1x1 rl.function.rlVectorQValueFunction]


ExplorationOptions: [1x1 rl.option.EpsilonGreedyExploration]
UseEpsilonGreedyAction: 1
EnableEpsilonDecay: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: -1

Check the policy with a random observation input.

getAction(policy,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[1 0]}

You can now train the policy with a custom training loop and then deploy it to your application.

Version History
Introduced in R2022a

See Also
Functions
rlMaxQPolicy | rlDeterministicActorPolicy | rlAdditiveNoisePolicy |
rlStochasticActorPolicy | rlQValueFunction | rlVectorQValueFunction | rlSARSAAgent
| rlQAgent | rlDQNAgent

Blocks
RL Agent

Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”


rlFiniteSetSpec
Create discrete action or observation data specifications for reinforcement learning environments

Description
An rlFiniteSetSpec object specifies discrete action or observation data specifications for
reinforcement learning environments.

Creation
Syntax
spec = rlFiniteSetSpec(elements)

Description

spec = rlFiniteSetSpec(elements) creates a data specification with a discrete set of actions or


observations, setting the Elements property.

Properties
Elements — Set of valid actions or observations
vector | cell array

Set of valid actions or observations for the environment, specified as one of the following:

• Vector — Specify valid numeric values for a single action or single observation.
• Cell array — Specify valid numeric value combinations when you have more than one action or
observation. Each entry of the cell array must have the same dimensions.
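For instance (the values are illustrative only), the two forms look like this.

% Single channel whose elements are scalars
scalarSpec = rlFiniteSetSpec([-1 0 1]);

% Single channel whose elements are 1-by-2 vectors (all entries have the same size)
vectorSpec = rlFiniteSetSpec({[1 10],[1 20],[2 10],[2 20]});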

Name — Name of the rlFiniteSetSpec object


string

Name of the rlFiniteSetSpec object, specified as a string. Use this property to set a meaningful
name for your finite set.

Description — Description of the rlFiniteSetSpec object


string

Description of the rlFiniteSetSpec object, specified as a string. Use this property to specify a
meaningful description of the finite set values.

Dimension — Size of each element


vector (default)

This property is read-only.

Size of each element, specified as a vector.


If you specify Elements as a vector, then Dimension is [1 1]. Otherwise, if you specify a cell array,
then Dimension indicates the size of the entries in Elements.

DataType — Information about the type of data


"double" (default) | string

This property is read-only.

Information about the type of data, specified as a string, such as "double" or "single".

Object Functions
rlSimulinkEnv Create reinforcement learning environment using dynamic model
implemented in Simulink
rlFunctionEnv Specify custom reinforcement learning environment dynamics
using functions
rlValueFunction Value function approximator object for reinforcement learning
agents
rlQValueFunction Q-Value function approximator object for reinforcement learning
agents
rlVectorQValueFunction Vector Q-value function approximator for reinforcement learning
agents
rlContinuousDeterministicActor Deterministic actor with a continuous action space for
reinforcement learning agents
rlDiscreteCategoricalActor Stochastic categorical actor with a discrete action space for
reinforcement learning agents
rlContinuousGaussianActor Stochastic Gaussian actor with a continuous action space for
reinforcement learning agents

Examples

Create Reinforcement Learning Environment for Simulink Model

For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.

Open the model.

mdl = 'rlSimplePendulumModel';
open_system(mdl)

Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.

The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.

obsInfo = rlNumericSpec([3 1])

obsInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf


Name: [0×0 string]


Description: [0×0 string]
Dimension: [3 1]
DataType: "double"

The action is a scalar expressing the torque and can be one of three possible values, -2 Nm, 0 Nm and
2 Nm.

actInfo = rlFiniteSetSpec([-2 0 2])

actInfo =
rlFiniteSetSpec with properties:

Elements: [3×1 double]


Name: [0×0 string]
Description: [0×0 string]
Dimension: [1 1]
DataType: "double"

You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.

obsInfo.Name = 'observations';
actInfo.Name = 'torque';

Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.

agentBlk = [mdl '/RL Agent'];


env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)

env =
SimulinkEnvWithAgent with properties:

Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on

You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.

env.ResetFcn = @(in) setVariable(in,'theta0',randn,'Workspace',mdl)

env =
SimulinkEnvWithAgent with properties:

Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on


Specify Discrete Value Set for Multiple Actions

If the actor for your reinforcement learning agent has multiple outputs, each with a discrete action
space, you can specify the possible discrete action combinations using an rlFiniteSetSpec object.

Suppose that the valid values for a two-output system are [1 2] for the first output and [10 20 30]
for the second output. Create a discrete action space specification for all possible input combinations.

actionSpec = rlFiniteSetSpec({[1 10],[1 20],[1 30],...


[2 10],[2 20],[2 30]})

actionSpec =
rlFiniteSetSpec with properties:

Elements: {6x1 cell}


Name: [0x0 string]
Description: [0x0 string]
Dimension: [1 2]
DataType: "double"

Version History
Introduced in R2019a

See Also
rlNumericSpec | rlSimulinkEnv | getActionInfo | getObservationInfo |
rlValueRepresentation | rlQValueRepresentation |
rlDeterministicActorRepresentation | rlStochasticActorRepresentation |
rlFunctionEnv


rlFunctionEnv
Specify custom reinforcement learning environment dynamics using functions

Description
Use rlFunctionEnv to define a custom reinforcement learning environment. You provide MATLAB
functions that define the step and reset behavior for the environment. This object is useful when you
want to customize your environment beyond the predefined environments available with
rlPredefinedEnv.

Creation
Syntax
env = rlFunctionEnv(obsInfo,actInfo,stepfcn,resetfcn)

Description

env = rlFunctionEnv(obsInfo,actInfo,stepfcn,resetfcn) creates a reinforcement


learning environment using the provided observation and action specifications, obsInfo and
actInfo, respectively. You also set the StepFcn and ResetFcn properties using MATLAB functions.

Input Arguments

obsInfo — Observation specification


rlFiniteSetSpec object | rlNumericSpec object | array

Observation specification, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


containing a mix of such objects. These objects define properties such as the dimensions, data types,
and names of the observation signals.

actInfo — Action specification


rlFiniteSetSpec object | rlNumericSpec object

Action specification, specified as an rlFiniteSetSpec or rlNumericSpec object. These objects


define properties such as the dimensions, data types, and names of the action signals.

Properties
StepFcn — Step behavior for the environment
function name | function handle | anonymous function handle

Step behavior for the environment, specified as a function name, function handle, or anonymous
function.

StepFcn is a function that you provide which describes how the environment advances to the next
state from a given action. When using a function name or function handle, this function must have
two inputs and four outputs, as illustrated by the following signature.


[Observation,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals)

To use additional input arguments beyond the required set, specify StepFcn using an anonymous
function handle.

The step function computes the values of the observation and reward for the given action in the
environment. The required input and output arguments are as follows.

• Action — Current action, which must match the dimensions and data type specified in actInfo.
• Observation — Returned observation, which must match the dimensions and data types
specified in obsInfo.
• Reward — Reward for the current step, returned as a scalar value.
• IsDone — Logical value indicating whether to end the simulation episode. The step function that
you define can include logic to decide whether to end the simulation based on the observation,
reward, or any other values.
• LoggedSignals — Any data that you want to pass from one step to the next, specified as a
structure.

For an example showing multiple ways to define a step function, see “Create MATLAB Environment
Using Custom Functions”.
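
For illustration only, the following is a minimal sketch of a step function for a hypothetical scalar integrator environment. The State field, sample time, reward, and termination condition shown here are assumptions for the sketch, not part of the toolbox.

function [Observation,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals)
% Advance an assumed scalar state by the applied action.
Ts = 0.1;                                   % sample time (assumed)
LoggedSignals.State = LoggedSignals.State + Ts*Action;
Observation = LoggedSignals.State;          % must match the dimensions in obsInfo
Reward = -abs(Observation);                 % penalize distance from the origin
IsDone = abs(Observation) > 10;             % end the episode if the state diverges
end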

ResetFcn — Reset behavior for the environment


function name | function handle | anonymous function handle

Reset behavior for the environment, specified as a function name, function handle, or anonymous
function handle.

The reset function that you provide must have no inputs and two outputs, as illustrated by the
following signature.
[InitialObservation,LoggedSignals] = myResetFunction

To use input arguments with your reset function, specify ResetFcn using an anonymous function
handle.

The reset function sets the environment to an initial state and computes the initial values of the
observation signals. For example, you can create a reset function that randomizes certain state
values, such that each training episode begins from different initial conditions.

The sim function calls the reset function to reset the environment at the start of each simulation, and
the train function calls it at the start of each training episode.

The InitialObservation output must match the dimensions and data type of obsInfo.

To pass information from the reset condition into the first step, specify that information in the reset
function as the output structure LoggedSignals.

For an example showing multiple ways to define a reset function, see “Create MATLAB Environment
Using Custom Functions”.
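
For illustration only, here is a minimal sketch of a matching reset function for the same hypothetical scalar environment; the randomized initial state is an assumption.

function [InitialObservation,LoggedSignals] = myResetFunction()
% Start each episode from a small random initial state (assumed).
LoggedSignals.State = 0.1*randn;
InitialObservation = LoggedSignals.State;
end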

LoggedSignals — Information to pass to next step


structure

Information to pass to the next step, specified as a structure. When you create the environment,
whatever you define as the LoggedSignals output of ResetFcn initializes this property. When a
step occurs, the software populates this property with data to pass to the next step, as defined in
StepFcn.

Object Functions
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified environment
validateEnvironment Validate custom reinforcement learning environment

Examples

Create Custom MATLAB Environment

Create a reinforcement learning environment by supplying custom dynamic functions in MATLAB®.


Using rlFunctionEnv, you can create a MATLAB reinforcement learning environment from an
observation specification, action specification, and step and reset functions that you define.

For this example, create an environment that represents a system for balancing a cart on a pole. The
observations from the environment are the cart position, cart velocity, pendulum angle, and
pendulum angle derivative. (For additional details about this environment, see “Create MATLAB
Environment Using Custom Functions”.) Create an observation specification for those signals.

oinfo = rlNumericSpec([4 1]);


oinfo.Name = 'CartPole States';
oinfo.Description = 'x, dx, theta, dtheta';

The environment has a discrete action space where the agent can apply one of two possible force
values to the cart, –10 N or 10 N. Create the action specification for those actions.

ActionInfo = rlFiniteSetSpec([-10 10]);


ActionInfo.Name = 'CartPole Action';

Next, specify the custom step and reset functions. For this example, use the supplied functions
myResetFunction.m and myStepFunction.m. For details about these functions and how they are
constructed, see “Create MATLAB Environment Using Custom Functions”.

Construct the custom environment using the defined observation specification, action specification,
and function names.

env = rlFunctionEnv(oinfo,ActionInfo,'myStepFunction','myResetFunction');

You can create agents for env and train them within the environment as you would for any other
reinforcement learning environment.

As an alternative to using function names, you can specify the functions as function handles. For
more details and an example, see “Create MATLAB Environment Using Custom Functions”.
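
Before training, you can also check that the custom functions are consistent with the specifications by calling validateEnvironment, which performs a short reset and step on the environment.

validateEnvironment(env)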


Version History
Introduced in R2019a

See Also
rlPredefinedEnv | rlCreateEnvTemplate

Topics
“Create MATLAB Environment Using Custom Functions”


rlIsDoneFunction
Is-done function approximator object for neural network-based environment

Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify the is-done function approximator using an rlIsDoneFunction object. Do so when you do
not know a ground-truth termination signal for your environment.

The is-done function approximator object uses a deep neural network as an internal approximation
model to predict the termination signal for the environment, given one of the following input
combinations.

• Observations, actions, and next observations


• Observations and actions
• Actions and next observations
• Next observations

Creation
Syntax
isdFcnAppx = rlIsDoneFunction(net,observationInfo,actionInfo,Name=Value)

Description

isdFcnAppx = rlIsDoneFunction(net,observationInfo,actionInfo,Name=Value)
creates the is-done function approximator object isdFcnAppx using the deep neural network net
and sets the ObservationInfo and ActionInfo properties.

When creating an is-done function approximator you must specify the names of the deep neural
network inputs using one of the following combinations of name-value pair arguments.

• ObservationInputNames, ActionInputNames, and NextObservationInputNames


• ObservationInputNames and ActionInputNames
• ActionInputNames and NextObservationInputNames
• NextObservationInputNames

You can also specify the UseDeterministicPredict and UseDevice properties using optional
name-value pair arguments. For example, to use a GPU for prediction, specify UseDevice="gpu".

Input Arguments

net — Deep neural network


dlnetwork object

Deep neural network with a scalar output value, specified as a dlnetwork object.


The input layer names for this network must match the input names specified using the
ObservationInputNames, ActionInputNames, and NextObservationInputNames. The
dimensions of the input layers must match the dimensions of the corresponding observation and
action specifications in ObservationInfo and ActionInfo, respectively.

Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the


argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.
Example: ObservationInputNames="velocity"

ObservationInputNames — Observation input layer names


string | string array

Observation input layer names, specified as a string or string array. Specify


ObservationInputNames when you expect the termination signal to depend on the current
environment observation.

The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.

ActionInputNames — Action input layer names


string | string array

Action input layer names, specified as a string or string array. Specify ActionInputNames when you
expect the termination signal to depend on the current action value.

The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.

NextObservationInputNames — Next observation input layer names


string | string array

Next observation input layer names, specified as a string or string array. Specify
NextObservationInputNames when you expect the termination signal to depend on the next
environment observation.

The number of next observation input names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

This property is read-only.

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the
observation signals.


You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specifications


specification object | array of specification objects

This property is read-only.

Action specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the action
signals.

You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.

UseDeterministicPredict — Option to predict the terminal signal deterministically


true (default) | false

Option to predict the terminal signal deterministically, specified as one of the following values.

• true — Use deterministic network prediction.


• false — Use stochastic network prediction.

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.

Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models

Examples

Create Is-Done Function and Predict Termination

Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.


env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the is-done function, use a deep neural network. The network has one input channel
for the next observations. The single output channel is for the predicted termination signal.

Create the neural network as a vector of layer objects.


commonPath = [
featureInputLayer( ...
obsInfo.Dimension(1), ...
Name="nextState")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(2,Name="isdone0")
softmaxLayer(Name="isdone")];

net = layerGraph(commonPath);

plot(net)

Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net);


Initialized: true

Number of learnables: 4.6k

Inputs:
1 'nextState' 4 features

Create an is-done function approximator object.

isDoneFcnAppx = rlIsDoneFunction(...
net,obsInfo,actInfo,...
NextObservationInputNames="nextState");

Using this is-done function approximator object, you can predict the termination signal based on the
next observation. For example, predict the termination signal for a random next observation. Since
for this example the termination signal only depends on the next observation, use empty cell arrays
for the current action and observation inputs.

nxtobs = rand(obsInfo.Dimension);
predIsDone = predict(isDoneFcnAppx,{},{},{nxtobs})

predIsDone = 0

You can obtain the termination probability using evaluate.

predIsDoneProb = evaluate(isDoneFcnAppx,{nxtobs})

predIsDoneProb = 1x1 cell array


{2x1 single}

predIsDoneProb{1}

ans = 2x1 single column vector

0.5405
0.4595

The first number is the probability of obtaining a 0 (no termination predicted), the second one is the
probability of obtaining a 1 (termination predicted).

Version History
Introduced in R2022a

See Also
Objects
rlNeuralNetworkEnvironment | rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | evaluate | accelerate | gradient


Topics
“Model-Based Policy Optimization Agents”


rlMaxQPolicy
Policy object to generate discrete max-Q actions for custom training loops and application
deployment

Description
This object implements a max-Q policy, which returns the action that maximizes a discrete action-
space Q-value function, given an input observation. You can create an rlMaxQPolicy object from an
rlQValueFunction or rlVectorQValueFunction object, or extract it from an rlQAgent,
rlDQNAgent or rlSARSAAgent. You can then train the policy object using a custom training loop or
deploy it for your application using generatePolicyBlock or generatePolicyFunction. This
policy is always deterministic and does not perform any exploration. For more information on policies
and value functions, see “Create Policies and Value Functions”.

Creation

Syntax
policy = rlMaxQPolicy(qValueFunction)

Description

policy = rlMaxQPolicy(qValueFunction) creates the max-Q policy object policy from the
discrete action-space Q-value function qValueFunction. It also sets the QValueFunction property
of policy to the input argument qValueFunction.

Properties
QValueFunction — Discrete action-space Q-value function
rlQValueFunction object

Discrete action-space Q-value function approximator, specified as an rlQValueFunction or


rlVectorQValueFunction object.

ObservationInfo — Observation specifications


rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


of such objects. These objects define properties such as the dimensions, data types, and names of the
observation signals.

ActionInfo — Action specifications


rlFiniteSetSpec object

Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the name of
the action channel specified in actionInfo (if any) is not used.


Note Only one action channel is allowed.

SampleTime — Sample time of policy


positive scalar | -1 (default)

Sample time of policy, specified as a positive scalar or as -1 (default). Setting this parameter to -1
allows for event-based simulations.

Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2

Object Functions
generatePolicyBlock Generate Simulink block that evaluates policy of an agent or policy object
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object

Examples

Create Max-Q Policy Object from Vector Q-Value Function

Create observation and action specification objects. For this example, define the observation space as
a continuous four-dimensional space, so that a single observation is a column vector containing four
doubles, and the action space as a finite set consisting of two possible values, -1 and 1.

obsInfo = rlNumericSpec([4 1]);


actInfo = rlFiniteSetSpec([-1 1]);

Alternatively, you can use getObservationInfo and getActionInfo to extract the specification
objects from an environment.

Create a vector Q-value function approximator to use as critic. A vector Q-value function must accept
an observation as input and return a single vector with as many elements as the number of possible
discrete actions.

To approximate the vector Q-value function within the critic, use a neural network. Define a single
path from the network input to its output as an array of layer objects.

layers = [
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(10)
    reluLayer
    fullyConnectedLayer(numel(actInfo.Elements))
    ];

Convert the network to a dlnetwork object and display the number of weights.

model = dlnetwork(layers);
summary(model)

Initialized: true

Number of learnables: 72

Inputs:
1 'input' 4 features

Create a vector Q-value function using model, and the observation and action specifications.

qValueFcn = rlVectorQValueFunction(model,obsInfo,actInfo)

qValueFcn =
rlVectorQValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
UseDevice: "cpu"

Check the critic with a random observation input.

getValue(qValueFcn,{rand(obsInfo.Dimension)})

ans = 2x1 single column vector

0.6486
-0.3103

Create a policy object from qValueFcn.

policy = rlMaxQPolicy(qValueFcn)

policy =
rlMaxQPolicy with properties:

QValueFunction: [1x1 rl.function.rlVectorQValueFunction]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: -1

Check the policy with a random observation input.

getAction(policy,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-1]}


You can now train the policy with a custom training loop and then deploy it to your application.
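
For example, one possible deployment path, shown here only as a sketch, is to generate a standalone policy evaluation function and then call or compile it. The generated file name below is the toolbox default and may differ in your setup.

% Generate a policy evaluation function and a data file for the policy.
generatePolicyFunction(policy);

% The generated function maps an observation to the max-Q action.
action = evaluatePolicy(rand(obsInfo.Dimension));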

Version History
Introduced in R2022a

See Also
Functions
rlEpsilonGreedyPolicy | rlDeterministicActorPolicy | rlAdditiveNoisePolicy |
rlStochasticActorPolicy | rlQValueFunction | rlVectorQValueFunction | rlSARSAAgent
| rlQAgent | rlDQNAgent | generatePolicyBlock | generatePolicyFunction

Blocks
RL Agent

Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”


rlMBPOAgent
Model-based policy optimization reinforcement learning agent

Description
A model-based policy optimization (MBPO) agent is a model-based, online, off-policy, reinforcement
learning method. An MBPO agent contains an internal model of the environment, which it uses to
generate additional experiences without interacting with the environment.

During training, the MBPO agent generates real experiences by interacting with the environment.
These experiences are used to train the internal environment model, which is used to generate
additional experiences. The training algorithm then uses both the real and generated experiences to
update the agent policy.

Creation

Syntax
agent = rlMBPOAgent(baseAgent,envModel)
agent = rlMBPOAgent( ___ ,agentOptions)

Description

agent = rlMBPOAgent(baseAgent,envModel) creates a model-based policy optimization agent


with default options and sets the BaseAgent and EnvModel properties.

agent = rlMBPOAgent( ___ ,agentOptions) creates a model-based policy optimization agent


using specified options and sets the AgentOptions property.

Properties
BaseAgent — Base reinforcement learning agent
rlDQNAgent | rlDDPGAgent | rlTD3Agent | rlSACAgent

Base reinforcement learning agent, specified as an off-policy agent object.

For environments with a discrete action space, specify a DQN agent using an rlDQNAgent object.

For environments with a continuous action space, use one of the following agent objects.

• rlDDPGAgent — DDPG agent


• rlTD3Agent — TD3 agent
• rlSACAgent — SAC agent

EnvModel — Environment model


rlNeuralNetworkEnvironment


Environment model, specified as an rlNeuralNetworkEnvironment object. This environment


contains transition functions, a reward function, and an is-done function.

AgentOptions — Agent options


rlMBPOAgentOptions object

Agent options, specified as an rlMBPOAgentOptions object.

RolloutHorizon — Current roll-out horizon value


positive integer

Current roll-out horizon value, specified as a positive integer. For more information on setting the
initial horizon value and the horizon update method, see rlMBPOAgentOptions.

ModelExperienceBuffer — Model experience buffer


rlReplayMemory object

Model experience buffer, specified as an rlReplayMemory object. During training the agent stores
each of its generated experiences (S,A,R,S',D) in a buffer. Here:

• S is the current observation of the environment.


• A is the action taken by the agent.
• R is the reward for taking action A.
• S' is the next observation after taking action A.
• D is the is-done signal after taking action A.

UseExplorationPolicy — Option to use exploration policy


true | false

Option to use exploration policy when selecting actions, specified as one of the following logical
values.

• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.

The initial value of UseExplorationPolicy matches the value specified in BaseAgent. If you
change the value of UseExplorationPolicy in either the base agent or the MBPO agent, the same
value is used for the other agent.

ObservationInfo — Observation specifications


specification object | array of specification objects

This property is read-only.

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data type, and names of the observation
signals.

The value of ObservationInfo matches the corresponding value specified in BaseAgent.

ActionInfo — Action specification


specification object | array of specification objects

This property is read-only.


Action specification, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data type, and names of the action
signals.

The value of ActionInfo matches the corresponding value specified in BaseAgent.

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

The initial value of SampleTime matches the value specified in BaseAgent. If you change the value
of SampleTime in either the base agent or the MBPO agent, the same value is used for the other
agent.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified environment

Examples

Create MBPO Agent

Create an environment interface and extract observation and action specifications.

env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a base off-policy agent. For this example, use a SAC agent.

agentOpts = rlSACAgentOptions;
agentOpts.MiniBatchSize = 256;
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);
baseagent = rlSACAgent(obsInfo,actInfo,initOpts,agentOpts);

Check your agent with a random input observation.

getAction(baseagent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-7.2875]}


The neural network environment uses a function approximator object to approximate the
environment transition function. The function approximator object uses one or more neural networks
as its approximation model. To account for modeling uncertainty, you can specify multiple transition
models. For this example, create a single transition model.

Create a neural network to use as approximation model within the transition function object. Define
each network path as an array of layer objects. Specify a name for the input and output layers, so you
can later explicitly associate them with the appropriate channel.
% Observation and action paths
obsPath = featureInputLayer(obsInfo.Dimension(1),Name="obsIn");
actionPath = featureInputLayer(actInfo.Dimension(1),Name="actIn");

% Common path: concatenate along dimension 1


commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObsOut")];

% Add layers to layerGraph object


transNet = layerGraph(obsPath);
transNet = addLayers(transNet,actionPath);
transNet = addLayers(transNet,commonPath);

% Connect layers
transNet = connectLayers(transNet,"obsIn","concat/in1");
transNet = connectLayers(transNet,"actIn","concat/in2");

% Convert to dlnetwork object


transNet = dlnetwork(transNet);

% Display number of weights


summary(transNet)

Initialized: true

Number of learnables: 4.8k

Inputs:
1 'obsIn' 4 features
2 'actIn' 1 features

Create the transition function approximator object.


transitionFcnAppx = rlContinuousDeterministicTransitionFunction( ...
transNet,obsInfo,actInfo,...
ObservationInputNames="obsIn",...
ActionInputNames="actIn",...
NextObservationOutputNames="nextObsOut");

Create a neural network to use as a reward model for the reward function approximator object.
% Observation and action paths
actionPath = featureInputLayer(actInfo.Dimension(1),Name="actIn");
nextObsPath = featureInputLayer(obsInfo.Dimension(1),Name="nextObsIn");


% Common path: concatenate along dimension 1


commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(1)];

% Add layers to layerGraph object


rewardNet = layerGraph(nextObsPath);
rewardNet = addLayers(rewardNet,actionPath);
rewardNet = addLayers(rewardNet,commonPath);

% Connect layers
rewardNet = connectLayers(rewardNet,"nextObsIn","concat/in1");
rewardNet = connectLayers(rewardNet,"actIn","concat/in2");

% Convert to dlnetwork object


rewardNet = dlnetwork(rewardNet);

% Display number of weights

summary(rewardNet)

Initialized: true

Number of learnables: 8.8k

Inputs:
1 'nextObsIn' 4 features
2 'actIn' 1 features

Create the reward function approximator object.


rewardFcnAppx = rlContinuousDeterministicRewardFunction( ...
rewardNet,obsInfo,actInfo, ...
ActionInputNames="actIn",...
NextObservationInputNames="nextObsIn");

Create a neural network to use as an is-done model for the is-done function approximator object.


% Define main path
net = [featureInputLayer(obsInfo.Dimension(1),Name="nextObsIn");
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(2)
softmaxLayer(Name="isdoneOut")];

% Convert to layergraph object


isDoneNet = layerGraph(net);

% Convert to dlnetwork object


isDoneNet = dlnetwork(isDoneNet);

% Display number of weights

summary(isDoneNet)

Initialized: true

Number of learnables: 4.6k

Inputs:
1 'nextObsIn' 4 features

Create the is-done function approximator object.

isdoneFcnAppx = rlIsDoneFunction(isDoneNet,obsInfo,actInfo, ...


NextObservationInputNames="nextObsIn");

Create the neural network environment using the observation and action specifications and the three
function approximator objects.

generativeEnv = rlNeuralNetworkEnvironment( ...


obsInfo,actInfo,...
transitionFcnAppx,rewardFcnAppx,isdoneFcnAppx);

Specify options for creating an MBPO agent. Specify the optimizer options for the transition network
and use default values for all other options.

MBPOAgentOpts = rlMBPOAgentOptions;
MBPOAgentOpts.TransitionOptimizerOptions = rlOptimizerOptions(...
LearnRate=1e-4,...
GradientThreshold=1.0);

Create the MBPO agent.

agent = rlMBPOAgent(baseagent,generativeEnv,MBPOAgentOpts);

Check your agent with a random input observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[7.8658]}
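
You can then train the agent against the real environment env in the usual way. The training options below are only an illustrative sketch, not values taken from this example.

trainOpts = rlTrainingOptions(...
    MaxEpisodes=200,...
    MaxStepsPerEpisode=500,...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=470);
% trainResult = train(agent,env,trainOpts);   % uncomment to run the training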

Version History
Introduced in R2022a

See Also
Objects
rlMBPOAgentOptions | rlNeuralNetworkEnvironment

Topics
“Model-Based Policy Optimization Agents”
“Train MBPO Agent to Balance Cart-Pole System”


rlMBPOAgentOptions
Options for MBPO agent

Description
Use an rlMBPOAgentOptions object to specify options for model-based policy optimization (MBPO)
agents. To create an MBPO agent, use rlMBPOAgent.

For more information, see “Model-Based Policy Optimization Agents”.

Creation

Syntax
opt = rlMBPOAgentOptions
opt = rlMBPOAgentOptions(Name=Value)

Description

opt = rlMBPOAgentOptions creates an option object for use as an argument when creating an
MBPO agent using all default options. You can modify the object properties using dot notation.

opt = rlMBPOAgentOptions(Name=Value) sets option properties on page 3-172 using name-
value pair arguments. For example, rlMBPOAgentOptions(MiniBatchSize=64) creates an
option set with a mini-batch size of 64. You can specify multiple name-value pair arguments.

Properties
NumEpochForTrainingModel — Number of epochs
5 (default) | positive integer

Number of epochs for training the environment model, specified as a positive integer.

NumMiniBatches — Number of mini-batches


10 (default) | positive integer | "all"

Number of mini-batches used in each environment model training epoch, specified as a positive
integer or "all". When you set NumMiniBatches to "all", the agent selects the number of mini-
batches such that all samples in the base agent's experience buffer are used to train the model.

MiniBatchSize — Size of random experience mini-batch


128 (default) | positive integer

Size of random experience mini-batch for training environment model, specified as a positive integer.
During each model training episode, the agent randomly samples experiences from the experience
buffer when computing gradients for updating the environment model properties. Large mini-batches
reduce the variance when computing gradients but increase the computational effort.


ModelExperienceBufferLength — Generated experience buffer size


100000 (default) | positive integer

Generated experience buffer size, specified as a positive integer. When the agent generates
experiences, they are added to the model experience buffer.

RealSampleRatio — Ratio of real experiences in a mini-batch


0.2 (default) | nonnegative scalar less than or equal to 1

Ratio of real experiences in a mini-batch for agent training, specified as a nonnegative scalar less
than or equal to 1.

TransitionOptimizerOptions — Transition function optimizer options


rlOptimizerOptions object | array of rlOptimizerOptions objects

Transition function optimizer options, specified as one of the following:

• rlOptimizerOptions object — When your neural network environment has a single transition
function or if you want to use the same options for multiple transition functions, specify a single
options object.
• Array of rlOptimizerOptions objects — When your neural network environment agent has
multiple transition functions and you want to use different optimizer options for the transition
functions, specify an array of options objects with length equal to the number of transition
functions.

Using these objects, you can specify training parameters for the transition deep neural network
approximators as well as the optimizer algorithms and parameters.

If you have previously trained transition models and do not want the MBPO agent to modify these
models during training, set TransitionOptimizerOptions.LearnRate to 0.
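
For instance, the following sketch (the learning rates are illustrative values) assigns separate optimizer options to an environment model that uses two transition functions.

opt = rlMBPOAgentOptions;
opt.TransitionOptimizerOptions = [ ...
    rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1.0), ...
    rlOptimizerOptions(LearnRate=5e-4,GradientThreshold=1.0)];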

RewardOptimizerOptions — Reward function optimizer options


rlOptimizerOptions object

Reward function optimizer options, specified as an rlOptimizerOptions object. Using this object,
you can specify training parameters for the reward deep neural network approximator as well as the
optimizer algorithm and its parameters.

If you specify a ground-truth reward function using a custom function, the MBPO agent ignores these
options.

If you have a previously trained reward model and do not want the MBPO agent to modify the model
during training, set RewardOptimizerOptions.LearnRate to 0.

IsDoneOptimizerOptions — Is-done function optimizer options


rlOptimizerOptions object

Is-done function optimizer options, specified as an rlOptimizerOptions object. Using this object,
you can specify training parameters for the is-done deep neural network approximator as well as the
optimizer algorithm and its parameters.

If you specify a ground-truth is-done function using a custom function, the MBPO agent ignores these
options.


If you have a previously trained is-done model and do not want the MBPO agent to modify the model
during training, set IsDoneOptimizerOptions.LearnRate to 0.

ModelRolloutOptions — Model roll-out options


rlModelRolloutOptions object

Model roll-out options for controlling the number and length of generated experience trajectories,
specified as an rlModelRolloutOptions object with the following fields. At the start of each epoch,
the agent generates the roll-out trajectories and adds them to the model experience buffer. To modify
the roll-out options, use dot notation.

NumRollout — Number of trajectories


2000 (default) | positive integer

Number of trajectories for generating samples, specified as a positive integer.

Horizon — Initial trajectory horizon


1 (default) | positive integer

Initial trajectory horizon, specified as a positive integer.

HorizonUpdateSchedule — Option for increasing horizon length


"none" (default) | "piecewise"

Option for increasing the horizon length, specified as one of the following values.

• "none" — Do not increase the horizon length.


• "piecewise" — Increase the horizon length by one after every N model training epochs, where
N is equal to HorizonUpdateFrequency.

RolloutHorizonUpdateFrequency — Number of epochs after which the horizon increases


100 (default) | positive integer

Number of epochs after which the horizon increases, specified as a positive integer. When
HorizonUpdateSchedule is "none", this option is ignored.

HorizonMax — Maximum horizon length


20 (default) | positive integer

Maximum horizon length, specified as a positive integer greater than or equal to Horizon. When
HorizonUpdateSchedule is "none", this option is ignored.

HorizonUpdateStartEpoch — Training epoch at which to start generating trajectories


1 (default) | positive integer

Training epoch at which to start generating trajectories, specified as a positive integer.

NoiseOptions — Exploration model options


[] (default) | EpsilonGreedyExploration object | GaussianActionNoise object

Exploration model options for generating experiences using the internal environment model,
specified as one of the following:

• [] — Use the exploration policy of the base agent. You must use this option when training a SAC
base agent.


• EpsilonGreedyExploration object — You can use this option when training a DQN base agent.
• GaussianActionNoise object — You can use this option when training a DDPG or TD3 base
agent.

The exploration model uses only the initial noise option values and does not update the values during
training.

To specify NoiseOptions, create a default model object. Then, specify any nondefault model
properties using dot notation.

• Specify epsilon greedy exploration options.


opt = rlMBPOAgentOptions;
opt.ModelRolloutOptions.NoiseOptions = ...
rl.option.EpsilonGreedyExploration;
opt.ModelRolloutOptions.NoiseOptions.EpsilonMin = 0.03;
• Specify Gaussian action noise options.
opt = rlMBPOAgentOptions;
opt.ModelRolloutOptions.NoiseOptions = ...
rl.option.GaussianActionNoise;
opt.ModelRolloutOptions.NoiseOptions.StandardDeviation = sqrt(0.15);

For more information on noise models, see “Noise Models” on page 3-176.
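
As a brief illustration of the roll-out options described above, the following sketch adjusts the trajectory count and horizon schedule using dot notation; the values are arbitrary.

opt = rlMBPOAgentOptions;
opt.ModelRolloutOptions.NumRollout = 4000;
opt.ModelRolloutOptions.Horizon = 2;
opt.ModelRolloutOptions.HorizonUpdateSchedule = "piecewise";
opt.ModelRolloutOptions.HorizonMax = 10;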

Object Functions
rlMBPOAgent Model-based policy optimization reinforcement learning agent

Examples

Create MBPO Agent Options Object

Create an MBPO agent options object, specifying the ratio of real experiences to use for training the
agent as 30%.
opt = rlMBPOAgentOptions(RealSampleRatio=0.3)

opt =
rlMBPOAgentOptions with properties:

NumEpochForTrainingModel: 1
NumMiniBatches: 10
MiniBatchSize: 128
TransitionOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
RewardOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
IsDoneOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
ModelExperienceBufferLength: 100000
ModelRolloutOptions: [1x1 rl.option.rlModelRolloutOptions]
RealSampleRatio: 0.3000
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the mini-batch size to 64.
opt.MiniBatchSize = 64;


Algorithms
Noise Models
Gaussian Action Noise

A GaussianActionNoise object has the following numeric value properties. When generating
experiences, MBPO agents do not update their exploration model parameters.

• Mean — Noise model mean. Default: 0
• StandardDeviation — Noise model standard deviation. Default: sqrt(0.2)
• StandardDeviationDecayRate — Decay rate of the standard deviation (not used for generating
samples). Default: 0
• StandardDeviationMin — Minimum standard deviation, which must be less than
StandardDeviation (not used for generating samples). Default: 0.1
• LowerLimit — Noise sample lower limit. Default: -Inf
• UpperLimit — Noise sample upper limit. Default: Inf

At each time step k, the Gaussian noise v is sampled as shown in the following code.

w = Mean + randn(ActionSize).*StandardDeviation(k);
v(k+1) = min(max(w,LowerLimit),UpperLimit);

Epsilon Greedy Exploration

An EpsilonGreedyExploration object has the following numeric value properties. When
generating experiences, MBPO agents do not update their exploration model parameters.

• Epsilon — Probability threshold to either randomly select an action or select the action that
maximizes the state-action value function. A larger value of Epsilon means that the agent
randomly explores the action space at a higher rate. Default: 1
• EpsilonMin — Minimum value of Epsilon (not used for generating samples). Default: 0.01
• EpsilonDecay — Decay rate (not used for generating samples). Default: 0.005

Version History
Introduced in R2022a

See Also
Objects
rlMBPOAgent | rlNeuralNetworkEnvironment

Topics
“Model-Based Policy Optimization Agents”


rlMDPEnv
Create Markov decision process environment for reinforcement learning

Description
A Markov decision process (MDP) is a discrete time stochastic control process. It provides a
mathematical framework for modeling decision making in situations where outcomes are partly
random and partly under the control of the decision maker. MDPs are useful for studying optimization
problems solved using reinforcement learning. Use rlMDPEnv to create a Markov decision process
environment for reinforcement learning in MATLAB.

Creation

Syntax
env = rlMDPEnv(MDP)

Description

env = rlMDPEnv(MDP) creates a reinforcement learning environment env with the specified MDP
model.

Input Arguments

MDP — Markov decision process model


GridWorld object | GenericMDP object

Markov decision process model, specified as one of the following:

• GridWorld object created using createGridWorld.


• GenericMDP object created using createMDP.

Properties
Model — Markov decision process model
GridWorld object | GenericMDP object

Markov decision process model, specified as a GridWorld object or GenericMDP object.

ResetFcn — Reset function


function handle

Reset function, specified as a function handle.
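
For example (a sketch, with the state index chosen arbitrarily), you can make every training episode start from a specific state by assigning a handle that returns the index of that state.

env.ResetFcn = @() 2;   % always reset to the state with index 2 (assumed)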


Object Functions
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
sim Simulate trained reinforcement learning agents within specified environment
train Train reinforcement learning agents within a specified environment
validateEnvironment Validate custom reinforcement learning environment

Examples

Create Grid World Environment

For this example, consider a 5-by-5 grid world with the following rules:
1 A 5-by-5 grid world bounded by borders, with 4 possible actions (North = 1, South = 2, East = 3,
West = 4).
2 The agent begins from cell [2,1] (second row, first column).
3 The agent receives reward +10 if it reaches the terminal state at cell [5,5] (blue).
4 The environment contains a special jump from cell [2,4] to cell [4,4] with +5 reward.
5 The agent is blocked by obstacles in cells [3,3], [3,4], [3,5] and [4,3] (black cells).
6 All other actions result in -1 reward.

First, create a GridWorld object using the createGridWorld function.


GW = createGridWorld(5,5)

GW =
GridWorld with properties:


GridSize: [5 5]
CurrentState: "[1,1]"
States: [25x1 string]
Actions: [4x1 string]
T: [25x25x4 double]
R: [25x25x4 double]
ObstacleStates: [0x1 string]
TerminalStates: [0x1 string]
ProbabilityTolerance: 8.8818e-16

Now, set the initial, terminal and obstacle states.

GW.CurrentState = '[2,1]';
GW.TerminalStates = '[5,5]';
GW.ObstacleStates = ["[3,3]";"[3,4]";"[3,5]";"[4,3]"];

Update the state transition matrix for the obstacle states and set the jump rule over the obstacle
states.

updateStateTranstionForObstacles(GW)
GW.T(state2idx(GW,"[2,4]"),:,:) = 0;
GW.T(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 1;

Next, define the rewards in the reward transition matrix.

nS = numel(GW.States);
nA = numel(GW.Actions);
GW.R = -1*ones(nS,nS,nA);
GW.R(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 5;
GW.R(:,state2idx(GW,GW.TerminalStates),:) = 10;

Now, use rlMDPEnv to create a grid world environment using the GridWorld object GW.

env = rlMDPEnv(GW)

env =
rlMDPEnv with properties:

Model: [1x1 rl.env.GridWorld]


ResetFcn: []

You can visualize the grid world environment using the plot function.

plot(env)
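
From here, one possible next step, shown only as a sketch and not part of this example, is to build a tabular Q-learning agent from the environment specifications and train it.

% Tabular Q-value critic and Q-learning agent (sketch).
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
qTable = rlTable(obsInfo,actInfo);
critic = rlQValueFunction(qTable,obsInfo,actInfo);
agent = rlQAgent(critic);
% trainStats = train(agent,env,rlTrainingOptions(MaxEpisodes=200));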


Version History
Introduced in R2019a

See Also
createGridWorld | rlPredefinedEnv

Topics
“Train Reinforcement Learning Agent in Basic Grid World”
“Create Custom Grid World Environments”
“Train Reinforcement Learning Agent in MDP Environment”


rlMultiAgentTrainingOptions
Options for training multiple reinforcement learning agents

Description
Use an rlMultiAgentTrainingOptions object to specify training options for multiple agents. To
train the agents, use train.

For more information on training agents, see “Train Reinforcement Learning Agents”.

Creation

Syntax
trainOpts = rlMultiAgentTrainingOptions
trainOpts = rlMultiAgentTrainingOptions(Name,Value)

Description

trainOpts = rlMultiAgentTrainingOptions returns the default options for training multiple


reinforcement learning agents. Use training options to specify parameters for the training session,
such as the maximum number of episodes to train, criteria for stopping training, and criteria for
saving agents. After configuring the options, use trainOpts as an input argument for train.

trainOpts = rlMultiAgentTrainingOptions(Name,Value) creates a training option set and


sets object “Properties” on page 3-182 using one or more name-value pair arguments.

Properties
AgentGroups — Agent grouping indices
"auto" (default) | cell array of positive integers | cell array of integer arrays

Agent grouping indices, specified as a cell array of positive integers or a cell array of integer arrays.

For instance, consider a training scenario with 4 agents. You can group the agents in the following
ways:

• Allocate each agent in a separate group:

trainOpts = rlMultiAgentTrainingOptions("AgentGroups","auto")
• Specify four agent groups with one agent in each group:

trainOpts = rlMultiAgentTrainingOptions("AgentGroups",{1,2,3,4})
• Specify two agent groups with two agents each:

trainOpts = rlMultiAgentTrainingOptions("AgentGroups",{[1,2],[3,4]})
• Specify three agent groups:


trainOpts = rlMultiAgentTrainingOptions("AgentGroups",{[1,4],2,3})

AgentGroups and LearningStrategy must be used together to specify whether agent groups
learn in a centralized manner or decentralized manner.
Example: AgentGroups={1,2,[3,4]}

LearningStrategy — Learning strategy for each agent group


"decentralized" (default) | "centralized"

Learning strategy for each agent group, specified as either "decentralized" or "centralized".
In decentralized training, agents collect their own set of experiences during the episodes and learn
independently from those experiences. In centralized training, the agents share the collected
experiences and learn from them together.

AgentGroups and LearningStrategy must be used together to specify whether agent groups
learn in a centralized manner or decentralized manner. For example, you can use the following
command to configure training for three agent groups with different learning strategies. The agents
with indices [1,2] and [3,5] learn in a centralized manner, while agent 4 learns in a decentralized
manner.
trainOpts = rlMultiAgentTrainingOptions(...
AgentGroups={[1,2],4,[3,5]},...
LearningStrategy=["centralized","decentralized","centralized"] )

Example: LearningStrategy="centralized"

MaxEpisodes — Maximum number of episodes to train the agents


500 (default) | positive integer

Maximum number of episodes to train the agents, specified as a positive integer. Regardless of other
criteria for termination, training terminates after MaxEpisodes.
Example: MaxEpisodes=1000

MaxStepsPerEpisode — Maximum number of steps to run per episode


500 (default) | positive integer

Maximum number of steps to run per episode, specified as a positive integer. In general, you define
episode termination conditions in the environment. This value is the maximum number of steps to run
in the episode if other termination conditions are not met.
Example: MaxStepsPerEpisode=1000

ScoreAveragingWindowLength — Window length for averaging


5 (default) | positive integer scalar | positive integer vector

Window length for averaging the scores, rewards, and number of steps for each agent, specified as a
scalar or vector.

Specify a scalar to apply the same window length to all agents. To use a different window length for
each agent, specify ScoreAveragingWindowLength as a vector. In this case, the order of the
elements in the vector correspond to the order of the agents used during environment creation.

For options expressed in terms of averages, ScoreAveragingWindowLength is the number of
episodes included in the average. For instance, if StopTrainingCriteria is "AverageReward"
and StopTrainingValue is 500 for a given agent, then for that agent training terminates when the
average reward over the number of episodes specified in ScoreAveragingWindowLength equals or
exceeds 500. For the other agents, training continues until:

• All agents reach their stop criteria.


• The number of episodes reaches MaxEpisodes.
• You stop training by clicking the Stop Training button in Episode Manager or pressing Ctrl-C at
the MATLAB command line.

Example: ScoreAveragingWindowLength=10

StopTrainingCriteria — Training termination condition


"AverageSteps" (default) | "AverageReward" | "EpisodeReward" | "GlobalStepCount" |
"EpisodeCount" | ...

Training termination condition, specified as one of the following strings:

• "AverageSteps" — Stop training when the running average number of steps per episode equals
or exceeds the critical value specified by the option StopTrainingValue. The average is
computed using the window 'ScoreAveragingWindowLength'.
• "AverageReward" — Stop training when the running average reward equals or exceeds the
critical value.
• "EpisodeReward" — Stop training when the reward in the current episode equals or exceeds the
critical value.
• "GlobalStepCount" — Stop training when the total number of steps in all episodes (the total
number of times the agent is invoked) equals or exceeds the critical value.
• "EpisodeCount" — Stop training when the number of training episodes equals or exceeds the
critical value.

Example: StopTrainingCriteria="AverageReward"

StopTrainingValue — Critical value of training termination condition


500 (default) | scalar | vector

Critical value of the training termination condition, specified as a scalar or a vector.

Specify a scalar to apply the same termination criterion to all agents. To use a different termination
criterion for each agent, specify StopTrainingValue as a vector. In this case, the order of the
elements in the vector corresponds to the order of the agents used during environment creation.

For a given agent, training ends when the termination condition specified by the
StopTrainingCriteria option equals or exceeds this value. For the other agents, the training
continues until:

• All agents reach their stop criteria.


• The number of episodes reaches MaxEpisodes.
• You stop training by clicking the Stop Training button in Episode Manager or pressing Ctrl-C at
the MATLAB command line.

For instance, if StopTrainingCriteria is "AverageReward" and StopTrainingValue is 100 for


a given agent, then for that agent training terminates when the average reward over the number of
episodes specified in ScoreAveragingWindowLength equals or exceeds 100.
Example: StopTrainingValue=100


SaveAgentCriteria — Condition for saving agents during training


"None" (default) | "EpisodeReward" | "AverageSteps" | "AverageReward" |
"GlobalStepCount" | "EpisodeCount" | ...

Condition for saving agents during training, specified as one of the following strings:

• "None" — Do not save any agents during training.


• "EpisodeReward" — Save the agent when the reward in the current episode equals or exceeds
the critical value.
• "AverageSteps" — Save the agent when the running average number of steps per episode
equals or exceeds the critical value specified by the option StopTrainingValue. The average is
computed using the window 'ScoreAveragingWindowLength'.
• "AverageReward" — Save the agent when the running average reward over all episodes equals
or exceeds the critical value.
• "GlobalStepCount" — Save the agent when the total number of steps in all episodes (the total
number of times the agent is invoked) equals or exceeds the critical value.
• "EpisodeCount" — Save the agent when the number of training episodes equals or exceeds the
critical value.

Set this option to store candidate agents that perform well according to the criteria you specify. When
you set this option to a value other than "none", the software sets the SaveAgentValue option to
500. You can change that value to specify the condition for saving the agent.

For instance, suppose you want to store for further testing any agent that yields an episode reward
that equals or exceeds 100. To do so, set SaveAgentCriteria to "EpisodeReward" and set the
SaveAgentValue option to 100. When an episode reward equals or exceeds 100, train saves the
corresponding agent in a MAT file in the folder specified by the SaveAgentDirectory option. The
MAT file is called AgentK.mat, where K is the number of the corresponding episode. The agent is
stored within that MAT file as saved_agent.
Example: SaveAgentCriteria="EpisodeReward"

SaveAgentValue — Critical value of condition for saving agents


"none" (default) | 500 | scalar | vector

Critical value of the condition for saving agents, specified as a scalar or a vector.

Specify a scalar to apply the same saving criterion to each agent. To save the agents when one meets
a particular criterion, specify SaveAgentValue as a vector. In this case, the order of the elements in
the vector corresponds to the order of the agents used when creating the environment. When a
criteria for saving an agent is met, all agents are saved in the same MAT file.

When you specify a condition for saving candidate agents using SaveAgentCriteria, the software
sets this value to 500. Change the value to specify the condition for saving the agent. See the
SaveAgentCriteria option for more details.
Example: SaveAgentValue=100

SaveAgentDirectory — Folder for saved agents


"savedAgents" (default) | string | character vector

Folder for saved agents, specified as a string or character vector. The folder name can contain a full
or relative path. When an episode occurs that satisfies the condition specified by the


SaveAgentCriteria and SaveAgentValue options, the software saves the agents in a MAT file in
this folder. If the folder does not exist, train creates it. When SaveAgentCriteria is "none", this
option is ignored and train does not create a folder.
Example: SaveAgentDirectory = pwd + "\run1\Agents"

StopOnError — Option to stop training when error occurs


"on" (default) | "off"

Option to stop training when an error occurs during an episode, specified as "on" or "off". When
this option is "off", errors are captured and returned in the SimulationInfo output of train, and
training continues to the next episode.
Example: StopOnError = "off"

Verbose — Option to display training progress on the command line


false (0) (default) | true (1)

Option to display training progress on the command line, specified as the logical value false (0) or
true (1). Set to true to write information from each training episode to the MATLAB command line
during training.
Example: Verbose = true

Plots — Option to display training progress with Episode Manager


"training-progress" (default) | "none"

Option to display training progress with Episode Manager, specified as "training-progress" or


"none". By default, calling train opens the Reinforcement Learning Episode Manager, which
graphically and numerically displays information about the training progress, such as the reward for
each episode, average reward, number of episodes, and total number of steps. (For more information,
see train.) To turn off this display, set this option to "none".
Example: Plots = "none"

Object Functions
train Train reinforcement learning agents within a specified environment

Examples

Configure Options for Multi Agent Training

Create an options set for training 5 reinforcement learning agents. Set the maximum number of
episodes and the maximum number of steps per episode to 1000. Configure the options to stop
training when the average reward equals or exceeds 480, and turn on both the command-line display
and Reinforcement Learning Episode Manager for displaying training results. You can set the options
using name-value pair arguments when you create the options set. Any options that you do not
explicitly set have their default values.

trainOpts = rlMultiAgentTrainingOptions(...
AgentGroups={[1,2],3,[4,5]},...
LearningStrategy=["centralized","decentralized","centralized"],...
MaxEpisodes=1000,...
MaxStepsPerEpisode=1000,...


StopTrainingCriteria="AverageReward",...
StopTrainingValue=480,...
Verbose=true,...
Plots="training-progress")

trainOpts =
rlMultiAgentTrainingOptions with properties:

AgentGroups: {[1 2] [3] [4 5]}


LearningStrategy: ["centralized" "decentralized" ... ]
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 5
StopTrainingCriteria: "AverageReward"
StopTrainingValue: 480
SaveAgentCriteria: "none"
SaveAgentValue: "none"
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"

Alternatively, create a default options set and use dot notation to change some of the values.

trainOpts = rlMultiAgentTrainingOptions;

trainOpts.AgentGroups = {[1,2],3,[4,5]};
trainOpts.LearningStrategy = ["centralized","decentralized","centralized"];
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 1000;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 480;
trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";

trainOpts

trainOpts =
rlMultiAgentTrainingOptions with properties:

AgentGroups: {[1 2] [3] [4 5]}


LearningStrategy: ["centralized" "decentralized" ... ]
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 5
StopTrainingCriteria: "AverageReward"
StopTrainingValue: 480
SaveAgentCriteria: "none"
SaveAgentValue: "none"
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"

You can now use trainOpts as an input argument to the train command.


Configure Options for Training a Multi-Agent Environment

Create an options object for concurrently training three agents in the same environment.

Set the maximum number of episodes and the maximum steps per episode to 1000. Configure the
options to stop training the first agent when its average reward over 5 episodes equals or exceeds
400, the second agent when its average reward over 10 episodes equals or exceeds 500, and the
third when its average reward over 15 episodes equals or exceeds 600. The order of agents is the one
used during environment creation.

Save the agents when the reward for the first agent in the current episode exceeds 100, when the
reward for the second agent exceeds 120, or when the reward for the third agent equals or exceeds 140.

Turn on both the command-line display and Reinforcement Learning Episode Manager for displaying
training results. You can set the options using name-value pair arguments when you create the
options set. Any options that you do not explicitly set have their default values.

trainOpts = rlMultiAgentTrainingOptions(...
MaxEpisodes=1000,...
MaxStepsPerEpisode=1000,...
ScoreAveragingWindowLength=[5 10 15],...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=[400 500 600],...
SaveAgentCriteria="EpisodeReward",...
SaveAgentValue=[100 120 140],...
Verbose=true,...
Plots="training-progress")

trainOpts =
rlMultiAgentTrainingOptions with properties:

AgentGroups: "auto"
LearningStrategy: "decentralized"
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: [5 10 15]
StopTrainingCriteria: "AverageReward"
StopTrainingValue: [400 500 600]
SaveAgentCriteria: "EpisodeReward"
SaveAgentValue: [100 120 140]
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"

Alternatively, create a default options set and use dot notation to change some of the values.

trainOpts = rlMultiAgentTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 1000;

trainOpts.ScoreAveragingWindowLength = [5 10 15];

trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = [400 500 600];

trainOpts.SaveAgentCriteria = "EpisodeReward";
trainOpts.SaveAgentValue = [100 120 140];

trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";

trainOpts

trainOpts =
rlMultiAgentTrainingOptions with properties:

AgentGroups: "auto"
LearningStrategy: "decentralized"
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: [5 10 15]
StopTrainingCriteria: "AverageReward"
StopTrainingValue: [400 500 600]
SaveAgentCriteria: "EpisodeReward"
SaveAgentValue: [100 120 140]
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"

You can specify a scalar to apply the same criterion to all agents. For example, use a window length of
10 for all three agents.

trainOpts.ScoreAveragingWindowLength = 10

trainOpts =
rlMultiAgentTrainingOptions with properties:

AgentGroups: "auto"
LearningStrategy: "decentralized"
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 10
StopTrainingCriteria: "AverageReward"
StopTrainingValue: [400 500 600]
SaveAgentCriteria: "EpisodeReward"
SaveAgentValue: [100 120 140]
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"

You can now use trainOpts as an input argument to the train command.

Version History
Introduced in R2022a


See Also
train | rlTrainingOptions

Topics
“Reinforcement Learning Agents”


rlNeuralNetworkEnvironment
Environment model with deep neural network transition models

Description
Use an rlNeuralNetworkEnvironment object to create a reinforcement learning environment that
computes state transitions using deep neural networks.

Using an rlNeuralNetworkEnvironment object you can:

• Create an internal environment model for a model-based policy optimization (MBPO) agent.
• Create an environment for training other types of reinforcement learning agents. You can identify
the state-transition network using experimental or simulated data.

Such environments can compute environment rewards and termination conditions using deep neural
networks or custom functions.

Creation

Syntax
env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,rewardFcn,
isDoneFcn)

Description

env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,rewardFcn,
isDoneFcn) creates a model for an environment with the observation and action specifications
specified in obsInfo and actInfo, respectively. This syntax sets the TransitionFcn, RewardFcn,
and IsDoneFcn properties.

Input Arguments

obsInfo — Observation specifications


specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data type, and names of the observation
signals.

You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

actInfo — Action specifications


specification object


Action specifications, specified as a reinforcement learning specification object defining properties


such as dimensions, data type, and names of the action signals.

You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.

Properties
TransitionFcn — Environment transition function
rlContinuousDeterministicTransitionFunction object |
rlContinuousGaussianTransitionFunction object | array of transition objects

Environment transition function, specified as one of the following:

• rlContinuousDeterministicTransitionFunction object — Use this option when you expect


the environment transitions to be deterministic.
• rlContinuousGaussianTransitionFunction object — Use this option when you expect the
environment transitions to be stochastic.
• Vector of transition objects — Use multiple transition models for an MBPO agent.

RewardFcn — Environment reward function


rlContinuousDeterministicRewardFunction object |
rlContinuousGaussianRewardFunction object | function handle

Environment reward function, specified as one of the following:

• rlContinuousDeterministicRewardFunction object — Use this option when you do not


know a ground-truth reward signal for your environment and you expect the reward signal to be
deterministic.
• rlContinuousGaussianRewardFunction object — Use this option when you do not know a
ground-truth reward signal for your environment and you expect the reward signal to be
stochastic.
• Function handle — Use this option when you know a ground-truth reward signal for your
environment. When you use an rlNeuralNetworkEnvironment object to create an
rlMBPOAgent object, the custom reward function must return a batch of rewards given a batch of
inputs.

IsDoneFcn — Environment is-done function


rlIsDoneFunction object | function handle

Environment is-done function, specified as one of the following:

• rlIsDoneFunction object — Use this option when you do not know a ground-truth termination
signal for your environment.
• Function handle — Use this option when you know a ground-truth termination signal for your
environment. When you use an rlNeuralNetworkEnvironment object to create an
rlMBPOAgent object, the custom is-done function must return a batch of termination signals
given a batch of inputs.

Observation — Observation values


cell array


Observation values, specified as a cell array with length equal to the number of specification objects
in obsInfo. The order of the observations in Observation must match the order in obsInfo. Also,
the dimensions of each element of the cell array must match the dimensions of the corresponding
observation specification in obsInfo.

To evaluate whether the transition models are well-trained, you can manually evaluate the
environment for a given observation value using the step function. Specify the observation values
before calling step.

When you use this neural network environment object within an MBPO agent, this property is
ignored.

TransitionModelNum — Transition model index


1 (default) | positive integer

Transition model index, specified as a positive integer.

To evaluate whether the transition models are well-trained, you can manually evaluate the
environment for a given observation value using the step function. To select which transition model
in TransitionFcn to evaluate, specify the transition model index before calling step.

When you use this neural network environment object within an MBPO agent, this property is
ignored.
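
As a rough sketch of this manual evaluation workflow (assuming env, obsInfo, and actInfo come from the first example below; the observation and action values are arbitrary, and the exact step calling form shown here is an assumption based on the standard environment interface):

% Select which transition model to evaluate and set the current observation.
env.TransitionModelNum = 1;
env.Observation = {rand(obsInfo.Dimension)};

% Step the model once with an arbitrary action and inspect the predicted
% next observation, reward, and termination signal.
[nextObs,reward,isDone] = step(env,rand(actInfo.Dimension));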

Object Functions
rlMBPOAgent Model-based policy optimization reinforcement learning agent

Examples

Create Neural Network Environment

Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Get the dimension of the observation and action spaces.


numObservations = obsInfo.Dimension(1);
numActions = actInfo.Dimension(1);

Create a deterministic transition function based on a deep neural network with two input channels
(current observations and actions) and one output channel (predicted next observation).
% Create network layers.
statePath = featureInputLayer(numObservations, ...
Normalization="none",Name="state");
actionPath = featureInputLayer(numActions, ...
Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64, Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(numObservations,Name="nextObservation")];

% Combine network layers.


transitionNetwork = layerGraph(statePath);
transitionNetwork = addLayers(transitionNetwork,actionPath);
transitionNetwork = addLayers(transitionNetwork,commonPath);
transitionNetwork = connectLayers( ...
transitionNetwork,"state","concat/in1");
transitionNetwork = connectLayers( ...
transitionNetwork,"action","concat/in2");

% Create dlnetwork object.


transitionNetwork = dlnetwork(transitionNetwork);

% Create transition function object.


transitionFcn = rlContinuousDeterministicTransitionFunction(...
transitionNetwork,obsInfo,actInfo,...
ObservationInputNames="state", ...
ActionInputNames="action", ...
NextObservationOutputNames="nextObservation");

Create a deterministic reward function with two input channels (current action and next
observations) and one output channel (predicted reward value).

% Create network layers.


nextStatePath = featureInputLayer( ...
numObservations,Name="nextState");
commonPath = [concatenationLayer(1,3,Name="concat")
fullyConnectedLayer(32,Name="fc")
reluLayer(Name="relu1")
fullyConnectedLayer(32,Name="fc2")];
meanPath = [reluLayer(Name="rewardMeanRelu")
fullyConnectedLayer(1,Name="rewardMean")];
stdPath = [reluLayer(Name="rewardStdRelu")
fullyConnectedLayer(1,Name="rewardStdFc")
softplusLayer(Name="rewardStd")];

% Combine network layers.


rewardNetwork = layerGraph(statePath);
rewardNetwork = addLayers(rewardNetwork,actionPath);
rewardNetwork = addLayers(rewardNetwork,nextStatePath);
rewardNetwork = addLayers(rewardNetwork,commonPath);
rewardNetwork = addLayers(rewardNetwork,meanPath);
rewardNetwork = addLayers(rewardNetwork,stdPath);

rewardNetwork = connectLayers( ...
    rewardNetwork,"nextState","concat/in1");
rewardNetwork = connectLayers( ...
rewardNetwork,"action","concat/in2");
rewardNetwork = connectLayers( ...
rewardNetwork,"state","concat/in3");
rewardNetwork = connectLayers( ...
rewardNetwork,"fc2","rewardMeanRelu");
rewardNetwork = connectLayers( ...
rewardNetwork,"fc2","rewardStdRelu");

% Create dlnetwork object.


rewardNetwork = dlnetwork(rewardNetwork);

% Create reward function object.


rewardFcn = rlContinuousGaussianRewardFunction(...
rewardNetwork,obsInfo,actInfo,...
ObservationInputNames="state",...
ActionInputNames="action", ...
NextObservationInputNames="nextState", ...
RewardMeanOutputNames="rewardMean", ...
RewardStandardDeviationOutputNames="rewardStd");

Create an is-done function with one input channel (next observations) and one output channel
(predicted termination signal).
% Create network layers.
commonPath = [featureInputLayer(numObservations, ...
Normalization="none",Name="nextState");
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(2,Name="isdone0")
softmaxLayer(Name="isdone")];
isDoneNetwork = layerGraph(commonPath);

% Create dlnetwork object.


isDoneNetwork = dlnetwork(isDoneNetwork);

% Create is-done function object.


isDoneFcn = rlIsDoneFunction(isDoneNetwork, ...
obsInfo,actInfo, ...
NextObservationInputNames="nextState");

Create a neural network environment using the transition, reward, and is-done functions.
env = rlNeuralNetworkEnvironment( ...
obsInfo,actInfo, ...
transitionFcn,rewardFcn,isDoneFcn);

Create Neural Network Environment Using Custom Functions

Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numActions = actInfo.Dimension(1);

Create a deterministic transition function based on a deep neural network with two input channels
(current observations and actions) and one output channel (predicted next observation).
% Create network layers.
statePath = featureInputLayer(numObservations,...
Normalization="none",Name="state");
actionPath = featureInputLayer(numActions,...
Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64, Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(numObservations,Name="nextObservation")];

% Combine network layers.


transitionNetwork = layerGraph(statePath);
transitionNetwork = addLayers(transitionNetwork,actionPath);
transitionNetwork = addLayers(transitionNetwork,commonPath);
transitionNetwork = connectLayers(transitionNetwork,"state","concat/in1");
transitionNetwork = connectLayers(transitionNetwork,"action","concat/in2");

% Create dlnetwork object.


transitionNetwork = dlnetwork(transitionNetwork);

% Create transition function object.


transitionFcn = rlContinuousDeterministicTransitionFunction(...
transitionNetwork,obsInfo,actInfo,...
ObservationInputNames="state", ...
ActionInputNames="action", ...
NextObservationOutputNames="nextObservation");

You can define a known reward function for your environment using a custom function. Your custom
reward function must take the observations, actions, and next observations as cell-array inputs and
return a scalar reward value. For this example, use the following custom reward function, which
computes the reward based on the next observation.

type cartPoleRewardFunction.m

function reward = cartPoleRewardFunction(obs,action,nextObs)


% Compute reward value based on the next observation.

if iscell(nextObs)
nextObs = nextObs{1};
end

% Distance at which to fail the episode


xThreshold = 2.4;

% Reward each time step the cart-pole is balanced


rewardForNotFalling = 1;

% Penalty when the cart-pole fails to balance


penaltyForFalling = -5;

x = nextObs(1,:);
distReward = 1 - abs(x)/xThreshold;

isDone = cartPoleIsDoneFunction(obs,action,nextObs);

reward = zeros(size(isDone));
reward(logical(isDone)) = penaltyForFalling;
reward(~logical(isDone)) = ...
0.5 * rewardForNotFalling + 0.5 * distReward(~logical(isDone));


end

You can define a known is-done function for your environment using a custom function. Your custom
is-done function must take the observations, actions, and next observations as cell-array inputs and
return a logical termination signal. For this example, use the following custom is-done function, which
computes the termination signal based on the next observation.

type cartPoleIsDoneFunction.m

function isDone = cartPoleIsDoneFunction(obs,action,nextObs)


% Compute termination signal based on next observation.

if iscell(nextObs)
nextObs = nextObs{1};
end

% Angle at which to fail the episode


thetaThresholdRadians = 12 * pi/180;

% Distance at which to fail the episode


xThreshold = 2.4;

x = nextObs(1,:);
theta = nextObs(3,:);

isDone = abs(x) > xThreshold | abs(theta) > thetaThresholdRadians;


end

Create a neural network environment using the transition function object and the custom reward and
is-done functions.

env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,...
@cartPoleRewardFunction,@cartPoleIsDoneFunction);

Version History
Introduced in R2022a

See Also
Objects
rlMBPOAgent | rlMBPOAgentOptions | rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction

Topics
“Model-Based Policy Optimization Agents”


rlNumericSpec
Create continuous action or observation data specifications for reinforcement learning environments

Description
An rlNumericSpec object specifies continuous action or observation data specifications for
reinforcement learning environments.

Creation

Syntax
spec = rlNumericSpec(dimension)
spec = rlNumericSpec(dimension,Name,Value)

Description

spec = rlNumericSpec(dimension) creates a data specification for continuous actions or


observations and sets the Dimension property.

spec = rlNumericSpec(dimension,Name,Value) sets properties using name-value pair arguments.

Properties
LowerLimit — Lower limit of the data space
-Inf (default) | scalar | matrix

Lower limit of the data space, specified as a scalar or matrix of the same size as the data space. When
LowerLimit is specified as a scalar, rlNumericSpec applies it to all entries in the data space.

UpperLimit — Upper limit of the data space


Inf (default) | scalar | matrix

Upper limit of the data space, specified as a scalar or matrix of the same size as the data space. When
UpperLimit is specified as a scalar, rlNumericSpec applies it to all entries in the data space.
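
For instance, a brief sketch of setting limits at construction time (the dimensions and bounds here are arbitrary):

% 3-by-1 continuous channel with elementwise lower limits and a shared upper limit
spec = rlNumericSpec([3 1], ...
    LowerLimit=[-10;-Inf;0], ...  % matrix the same size as the data space
    UpperLimit=10);               % scalar applies to all entries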

Name — Name of the rlNumericSpec object


string

Name of the rlNumericSpec object, specified as a string.

Description — Description of the rlNumericSpec object


string

Description of the rlNumericSpec object, specified as a string.


Dimension — Dimension of the data space


numeric vector

This property is read-only.

Dimension of the data space, specified as a numeric vector.

DataType — Information about the type of data


"double" (default) | string

This property is read-only.

Information about the type of data, specified as a string, such as "double" or "single".

Object Functions
rlSimulinkEnv Create reinforcement learning environment using dynamic model
implemented in Simulink
rlFunctionEnv Specify custom reinforcement learning environment dynamics
using functions
rlValueFunction Value function approximator object for reinforcement learning
agents
rlQValueFunction Q-Value function approximator object for reinforcement learning
agents
rlVectorQValueFunction Vector Q-value function approximator for reinforcement learning
agents
rlContinuousDeterministicActor Deterministic actor with a continuous action space for
reinforcement learning agents
rlDiscreteCategoricalActor Stochastic categorical actor with a discrete action space for
reinforcement learning agents
rlContinuousGaussianActor Stochastic Gaussian actor with a continuous action space for
reinforcement learning agents

Examples

Create Reinforcement Learning Environment for Simulink Model

For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.

Open the model.


mdl = 'rlSimplePendulumModel';
open_system(mdl)

Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.

The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.
obsInfo = rlNumericSpec([3 1])

obsInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: [0×0 string]
Description: [0×0 string]
Dimension: [3 1]
DataType: "double"

The action is a scalar expressing the torque and can be one of three possible values, -2 Nm, 0 Nm and
2 Nm.
actInfo = rlFiniteSetSpec([-2 0 2])

actInfo =
rlFiniteSetSpec with properties:

Elements: [3×1 double]


Name: [0×0 string]
Description: [0×0 string]
Dimension: [1 1]
DataType: "double"

You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';

Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)

env =
SimulinkEnvWithAgent with properties:

Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on

You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
env.ResetFcn = @(in) setVariable(in,'theta0',randn,'Workspace',mdl)

env =
SimulinkEnvWithAgent with properties:

Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on


Version History
Introduced in R2019a

See Also
rlFiniteSetSpec | rlSimulinkEnv | getActionInfo | getObservationInfo |
rlValueRepresentation | rlQValueRepresentation |
rlDeterministicActorRepresentation | rlStochasticActorRepresentation |
rlFunctionEnv

Topics
“Train DDPG Agent for Adaptive Cruise Control”


rlOptimizerOptions
Optimization options for actors and critics

Description
Use an rlOptimizerOptions object to specify an optimization options set for actors and critics.

Creation
Syntax
optOpts = rlOptimizerOptions
optOpts = rlOptimizerOptions(Name=Value)

Description

optOpts = rlOptimizerOptions creates a default optimizer option set to use as a


CriticOptimizerOptions or ActorOptimizerOptions property of an agent option object, or as
a last argument of rlOptimizer to create an optimizer object. You can modify the object properties
using dot notation.

optOpts = rlOptimizerOptions(Name=Value) creates an options set with the specified


properties using one or more name-value arguments.

Properties
LearnRate — Learning rate used in training the actor or critic function approximator
0.01 (default) | positive scalar

Learning rate used in training the actor or critic function approximator, specified as a positive scalar.
If the learning rate is too low, then training takes a long time. If the learning rate is too high, then
training might reach a suboptimal result or diverge.
Example: LearnRate=0.025

GradientThreshold — Gradient threshold value for the training of the actor or critic
function approximator
Inf (default) | positive scalar

Gradient threshold value used in training the actor or critic function approximator, specified as Inf
or a positive scalar. If the gradient exceeds this value, the gradient is clipped as specified by the
GradientThresholdMethod option. Clipping the gradient limits how much the network parameters
can change in a training iteration.
Example: GradientThreshold=1

GradientThresholdMethod — Gradient threshold method used in training the actor or critic


function approximator
"l2norm" (default) | "global-l2norm" | "absolute-value"


Gradient threshold method used in training the actor or critic function approximator. This is the
specific method used to clip gradient values that exceed the gradient threshold, and it is specified as
one of the following values.

• "l2norm" — If the L2 norm of the gradient of a learnable parameter is larger than


GradientThreshold, then scale the gradient so that the L2 norm equals GradientThreshold.
• "global-l2norm" — If the global L2 norm L is larger than GradientThreshold, then scale all
gradients by a factor of GradientThreshold/L. The global L2 norm considers all learnable
parameters.
• "absolute-value" — If the absolute value of an individual partial derivative in the gradient of a
learnable parameter is larger than GradientThreshold, then scale the partial derivative to have
magnitude equal to GradientThreshold and retain the sign of the partial derivative.

For more information, see “Gradient Clipping” in the Algorithms section of trainingOptions in
Deep Learning Toolbox.
Example: GradientThresholdMethod="absolute-value"

L2RegularizationFactor — Factor for L2 regularization used in training the actor or critic


function approximator
0.0001 (default) | nonnegative scalar

Factor for L2 regularization (weight decay) used in training the actor or critic function approximator,
specified as a nonnegative scalar. For more information, see “L2 Regularization” in the Algorithms
section of trainingOptions in Deep Learning Toolbox.

To avoid overfitting when using a representation with many parameters, consider increasing the
L2RegularizationFactor option.
Example: L2RegularizationFactor=0.0005

Algorithm — Algorithm used for training actor or critic function approximator


"adam" (default) | "sgdm" | "rmsprop"

Algorithm used for training the actor or critic function approximator, specified as one of the following
values.

• "adam" — Use the Adam (adaptive movement estimation) algorithm. You can specify the decay
rates of the gradient and squared gradient moving averages using the GradientDecayFactor
and SquaredGradientDecayFactor fields of the OptimizerParameters option.
• "sgdm" — Use the stochastic gradient descent with momentum (SGDM) algorithm. You can
specify the momentum value using the Momentum field of the OptimizerParameters option.
• "rmsprop" — Use the RMSProp algorithm. You can specify the decay rate of the squared gradient
moving average using the SquaredGradientDecayFactor fields of the OptimizerParameters
option.

For more information about these algorithms, see “Stochastic Gradient Descent” in the Algorithms
section of trainingOptions in Deep Learning Toolbox.
Example: Algorithm="sgdm"

OptimizerParameters — Parameters for the training algorithm used for training the actor
or critic function approximator
OptimizerParameters object


Parameters for the training algorithm used for training the actor or critic function approximator,
specified as an OptimizerParameters object with the following parameters.

Momentum — Contribution of the previous step, specified as a scalar from 0 to 1. A value of 0 means no contribution from the previous step. A value of 1 means maximal contribution. This parameter applies only when Algorithm is "sgdm". In that case, the default value is 0.9. This default value works well for most problems.

Epsilon — Denominator offset, specified as a positive scalar. The optimizer adds this offset to the denominator in the network parameter updates to avoid division by zero. This parameter applies only when Algorithm is "adam" or "rmsprop". In that case, the default value is 1e-8. This default value works well for most problems.

GradientDecayFactor — Decay rate of the gradient moving average, specified as a positive scalar from 0 to 1. This parameter applies only when Algorithm is "adam". In that case, the default value is 0.9. This default value works well for most problems.

SquaredGradientDecayFactor — Decay rate of the squared gradient moving average, specified as a positive scalar from 0 to 1. This parameter applies only when Algorithm is "adam" or "rmsprop". In that case, the default value is 0.999. This default value works well for most problems.

When a particular property of OptimizerParameters is not applicable to the optimizer type


specified in Algorithm, that property is set to "Not applicable".

To change property values, create an rlOptimizerOptions object and use dot notation to access
and change the properties of OptimizerParameters.
optOpts = rlOptimizerOptions;
optOpts.OptimizerParameters.GradientDecayFactor = 0.95;

Object Functions
rlQAgentOptions Options for Q-learning agent
rlSARSAAgentOptions Options for SARSA agent
rlDQNAgentOptions Options for DQN agent
rlPGAgentOptions Options for PG agent
rlDDPGAgentOptions Options for DDPG agent
rlTD3AgentOptions Options for TD3 agent
rlACAgentOptions Options for AC agent
rlPPOAgentOptions Options for PPO agent


rlTRPOAgentOptions Options for TRPO agent
rlSACAgentOptions Options for SAC agent
rlOptimizer Creates an optimizer object for actors and critics

Examples

Create Optimizer Options Object

Use rlOptimizerOptions to create a default optimizer option object to use for the training of a
critic function approximator.

myCriticOpts = rlOptimizerOptions

myCriticOpts =
rlOptimizerOptions with properties:

LearnRate: 0.0100
GradientThreshold: Inf
GradientThresholdMethod: "l2norm"
L2RegularizationFactor: 1.0000e-04
Algorithm: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]

Using dot notation, change the training algorithm to stochastic gradient descent with momentum and
set the value of the momentum parameter to 0.6.

myCriticOpts.Algorithm = "sgdm";
myCriticOpts.OptimizerParameters.Momentum = 0.6;

Create an AC agent option object, and set its CriticOptimizerOptions property to


myCriticOpts.

myAgentOpt = rlACAgentOptions;
myAgentOpt.CriticOptimizerOptions = myCriticOpts;

You can now use myAgentOpt as last input argument to rlACAgent when creating your AC agent.

Create Optimizer Options Object Specifying Property Values

Use rlOptimizerOptions to create an optimizer option object to use for the training of an actor
function approximator. Specify a learning rate of 0.2 and set the GradientThresholdMethod to
"absolute-value".

myActorOpts=rlOptimizerOptions(LearnRate=0.2, ...
GradientThresholdMethod="absolute-value")

myActorOpts =
rlOptimizerOptions with properties:

LearnRate: 0.2000
GradientThreshold: Inf
GradientThresholdMethod: "absolute-value"
L2RegularizationFactor: 1.0000e-04
Algorithm: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]

Using dot notation, change the GradientThreshold to 10.

myActorOpts.GradientThreshold = 10;

Create an AC agent option object and set its ActorOptimizerOptions property to myActorOpts.

myAgentOpt = rlACAgentOptions( ...
    "ActorOptimizerOptions",myActorOpts);

You can now use myAgentOpt as last input argument to rlACAgent when creating your AC agent.

Version History
Introduced in R2022a

See Also
Functions
rlOptimizer

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlPGAgent
Policy gradient reinforcement learning agent

Description
The policy gradient (PG) algorithm is a model-free, online, on-policy reinforcement learning method.
A PG agent is a policy-based reinforcement learning agent that uses the REINFORCE algorithm to
directly compute an optimal policy which maximizes the long-term reward. The action space can be
either discrete or continuous.

For more information on PG agents and the REINFORCE algorithm, see “Policy Gradient Agents”. For
more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
agent = rlPGAgent(observationInfo,actionInfo)
agent = rlPGAgent(observationInfo,actionInfo,initOpts)

agent = rlPGAgent(actor)
agent = rlPGAgent(actor,critic)

agent = rlPGAgent( ___ ,agentOptions)

Description
Create Agent from Observation and Action Specifications

agent = rlPGAgent(observationInfo,actionInfo) creates a policy gradient agent for an


environment with the given observation and action specifications, using default initialization options.
The actor and critic in the agent use default deep neural networks built from the observation
specification observationInfo and the action specification actionInfo. The ObservationInfo
and ActionInfo properties of agent are set to the observationInfo and actionInfo input
arguments, respectively.

agent = rlPGAgent(observationInfo,actionInfo,initOpts) creates a policy gradient


agent for an environment with the given observation and action specifications. The agent uses default
networks in which each hidden fully connected layer has the number of units specified in the
initOpts object. Policy gradient agents do not support recurrent neural networks. For more
information on the initialization options, see rlAgentInitializationOptions.
Create Agent from Actor and Critic

agent = rlPGAgent(actor) creates a PG agent with the specified actor network. By default, the
UseBaseline property of the agent is false in this case.


agent = rlPGAgent(actor,critic) creates a PG agent with the specified actor and critic
networks. By default, the UseBaseline option is true in this case.
Specify Agent Options

agent = rlPGAgent( ___ ,agentOptions) creates a PG agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.

Input Arguments

initOpts — Agent initialization options


rlAgentInitializationOptions object

Agent initialization options, specified as an rlAgentInitializationOptions object. Policy


gradient agents do not support recurrent neural networks.

actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object

Actor that implements the policy, specified as an rlDiscreteCategoricalActor or


rlContinuousGaussianActor function approximator object. For more information on creating
actor approximators, see “Create Policies and Value Functions”.

critic — Baseline critic


rlValueFunction object

Baseline critic that estimates the discounted long-term reward, specified as an rlValueFunction
object. For more information on creating critic approximators, see “Create Policies and Value
Functions”.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data type, and names of the observation
signals.

If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.

You can extract observationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specification


specification object

Action specifications, specified as a reinforcement learning specification object defining properties


such as dimensions, data type, and names of the action signals.

For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.


For a continuous action space, you must specify actionInfo as an rlNumericSpec object.

If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

AgentOptions — Agent options


rlPGAgentOptions object

Agent options, specified as an rlPGAgentOptions object.

UseExplorationPolicy — Option to use exploration policy


true (default) | false

Option to use exploration policy when selecting actions, specified as a one of the following logical
values.

• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.

Note This option affects only simulation and deployment; it does not affect training.
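
For example, a quick sketch (assuming an existing agent agent and environment env) of simulating with the greedy, deterministic policy instead of the exploration policy:

agent.UseExplorationPolicy = false;            % select maximum-likelihood actions
simOpts = rlSimulationOptions(MaxSteps=500);   % arbitrary step limit
experience = sim(env,agent,simOpts);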

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
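
For instance, a minimal sketch of specifying the sample time through the agent options when creating the agent (the value 0.1 is arbitrary, and obsInfo and actInfo are assumed to be existing specification objects as in the examples below):

agentOpts = rlPGAgentOptions(SampleTime=0.1);
agent = rlPGAgent(obsInfo,actInfo,agentOpts);
agent.SampleTime   % matches the value specified in AgentOptions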

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent


setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object

Examples

Create Discrete Policy Gradient Agent from Observation and Action Specifications

Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to the pole).

env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

Obtain observation and action specification objects.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.

rng(0)

Create a policy gradient agent from the environment observation and action specifications.

agent = rlPGAgent(obsInfo,actInfo);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[-2]}

You can now test and train the agent within the environment.

Create Continuous Policy Gradient Agent Using Initialization Options

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.

env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");

Obtain observation and action specifications.


obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256). Policy gradient agents do not
support recurrent networks, so setting the UseRNN option to true generates an error when the agent
is created.

initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);

The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.

rng(0)

Create a policy gradient agent from the environment observation and action specifications.

agent = rlPGAgent(obsInfo,actInfo,initOpts);

Extract the deep neural networks from both the agent actor and critic.

actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));

Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons.

criticNet.Layers

ans =
11x1 Layer array with layers:

1 'concat' Concatenation Concatenation of 2 inputs along dimension 1


2 'relu_body' ReLU ReLU
3 'fc_body' Fully Connected 128 fully connected layer
4 'body_output' ReLU ReLU
5 'input_1' Image Input 50x50x1 images
6 'conv_1' 2-D Convolution 64 3x3x1 convolutions with stride [1 1] and padding
7 'relu_input_1' ReLU ReLU
8 'fc_1' Fully Connected 128 fully connected layer
9 'input_2' Feature Input 1 features
10 'fc_2' Fully Connected 128 fully connected layer
11 'output' Fully Connected 1 fully connected layer

Plot actor and critic networks, and display their number of weights.

plot(layerGraph(actorNet))


summary(actorNet)

Initialized: true

Number of learnables: 18.9M

Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features

plot(layerGraph(criticNet))


summary(criticNet)

Initialized: true

Number of learnables: 18.9M

Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[0.9228]}

You can now test and train the agent within the environment.

Create a Discrete PG Agent from Actor and Baseline Critic

Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Train PG Agent with
Baseline to Control Double Integrator System”. The observation from the environment is a vector
containing the position and velocity of a mass. The action is a scalar representing a force, applied to
the mass, having three possible values (-2, 0, or 2 Newton).
env = rlPredefinedEnv("DoubleIntegrator-Discrete");
obsInfo = getObservationInfo(env)

obsInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"

actInfo = getActionInfo(env)

actInfo =
rlFiniteSetSpec with properties:

Elements: [-2 0 2]
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"

For policy gradient agents, the baseline critic estimates a value function, therefore it must take the
observation signal as input and return a scalar value.

Define the network as an array of layer objects, and get the dimension of the observation space from
the environment specification object.
baselineNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(1)];

Convert to a dlnetwork object and display the number of weights.


baselineNet = dlnetwork(baselineNet);
summary(baselineNet)

Initialized: true

Number of learnables: 257

Inputs:
1 'input' 2 features

Create a critic to use as a baseline. Policy gradient agents use an rlValueFunction object to
implement the critic.
baseline = rlValueFunction(baselineNet,obsInfo);

Check the critic with a random input observation.


getValue(baseline,{rand(obsInfo.Dimension)})

ans = single
-0.1204

To approximate the policy within the actor, use a deep neural network. For policy gradient agents, the
actor executes a stochastic policy, which for discrete action spaces is implemented by a discrete
categorical actor. In this case the network must take the observation signal as input and return a
probability for each action. Therefore the output layer must have as many elements as the number of
possible actions.

Define the network as an array of layer objects, and get the dimension of the observation space and
the number of possible actions from the environment specification objects.

actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];

Convert to a dlnetwork object and display the number of weights.

actorNet = dlnetwork(actorNet);
summary(actorNet)

Initialized: true

Number of learnables: 387

Inputs:
1 'input' 2 features

Create the actor using rlDiscreteCategoricalActor, as well as the observation and action
specifications.

actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);

Check the actor with a random observation input.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-2]}

Create the PG agent using the actor and the baseline critic.

agent = rlPGAgent(actor,baseline)

agent =
rlPGAgent with properties:

AgentOptions: [1x1 rl.option.rlPGAgentOptions]


UseExplorationPolicy: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 1


Specify options for the agent, including training options for the actor and critic.

agent.AgentOptions.UseBaseline = true;
agent.AgentOptions.DiscountFactor = 0.99;

agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;

agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

Check the agent with a random observation input.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-2]}

You can now test and train the agent within the environment.

Create a Continuous PG Agent from Actor and Baseline Critic

Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)

obsInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"

actInfo = getActionInfo(env)

actInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"


In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Store these limits so you can easily access them later.

actInfo.LowerLimit = -2;
actInfo.UpperLimit = 2;

For policy gradient agents, the baseline critic estimates a value function, therefore it must take the
observation signal as input and return a scalar value. To approximate the value function within the
baseline, use a neural network.

Define the network as an array of layer objects, and get the dimensions of the observation space from
the environment specification object.

baselineNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(1)];

Convert to a dlnetwork object and display the number of weights.

baselineNet = dlnetwork(baselineNet);
summary(baselineNet)

Initialized: true

Number of learnables: 257

Inputs:
1 'input' 2 features

Create a critic to use as a baseline. Policy gradient agents use an rlValueFunction object to
implement the critic.

baseline = rlValueFunction(baselineNet,obsInfo);

Check the critic with a random input observation.

getValue(baseline,{rand(obsInfo.Dimension)})

ans = single
-0.1204

To approximate the policy within the actor, use a deep neural network as approximation model. For
policy gradient agents, the actor executes a stochastic policy, which for continuous action spaces is
implemented by a continuous Gaussian actor. In this case the network must take the observation
signal as input and return both a mean value and a standard deviation value for each action.
Therefore it must have two output layers (one for the mean values, the other for the standard
deviation values), each having as many elements as the dimension of the action space.

Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.


Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel.

% Input path
inPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="obs_in")
fullyConnectedLayer(32)
reluLayer(Name="ip_out") ];

% Mean path
meanPath = [
fullyConnectedLayer(16,Name="mp_fc1")
reluLayer
fullyConnectedLayer(1)
tanhLayer(Name="tanh"); % range: -1,1
scalingLayer(Name="mp_out", ...
Scale=actInfo.UpperLimit) ]; % range: -2,2

% Standard deviation path


sdevPath = [
fullyConnectedLayer(16,Name="sp_fc1")
reluLayer
fullyConnectedLayer(1);
softplusLayer(Name="sp_out") ]; % non negative

% Add layers to layerGraph object


actorNet = layerGraph(inPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,sdevPath);

% Connect output of inPath to meanPath input


actorNet = connectLayers(actorNet,"ip_out","mp_fc1/in");
% Connect output of inPath to sdevPath input
actorNet = connectLayers(actorNet,"ip_out","sp_fc1/in");

% plot network
plot(actorNet)


% Convert to dlnetwork object


actorNet = dlnetwork(actorNet);

% Display the number of weights


summary(actorNet)

Initialized: true

Number of learnables: 1.1k

Inputs:
1 'obs_in' 2 features

Create the actor using rlContinuousGaussianActor, together with actorNet, the observation
and action specifications, as well as the names of the network input and output layers.
actor = rlContinuousGaussianActor(actorNet, ...
obsInfo,actInfo, ...
ObservationInputNames="obs_in", ...
ActionMeanOutputNames="mp_out", ...
ActionStandardDeviationOutputNames="sp_out");

Check the actor with a random input observation.


getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0.0963]}


Create the PG agent using the actor and the baseline critic.

agent = rlPGAgent(actor,baseline)

agent =
rlPGAgent with properties:

AgentOptions: [1x1 rl.option.rlPGAgentOptions]


UseExplorationPolicy: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlNumericSpec]
SampleTime: 1

Specify options for the agent, including training options for the actor and critic.

agent.AgentOptions.UseBaseline = true;
agent.AgentOptions.DiscountFactor = 0.99;

agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;

agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

Check your agent with a random input observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[1.1197]}

You can now test and train the agent within the environment.

Create a Discrete PG Agent with Recurrent Neural Networks

For this example, load the environment used in the example “Train PG Agent with Baseline to Control
Double Integrator System”. The observation from the environment is a vector containing the position
and velocity of a mass. The action is a scalar representing a force, applied to the mass, having three
possible values (-2, 0, or 2 Newton).

env = rlPredefinedEnv("DoubleIntegrator-Discrete");

Get observation and specification information.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a critic to use as a baseline. For policy gradient agents, the baseline critic estimates a value
function, therefore it must take the observation signal as input and return a scalar value. To
approximate the value function within the critic, use a recurrent neural network.

Define the network as an array of layer objects, and get the dimension of the observation space from
the environment specification object. To create a recurrent neural network for the critic, use
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.
baselineNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(32)
reluLayer
fullyConnectedLayer(1)];

Convert to a dlnetwork object and display the number of weights.


baselineNet = dlnetwork(baselineNet);
summary(baselineNet)

Initialized: true

Number of learnables: 4.5k

Inputs:
1 'sequenceinput' Sequence input with 2 dimensions

Create the critic based on the network approximator model. Policy gradient agents use an
rlValueFunction object to implement the critic.
baseline = rlValueFunction(baselineNet,obsInfo);

Check the baseline critic with a random input observation.


getValue(baseline,{rand(obsInfo.Dimension)})

ans = single
-0.0065

Since the critic has a recurrent network, the actor must have a recurrent network too. Define a
recurrent neural network for the actor. For policy gradient agents, the actor executes a stochastic
policy, which for discrete action spaces is implemented by a discrete categorical actor. In this case
the network must take the observation signal as input and return a probability for each action.
Therefore the output layer must have as many elements as the number of possible actions.

Define the network as an array of layer objects, and get the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(32)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];

Convert to a dlnetwork object and display the number of weights.


actorNet = dlnetwork(actorNet);
summary(actorNet)

Initialized: true

Number of learnables: 4.5k

Inputs:
1 'sequenceinput' Sequence input with 2 dimensions


Create the actor. Policy gradient agents use stochastic actors, which for discrete action spaces are
implemented by rlDiscreteCategoricalActor objects.

actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);

Check the actor with a random observation input.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0]}

Set some training options for the critic.

baselineOpts = rlOptimizerOptions( ...
    'LearnRate',5e-3,'GradientThreshold',1);

Set some training options for the actor.

actorOpts = rlOptimizerOptions( ...
    'LearnRate',5e-3,'GradientThreshold',1);

Specify agent options, including training options for the actor and the critic.

agentOpts = rlPGAgentOptions(...
'UseBaseline',true, ...
'DiscountFactor', 0.99, ...
'CriticOptimizerOptions',baselineOpts, ...
'ActorOptimizerOptions', actorOpts);

Create a PG agent using the actor, the critic, and the agent options object.

agent = rlPGAgent(actor,baseline,agentOpts);

For PG agents with recurrent neural networks, the training sequence length is the whole episode.

Check the agent with a random observation input.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0]}

You can now test and train the agent within the environment.

Tips
• For continuous action spaces, the rlPGAgent agent does not enforce the constraints set by the
action specification, so you must enforce action space constraints within the environment.

Version History
Introduced in R2019a


See Also
rlAgentInitializationOptions | rlPGAgentOptions | rlQValueFunction |
rlDiscreteCategoricalActor | rlContinuousGaussianActor | Deep Network Designer

Topics
“Policy Gradient Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”


rlPGAgentOptions
Options for PG agent

Description
Use an rlPGAgentOptions object to specify options for policy gradient (PG) agents. To create a PG
agent, use rlPGAgent.

For more information on PG agents, see “Policy Gradient Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation
Syntax
opt = rlPGAgentOptions
opt = rlPGAgentOptions(Name,Value)

Description

opt = rlPGAgentOptions creates an rlPGAgentOptions object for use as an argument when


creating a PG agent using all default settings. You can modify the object properties using dot
notation.

opt = rlPGAgentOptions(Name,Value) sets option properties on page 3-224 using name-value


pairs. For example, rlPGAgentOptions('DiscountFactor',0.95) creates an option set with a
discount factor of 0.95. You can specify multiple name-value pairs. Enclose each property name in
quotes.

Properties
UseBaseline — Use baseline for learning
true (default) | false

Option to use baseline for learning, specified as a logical value. When UseBaseline is true, you
must specify a critic network as the baseline function approximator.

In general, for simpler problems with smaller actor networks, PG agents work better without a
baseline.
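
For such problems, a minimal sketch of disabling the baseline when creating the options object (shown only for illustration):

opt = rlPGAgentOptions(UseBaseline=false);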

EntropyLossWeight — Entropy loss weight


0 (default) | scalar value between 0 and 1

Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.


When gradients are computed during training, an additional gradient component is computed for
minimizing this loss function.
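
For example, the following sketch sets a larger entropy loss weight when creating the options object to encourage more exploration (the value 0.05 is arbitrary and used only for illustration).

opt = rlPGAgentOptions(EntropyLossWeight=0.05);

% You can also change the weight later using dot notation.
opt.EntropyLossWeight = 0.02;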

ActorOptimizerOptions — Actor optimizer options


rlOptimizerOptions object

Actor optimizer options, specified as an rlOptimizerOptions object. It allows you to specify


training parameters of the actor approximator such as learning rate, gradient threshold, as well as
the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and
rlOptimizer.

CriticOptimizerOptions — Critic optimizer options


rlOptimizerOptions object

Critic optimizer options, specified as an rlOptimizerOptions object. It allows you to specify


training parameters of the critic approximator such as learning rate, gradient threshold, as well as
the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and
rlOptimizer.

SampleTime — Sample time of agent


1 (default) | positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

DiscountFactor — Discount factor


0.99 (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.

Object Functions
rlPGAgent Policy gradient reinforcement learning agent

Examples

Create PG Agent Options Object

This example shows how to create and modify a PG agent options object.

Create a PG agent options object, specifying the discount factor.

opt = rlPGAgentOptions('DiscountFactor',0.9)


opt =
rlPGAgentOptions with properties:

UseBaseline: 1
EntropyLossWeight: 0
ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 1
DiscountFactor: 0.9000
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent sample time to 0.5.

opt.SampleTime = 0.5;

Version History
Introduced in R2019a

Simulation and deployment: UseDeterministicExploitation will be removed


Warns starting in R2022a

The property UseDeterministicExploitation of the rlPGAgentOptions object will be removed


in a future release. Use the UseExplorationPolicy property of rlPGAgent instead.

Previously, you set UseDeterministicExploitation as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.AgentOptions.UseDeterministicExploitation = true;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.AgentOptions.UseDeterministicExploitation = false;

Starting in R2022a, set UseExplorationPolicy as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.UseExplorationPolicy = false;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.UseExplorationPolicy = true;

Similarly to UseDeterministicExploitation, UseExplorationPolicy affects only simulation


and deployment; it does not affect training.


See Also
Topics
“Policy Gradient Agents”


rlPPOAgent
Proximal policy optimization reinforcement learning agent

Description
Proximal policy optimization (PPO) is a model-free, online, on-policy, policy gradient reinforcement
learning method. This algorithm alternates between sampling data through environmental interaction
and optimizing a clipped surrogate objective function using stochastic gradient descent. The action
space can be either discrete or continuous.

For more information on PPO agents, see “Proximal Policy Optimization Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.

Creation

Syntax
agent = rlPPOAgent(observationInfo,actionInfo)
agent = rlPPOAgent(observationInfo,actionInfo,initOpts)

agent = rlPPOAgent(actor,critic)

agent = rlPPOAgent( ___ ,agentOptions)

Description

Create Agent from Observation and Action Specifications

agent = rlPPOAgent(observationInfo,actionInfo) creates a proximal policy optimization


(PPO) agent for an environment with the given observation and action specifications, using default
initialization options. The actor and critic in the agent use default deep neural networks built from
the observation specification observationInfo and the action specification actionInfo. The
ObservationInfo and ActionInfo properties of agent are set to the observationInfo and
actionInfo input arguments, respectively.

agent = rlPPOAgent(observationInfo,actionInfo,initOpts) creates a PPO agent for an


environment with the given observation and action specifications. The agent uses default networks
configured using options specified in the initOpts object. To create default networks that use
recurrent neural networks, set the UseRNN option in initOpts. For more information on the initialization options, see
rlAgentInitializationOptions.

Create Agent from Actor and Critic

agent = rlPPOAgent(actor,critic) creates a PPO agent with the specified actor and critic,
using the default options for the agent.


Specify Agent Options

agent = rlPPOAgent( ___ ,agentOptions) creates a PPO agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.

Input Arguments

initOpts — Agent initialization options


rlAgentInitializationOptions object

Agent initialization options, specified as an rlAgentInitializationOptions object.

actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object

Actor that implements the policy, specified as an rlDiscreteCategoricalActor or


rlContinuousGaussianActor function approximator object. For more information on creating
actor approximators, see “Create Policies and Value Functions”.

critic — Critic
rlValueFunction object

Critic that estimates the discounted long-term reward, specified as an rlValueFunction object. For
more information on creating critic approximators, see “Create Policies and Value Functions”.

Your critic can use a recurrent neural network as its function approximator. In this case, your actor
must also use a recurrent neural network. For an example, see “Create PPO Agent with Recurrent
Neural Networks” on page 3-240.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data type, and names of the observation
signals.

If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.

You can extract observationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specification


specification object

Action specifications, specified as a reinforcement learning specification object defining properties


such as dimensions, data type, and names of the action signals.

For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.

For a continuous action space, you must specify actionInfo as an rlNumericSpec object.


If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

AgentOptions — Agent options


rlPPOAgentOptions object

Agent options, specified as an rlPPOAgentOptions object.

UseExplorationPolicy — Option to use exploration policy


true (default) | false

Option to use exploration policy when selecting actions, specified as one of the following logical
values.

• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.

Note This option affects only simulation and deployment; it does not affect training.

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent


generatePolicyFunction Generate function that evaluates policy of an agent or policy object

Examples

Create Discrete PPO Agent from Observation and Action Specifications

Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).

env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

Obtain observation and action specifications from the environment.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.

rng(0)

Create a PPO agent from the environment observation and action specifications.

agent = rlPPOAgent(obsInfo,actInfo);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[-2]}

You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.

Create Continuous PPO Agent Using Initialization Options

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.

env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);


Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).

initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);

The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.

rng(0)

Create a PPO actor-critic agent from the environment observation and action specifications.

agent = rlPPOAgent(obsInfo,actInfo,initOpts);

Extract the deep neural networks from both the agent actor and critic.

actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));

Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons.

criticNet.Layers

ans =
11x1 Layer array with layers:

1 'concat' Concatenation Concatenation of 2 inputs along dimension 1


2 'relu_body' ReLU ReLU
3 'fc_body' Fully Connected 128 fully connected layer
4 'body_output' ReLU ReLU
5 'input_1' Image Input 50x50x1 images
6 'conv_1' 2-D Convolution 64 3x3x1 convolutions with stride [1 1] and padding
7 'relu_input_1' ReLU ReLU
8 'fc_1' Fully Connected 128 fully connected layer
9 'input_2' Feature Input 1 features
10 'fc_2' Fully Connected 128 fully connected layer
11 'output' Fully Connected 1 fully connected layer

Plot the actor and critic networks.

plot(layerGraph(actorNet))


plot(layerGraph(criticNet))


To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[0.9228]}

You can now test and train the agent within the environment.

Create Proximal Policy Optimization Agent

Create an environment interface, and obtain its observation and action specifications.

env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

For PPO agents, the critic estimates a value function, therefore it must take the observation signal as
input and return a scalar value. Create a deep neural network to be used as approximation model
within the critic. Define the network as an array of layer objects.

criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)


reluLayer
fullyConnectedLayer(1)
];

Convert to a dlnetwork object and display the number of parameters.


criticNet = dlnetwork(criticNet);
summary(criticNet)

Initialized: true

Number of learnables: 601

Inputs:
1 'input' 4 features

Create the critic using criticNet. PPO agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(criticNet,obsInfo);

Check the critic with a random observation input.


getValue(critic,{rand(obsInfo.Dimension)})

ans = single
-0.2479

To approximate the policy within the actor, use a neural network. For PPO agents, the actor executes
a stochastic policy, which for discrete action spaces is implemented by a discrete categorical actor. In
this case the approximator must take the observation signal as input and return a probability for each
action. Therefore the output layer must have as many elements as the number of possible actions.

Define the network as an array of layer objects, getting the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(200)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];

Convert to a dlnetwork object and display the number of parameters.


actorNet = dlnetwork(actorNet);
summary(actorNet)

Initialized: true

Number of learnables: 1.4k

Inputs:
1 'input' 4 features

Create the actor using actorNet. PPO agents use an rlDiscreteCategoricalActor object to
implement the actor for discrete action spaces.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);


Check the actor with a random observation input.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-10]}

Create a PPO agent using the actor and the critic.

agent = rlPPOAgent(actor,critic)

agent =
rlPPOAgent with properties:

AgentOptions: [1x1 rl.option.rlPPOAgentOptions]


UseExplorationPolicy: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 1

Specify agent options, including training options for the actor and the critic.

agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;

agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-10]}

You can now test and train the agent against the environment.

Create Continuous PPO Agent

Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”. The observation from the
environment is a vector containing the position and velocity of a mass. The action is a scalar
representing a force, applied to the mass, ranging continuously from -2 to 2 Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)

obsInfo =
rlNumericSpec with properties:


LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"

actInfo = getActionInfo(env)

actInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"

In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Store these limits so you can easily access them later.
actInfo.LowerLimit=-2;
actInfo.UpperLimit=2;

The actor and critic networks are initialized randomly. Ensure reproducibility by fixing the seed of the
random generator.
rng(0)

For PPO agents, the critic estimates a value function, therefore it must take the observation signal as
input and return a scalar value. To approximate the value function within the critic, use a neural
network. Define the network as an array of layer objects.
criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)];

Convert to a dlnetwork object and display the number of parameters.


criticNet = dlnetwork(criticNet);
summary(criticNet)

Initialized: true

Number of learnables: 401

Inputs:
1 'input' 2 features

Create the critic using criticNet. PPO agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(criticNet,obsInfo);


Check the critic with a random observation input.

getValue(critic,{rand(obsInfo.Dimension)})

ans = single
-0.0899

To approximate the policy within the actor, use a neural network. For PPO agents, the actor executes
a stochastic policy, which for continuous action spaces is implemented by a continuous Gaussian
actor. In this case the network must take the observation signal as input and return both a mean
value and a standard deviation value for each action. Therefore it must have two output layers (one
for the mean values, the other for the standard deviation values), each having as many elements as
the dimension of the action space.

Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.

Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces, and the action range limits from the environment specification objects. Specify a name
for the input and output layers, so you can later explicitly associate them with the appropriate
environment channel.

% Define common input path layer


commonPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="comPathIn")
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1,Name="comPathOut") ];

% Define mean value path


meanPath = [
fullyConnectedLayer(15,Name="meanPathIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension));
tanhLayer;
scalingLayer(Name="meanPathOut",Scale=actInfo.UpperLimit) ];

% Define standard deviation path


sdevPath = [
fullyConnectedLayer(15,Name="stdPathIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension));
softplusLayer(Name="stdPathOut") ];

% Add layers to layerGraph object


actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,sdevPath);

% Connect paths
actorNet = connectLayers(actorNet,"comPathOut","meanPathIn/in");
actorNet = connectLayers(actorNet,"comPathOut","stdPathIn/in");

% Plot network
plot(actorNet)


% Convert to dlnetwork and display number of weights


actorNet = dlnetwork(actorNet);
summary(actorNet)

Initialized: true

Number of learnables: 493

Inputs:
1 'comPathIn' 2 features

Create the actor using actorNet. PPO agents use an rlContinuousGaussianActor object to
implement the actor for continuous action spaces.

actor = rlContinuousGaussianActor(actorNet, obsInfo, actInfo, ...
    'ActionMeanOutputNames',"meanPathOut",...
    'ActionStandardDeviationOutputNames',"stdPathOut",...
    'ObservationInputNames',"comPathIn");

Check the actor with a random observation input.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-0.2267]}

Create a PPO agent using the actor and the critic.


agent = rlPPOAgent(actor,critic)

agent =
rlPPOAgent with properties:

AgentOptions: [1x1 rl.option.rlPPOAgentOptions]


UseExplorationPolicy: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlNumericSpec]
SampleTime: 1

Specify agent options, including training options for the actor and the critic.
agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;

agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;


To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0.2719]}

You can now test and train the agent within the environment.

Create PPO Agent with Recurrent Neural Networks

For this example, load the predefined environment used for the “Train DQN Agent to Balance Cart-
Pole System” example.
env = rlPredefinedEnv("CartPole-Discrete");

Get observation and action information. This environment has a continuous four-dimensional
observation space (the positions and velocities of both cart and pole) and a discrete one-dimensional
action space consisting of the application of two possible forces, -10N or 10N.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

For PPO agents, the critic estimates a value function, therefore it must take the observation signal as
input and return a scalar value. To approximate the value function within the critic, use a neural
network.

Define the network as an array of layer objects, and get the dimensions of the observation space from
the environment specification object. To create a recurrent neural network, use a


sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.
criticNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(8)
reluLayer
lstmLayer(8)
fullyConnectedLayer(1)];

Convert to a dlnetwork object and display the number of learnable parameters.


criticNet = dlnetwork(criticNet);
summary(criticNet)

Initialized: true

Number of learnables: 593

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions

Create the critic using criticNet. PPO agents use an rlValueFunction object to implement
the critic.
critic = rlValueFunction(criticNet,obsInfo);

Check the critic with a random observation input.


getValue(critic,{rand(obsInfo.Dimension)})

ans = single
0.0017

Since the critic has a recurrent network, the actor must have a recurrent network too. For PPO
agents, the actor executes a stochastic policy, which for discrete action spaces is implemented by a
discrete categorical actor. In this case the network must take the observation signal as input and
return a probability for each action. Therefore the output layer must have as many elements as the
number of possible actions.

Define the network as an array of layer objects, and get the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
lstmLayer(8)
fullyConnectedLayer(numel(actInfo.Elements))
softmaxLayer
];

Convert the network to a dlnetwork object and display the number of learnable parameters.
actorNet = dlnetwork(actorNet);
summary(actorNet)

Initialized: true


Number of learnables: 4k

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions

Create the actor using actorNet. PPO agents use an rlDiscreteCategoricalActor object
to implement the actor for discrete action spaces.

actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);

Check the actor with a random observation input.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-10]}

Set some training options for the critic.

criticOptions = rlOptimizerOptions( ...
    LearnRate=1e-2, ...
    GradientThreshold=1);

Set some training options for the actor.

actorOptions = rlOptimizerOptions( ...
    LearnRate=1e-3, ...
    GradientThreshold=1);

Create the agent options object.

agentOptions = rlPPOAgentOptions(...
AdvantageEstimateMethod="finite-horizon", ...
ClipFactor=0.1, ...
CriticOptimizerOptions=criticOptions, ...
ActorOptimizerOptions=actorOptions);

When recurrent neural networks are used, the MiniBatchSize property is the length of the learning
trajectory.

agentOptions.MiniBatchSize

ans = 128

Create the agent using the actor and critic, as well as the agent options object.

agent = rlPPOAgent(actor,critic,agentOptions);

Check your agent with a random observation input.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-10]}


Tips
• For continuous action spaces, this agent does not enforce the constraints set by the action
specification. In this case, you must enforce action space constraints within the environment.
• While tuning the learning rate of the actor network is necessary for PPO agents, it is not
necessary for TRPO agents.

Version History
Introduced in R2019b

See Also
rlAgentInitializationOptions | rlPPOAgentOptions | rlValueFunction |
rlDiscreteCategoricalActor | rlContinuousGaussianActor | Deep Network Designer

Topics
“Proximal Policy Optimization Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”


rlPPOAgentOptions
Options for PPO agent

Description
Use an rlPPOAgentOptions object to specify options for proximal policy optimization (PPO) agents.
To create a PPO agent, use rlPPOAgent.

For more information on PPO agents, see “Proximal Policy Optimization Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
opt = rlPPOAgentOptions
opt = rlPPOAgentOptions(Name,Value)

Description

opt = rlPPOAgentOptions creates an rlPPOAgentOptions object for use as an argument when


creating a PPO agent using all default settings. You can modify the object properties using dot
notation.

opt = rlPPOAgentOptions(Name,Value) sets option properties on page 3-244 using name-value


arguments. For example, rlPPOAgentOptions('DiscountFactor',0.95) creates an option set
with a discount factor of 0.95. You can specify multiple name-value arguments. Enclose each
property name in quotes.

Properties
ExperienceHorizon — Number of steps the agent interacts with the environment before
learning
512 (default) | positive integer

Number of steps the agent interacts with the environment before learning from its experience,
specified as a positive integer.

The ExperienceHorizon value must be greater than or equal to the MiniBatchSize value.

MiniBatchSize — Mini-batch size


128 (default) | positive integer

Mini-batch size used for each learning epoch, specified as a positive integer. When the agent uses a
recurrent neural network, MiniBatchSize is treated as the training trajectory length.


The MiniBatchSize value must be less than or equal to the ExperienceHorizon value.
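For example, a minimal sketch of setting both options consistently (the specific values are arbitrary and used only for illustration):

% MiniBatchSize must not exceed ExperienceHorizon
opt = rlPPOAgentOptions(ExperienceHorizon=512,MiniBatchSize=64);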

ClipFactor — Clip factor


0.2 (default) | positive scalar less than 1

Clip factor for limiting the change in each policy update step, specified as a positive scalar less than
1.

EntropyLossWeight — Entropy loss weight


0.01 (default) | scalar value between 0 and 1

Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.

When gradients are computed during training, an additional gradient component is computed for
minimizing this loss function. For more information, see “Entropy Loss”.

NumEpoch — Number of epochs


3 (default) | positive integer

Number of epochs for which the actor and critic networks learn from the current experience set,
specified as a positive integer.

AdvantageEstimateMethod — Method for estimating advantage values


"gae" (default) | "finite-horizon"

Method for estimating advantage values, specified as one of the following:

• "gae" — Generalized advantage estimator


• "finite-horizon" — Finite horizon estimation

For more information on these methods, see the training algorithm information in “Proximal Policy
Optimization Agents”.

GAEFactor — Smoothing factor for generalized advantage estimator


0.95 (default) | scalar value between 0 and 1

Smoothing factor for generalized advantage estimator, specified as a scalar value between 0 and 1,
inclusive. This option applies only when the AdvantageEstimateMethod option is "gae"

NormalizedAdvantageMethod — Method for normalizing advantage function


"none" (default) | "current | "moving"

Method for normalizing advantage function values, specified as one of the following:

• "none" — Do not normalize advantage values


• "current" — Normalize the advantage function using the mean and standard deviation for the
current mini-batch of experiences.
• "moving" — Normalize the advantage function using the mean and standard deviation for a
moving window of recent experiences. To specify the window size, set the
AdvantageNormalizingWindow option.


In some environments, you can improve agent performance by normalizing the advantage function
during training. The agent normalizes the advantage function by subtracting the mean advantage
value and scaling by the standard deviation.

AdvantageNormalizingWindow — Window size for normalizing advantage function


1e6 (default) | positive integer

Window size for normalizing advantage function values, specified as a positive integer. Use this
option when the NormalizedAdvantageMethod option is "moving".

ActorOptimizerOptions — Actor optimizer options


rlOptimizerOptions object

Actor optimizer options, specified as an rlOptimizerOptions object. It allows you to specify


training parameters of the actor approximator such as learning rate, gradient threshold, as well as
the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and
rlOptimizer.

CriticOptimizerOptions — Critic optimizer options


rlOptimizerOptions object

Critic optimizer options, specified as an rlOptimizerOptions object. It allows you to specify


training parameters of the critic approximator such as learning rate, gradient threshold, as well as
the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and
rlOptimizer.

SampleTime — Sample time of agent


1 (default) | positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

DiscountFactor — Discount factor


0.99 (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.

Object Functions
rlPPOAgent Proximal policy optimization reinforcement learning agent

Examples


Create PPO Agent Options Object

Create a PPO agent options object, specifying the experience horizon.

opt = rlPPOAgentOptions('ExperienceHorizon',256)

opt =
rlPPOAgentOptions with properties:

ExperienceHorizon: 256
MiniBatchSize: 128
ClipFactor: 0.2000
EntropyLossWeight: 0.0100
NumEpoch: 3
AdvantageEstimateMethod: "gae"
GAEFactor: 0.9500
NormalizedAdvantageMethod: "none"
AdvantageNormalizingWindow: 1000000
ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 1
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent sample time to 0.5.

opt.SampleTime = 0.5;

Version History
Introduced in R2019b

Simulation and deployment: UseDeterministicExploitation will be removed


Warns starting in R2022a

The property UseDeterministicExploitation of the rlPPOAgentOptions object will be


removed in a future release. Use the UseExplorationPolicy property of rlPPOAgent instead.

Previously, you set UseDeterministicExploitation as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.AgentOptions.UseDeterministicExploitation = true;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.AgentOptions.UseDeterministicExploitation = false;

Starting in R2022a, set UseExplorationPolicy as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.UseExplorationPolicy = false;


• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.UseExplorationPolicy = true;

Similarly to UseDeterministicExploitation, UseExplorationPolicy affects only simulation


and deployment; it does not affect training.

See Also
Topics
“Proximal Policy Optimization Agents”


rlPrioritizedReplayMemory
Replay memory experience buffer with prioritized sampling

Description
An off-policy reinforcement learning agent stores experiences in a circular experience buffer. During
training, the agent samples mini-batches of experiences from the buffer and uses these mini-batches
to update its actor and critic function approximators.

By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object
as their experience buffer. Agents uniformly sample data from this buffer. To perform nonuniform
prioritized sampling [1], which can improve sample efficiency when training your agent, use an
rlPrioritizedReplayMemory object. For more information on prioritized sampling, see
“Algorithms” on page 3-251.

Creation
Syntax
buffer = rlPrioritizedReplayMemory(obsInfo,actInfo)
buffer = rlPrioritizedReplayMemory(obsInfo,actInfo,maxLength)

Description

buffer = rlPrioritizedReplayMemory(obsInfo,actInfo) creates a prioritized replay


memory experience buffer that is compatible with the observation and action specifications in
obsInfo and actInfo, respectively.

buffer = rlPrioritizedReplayMemory(obsInfo,actInfo,maxLength) sets the maximum


length of the buffer by setting the MaxLength property.

Input Arguments

obsInfo — Observation specifications


specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the
observation signals.

You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

actInfo — Action specifications


specification object | array of specification objects

Action specifications, specified as a reinforcement learning specification object defining properties


such as dimensions, data types, and names of the action signals.


You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.

Properties
MaxLength — Maximum buffer length
10000 (default) | positive integer

This property is read-only.

Maximum buffer length, specified as a positive integer.

Length — Number of experiences in buffer


0 (default) | nonnegative integer

This property is read-only.

Number of experiences in buffer, specified as a nonnegative integer.

PriorityExponent — Priority exponent


0.6 (default) | nonnegative scalar less than or equal to 1

Priority exponent to control the impact of prioritization during probability computation, specified as a
nonnegative scalar less than or equal to 1.

If the priority exponent is zero, the agent uses uniform sampling.

InitialImportanceSamplingExponent — Initial value of importance sampling exponent


0.4 (default) | nonnegative scalar less than or equal to 1

Initial value of the importance sampling exponent, specified as a nonnegative scalar less than or
equal to 1.

NumAnnealingSteps — Number of annealing steps


1000000 (default) | positive integer

Number of annealing steps for updating the importance sampling exponent, specified as a positive
integer.

ImportanceSamplingExponent — Current value of importance sampling exponent


0.4 (default) | nonnegative scalar less than or equal to 1

This property is read-only.

Current value of the importance sampling exponent, specified as a nonnegative scalar less than or
equal to 1.

During training, ImportanceSamplingExponent is linearly increased from


InitialImportanceSamplingExponent to 1 over NumAnnealingSteps steps.

Object Functions
append Append experiences to replay memory buffer


sample Sample experiences from replay memory buffer


resize Resize replay memory experience buffer
allExperiences Return all experiences in replay memory buffer
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
reset Reset environment, agent, experience buffer, or policy object

Examples

Create DQN Agent With Prioritized Replay Memory

Create an environment for training the agent. For this example, load a predefined environment.
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

Extract the observation and action specifications from the agent.


obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a DQN agent from the environment specifications.


agent = rlDQNAgent(obsInfo,actInfo);

By default, the agent uses a replay memory experience buffer with uniform sampling.

Replace the default experience buffer with a prioritized replay memory buffer.
agent.ExperienceBuffer = rlPrioritizedReplayMemory(obsInfo,actInfo);

Configure the prioritized replay memory options. For example, set the initial importance sampling
exponent to 0.5 and the number of annealing steps for updating the exponent during training to 1e4.
agent.ExperienceBuffer.NumAnnealingSteps = 1e4;
agent.ExperienceBuffer.PriorityExponent = 0.5;
agent.ExperienceBuffer.InitialImportanceSamplingExponent = 0.5;

Limitations
• Prioritized experience replay does not support agents that use recurrent neural networks.

Algorithms
Prioritized replay memory samples experiences according to experience priorities. For a given
experience, the priority is defined as the absolute value of the associated temporal difference (TD)
error. A larger TD error indicates that the critic network is not well-trained for the corresponding
experience. Therefore, sampling such experiences during critic updates can help efficiently improve
the critic performance, which often improves the sample efficiency of agent training.

When using prioritized replay memory, agents use the following process when sampling a mini-batch
of experiences and updating a critic.


1   Compute the sampling probability P for each experience in the buffer based on the experience
    priority.

    P(j) = p_j^α / ∑_{i=1}^{N} p_i^α

    Here:

    • N is the number of experiences in the replay memory buffer.
    • p is the experience priority.
    • α is a priority exponent. To set α, use the PriorityExponent parameter.
2   Sample a mini-batch of experiences according to the computed probabilities.
3   Compute the importance sampling weights (w) for the sampled experiences.

    w′(j) = (N ⋅ P(j))^(−β)

    w(j) = w′(j) / max_{i ∈ mini-batch} w′(i)

    Here, β is the importance sampling exponent. The ImportanceSamplingExponent parameter
    contains the current value of β. To control β, set the ImportanceSamplingExponent and
    NumAnnealingSteps parameters.
4   Compute the weighted loss using the importance sampling weights w and the TD error δ to
    update a critic.
5   Update the priorities of the sampled experiences based on the TD error.

    p_j = |δ_j|
6   Update the importance sampling exponent β by linearly annealing the exponent value until it
    reaches 1.

    β ← β + (1 − β₀) / N_S

    Here:

    • β₀ is the initial importance sampling exponent. To specify β₀, use the
      InitialImportanceSamplingExponent parameter.
    • N_S is the number of annealing steps. To specify N_S, use the NumAnnealingSteps parameter.
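
As an illustration of steps 1–3, the following sketch computes sampling probabilities and importance sampling weights for a toy buffer of four priorities. The priority values, exponents, and mini-batch size are arbitrary and are not taken from any agent.

p = [0.5 1.2 0.1 2.0];               % priorities (|TD error|) of four stored experiences
alpha = 0.6;                         % priority exponent (PriorityExponent)
beta = 0.4;                          % current importance sampling exponent
N = numel(p);

P = p.^alpha/sum(p.^alpha);          % step 1: sampling probability of each experience

cdf = cumsum(P);                     % step 2: sample a mini-batch of two indices
idx = arrayfun(@(r) find(cdf >= r,1), rand(1,2));

wPrime = (N*P(idx)).^(-beta);        % step 3: unnormalized importance sampling weights
w = wPrime/max(wPrime);              % normalize by the largest weight in the mini-batch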

Version History
Introduced in R2022b

References
[1] Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. 'Prioritized experience replay'.
arXiv:1511.05952 [Cs] 25 February 2016. https://arxiv.org/abs/1511.05952.


See Also
rlReplayMemory


rlQAgent
Q-learning reinforcement learning agent

Description
The Q-learning algorithm is a model-free, online, off-policy reinforcement learning method. A Q-
learning agent is a value-based reinforcement learning agent which trains a critic to estimate the
return or future rewards.

For more information on Q-learning agents, see “Q-Learning Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
agent = rlQAgent(critic,agentOptions)

Description

agent = rlQAgent(critic,agentOptions) creates a Q-learning agent with the specified critic


network and sets the AgentOptions property.

Input Arguments

critic — Critic
rlQValueFunction object

Critic, specified as an rlQValueFunction object. For more information on creating critics, see
“Create Policies and Value Functions”.

Properties
AgentOptions — Agent options
rlQAgentOptions object

Agent options, specified as an rlQAgentOptions object.

UseExplorationPolicy — Option to use exploration policy


false (default) | true

Option to use exploration policy when selecting actions, specified as one of the following logical
values.

• false — Use the agent greedy policy when selecting actions.


• true — Use the agent exploration policy when selecting actions.

ObservationInfo — Observation specifications


specification object

This property is read-only.

Observation specifications, specified as a reinforcement learning specification object defining


properties such as dimensions, data type, and name of the observation signal.

The value of ObservationInfo matches the corresponding value specified in critic.

ActionInfo — Action specification


rlFiniteSetSpec object

This property is read-only.

Action specification, specified as an rlFiniteSetSpec object.

The value of ActionInfo matches the corresponding value specified in critic.

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object

Examples


Create a Q-Learning Agent

Create an environment interface. For this example, use the same environment as in the example
“Train Reinforcement Learning Agent in Basic Grid World”.

env = rlPredefinedEnv("BasicGridWorld");

Get observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a table approximation model derived from the environment observation and action
specifications.

qTable = rlTable(obsInfo,actInfo);

Create the critic using qTable. Q-learning agents use an rlQValueFunction object to implement the critic.

critic = rlQValueFunction(qTable,obsInfo,actInfo);

Create a Q-learning agent using the specified critic and an epsilon value of 0.05.

opt = rlQAgentOptions;
opt.EpsilonGreedyExploration.Epsilon = 0.05;

agent = rlQAgent(critic,opt)

agent =
rlQAgent with properties:

AgentOptions: [1x1 rl.option.rlQAgentOptions]


UseExplorationPolicy: 0
ObservationInfo: [1x1 rl.util.rlFiniteSetSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 1

To check your agent, use getAction to return the action from a random observation.

act = getAction(agent,{randi(numel(obsInfo.Elements))});
act{1}

ans = 1

You can now test and train the agent against the environment.

Version History
Introduced in R2019a

See Also
Functions
rlQAgentOptions | rlQValueFunction


Topics
“Q-Learning Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”


rlQAgentOptions
Options for Q-learning agent

Description
Use an rlQAgentOptions object to specify options for creating Q-learning agents. To create a Q-
learning agent, use rlQAgent.

For more information on Q-learning agents, see “Q-Learning Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
opt = rlQAgentOptions
opt = rlQAgentOptions(Name,Value)

Description

opt = rlQAgentOptions creates an rlQAgentOptions object for use as an argument when


creating a Q-learning agent using all default settings. You can modify the object properties using dot
notation.

opt = rlQAgentOptions(Name,Value) sets option properties on page 3-258 using name-value


pairs. For example, rlQAgentOptions('DiscountFactor',0.95) creates an option set with a
discount factor of 0.95. You can specify multiple name-value pairs. Enclose each property name in
quotes.

Properties
EpsilonGreedyExploration — Options for epsilon-greedy exploration
EpsilonGreedyExploration object

Options for epsilon-greedy exploration, specified as an EpsilonGreedyExploration object with the


following properties.


Property            Description                                                     Default Value
Epsilon             Probability threshold to either randomly select an action or    1
                    select the action that maximizes the state-action value
                    function. A larger value of Epsilon means that the agent
                    randomly explores the action space at a higher rate.
EpsilonMin          Minimum value of Epsilon                                         0.01
EpsilonDecay        Decay rate                                                       0.0050

At the end of each training time step, if Epsilon is greater than EpsilonMin, then it is updated
using the following formula.

Epsilon = Epsilon*(1-EpsilonDecay)

If your agent converges on local optima too quickly, you can promote agent exploration by increasing
Epsilon.

To specify exploration options, use dot notation after creating the rlQAgentOptions object opt. For
example, set the epsilon value to 0.9.
opt.EpsilonGreedyExploration.Epsilon = 0.9;
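
The following sketch simulates the decay schedule outside of any agent, which can help you choose EpsilonDecay (the number of steps and the plotted schedule are illustrative only).

epsilon = 0.9;                        % value of Epsilon set above
epsilonMin = opt.EpsilonGreedyExploration.EpsilonMin;
epsilonDecay = opt.EpsilonGreedyExploration.EpsilonDecay;

schedule = zeros(1,1000);
for k = 1:1000
    schedule(k) = epsilon;
    if epsilon > epsilonMin
        epsilon = epsilon*(1 - epsilonDecay);   % decay rule applied at each training step
    end
end
plot(schedule)                        % visualize how quickly exploration decreases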

CriticOptimizerOptions — Critic optimizer options


rlOptimizerOptions object

Critic optimizer options, specified as an rlOptimizerOptions object. It allows you to specify


training parameters of the critic approximator such as learning rate, gradient threshold, as well as
the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and
rlOptimizer.

SampleTime — Sample time of agent


1 (default) | positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

DiscountFactor — Discount factor


0.99 (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.


Object Functions
rlQAgent Q-learning reinforcement learning agent

Examples

Create Q-Learning Agent Options Object

This example shows how to create an options object for a Q-Learning agent.

Create an rlQAgentOptions object that specifies the agent sample time.

opt = rlQAgentOptions('SampleTime',0.5)

opt =
rlQAgentOptions with properties:

EpsilonGreedyExploration: [1x1 rl.option.EpsilonGreedyExploration]


CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 0.5000
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent discount factor to 0.95.

opt.DiscountFactor = 0.95;

Version History
Introduced in R2019a

See Also
Topics
“Q-Learning Agents”


rlQValueFunction
Q-Value function approximator object for reinforcement learning agents

Description
This object implements a Q-value function approximator that you can use as a critic for a
reinforcement learning agent. A Q-value function maps an environment state-action pair to a scalar
value representing the predicted discounted cumulative long-term reward when the agent starts from
the given state and executes the given action. A Q-value function critic therefore needs both the
environment state and an action as inputs. After you create an rlQValueFunction critic, use it to
create an agent such as rlQAgent, rlDQNAgent, rlSARSAAgent, rlDDPGAgent, or rlTD3Agent.
For more information on creating representations, see “Create Policies and Value Functions”.

Creation

Syntax
critic = rlQValueFunction(net,observationInfo,actionInfo)

critic = rlQValueFunction(tab,observationInfo,actionInfo)

critic = rlQValueFunction({basisFcn,W0},observationInfo,actionInfo)

critic = rlQValueFunction( ___ ,Name=Value)

Description

critic = rlQValueFunction(net,observationInfo,actionInfo) creates the Q-value


function object critic. Here, net is the deep neural network used as an approximator, and it must
have both observation and action as inputs and a single scalar output. The network input layers are
automatically associated with the environment observation and action channels according to the
dimension specifications in observationInfo and actionInfo. This function sets the
ObservationInfo and ActionInfo properties of critic to the observationInfo and
actionInfo input arguments, respectively.

critic = rlQValueFunction(tab,observationInfo,actionInfo) creates the Q-value


function object critic with discrete action and observation spaces from the Q-value table tab. tab
is an rlTable object containing a table with as many rows as the number of possible observations and
as many columns as the number of possible actions. The function sets the ObservationInfo and
ActionInfo properties of critic respectively to the observationInfo and actionInfo input
arguments, which in this case must be scalar rlFiniteSetSpec objects.

critic = rlQValueFunction({basisFcn,W0},observationInfo,actionInfo) creates a Q-


value function object critic using a custom basis function as underlying approximator. The first
input argument is a two-element cell array whose first element is the handle basisFcn to a custom
basis function and whose second element is the initial weight vector W0. Here the basis function must
have both observation and action as inputs and W0 must be a column vector. The function sets the


ObservationInfo and ActionInfo properties of critic to the observationInfo and


actionInfo input arguments, respectively.

critic = rlQValueFunction( ___ ,Name=Value) specifies one or more name-value arguments.


You can specify the input and output layer names (to mandate their association with the environment
observation and action channels) for deep neural network approximators. For all types of
approximators, you can specify the computation device, for example UseDevice="gpu".

Input Arguments

net — Deep neural network


array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object |
dlNetwork object (preferred)

Deep neural network used as the underlying approximator within the critic, specified as one of the
following:

• Array of Layer objects


• layerGraph object
• DAGNetwork object
• SeriesNetwork object
• dlnetwork object

The network must have both the environment observation and action as inputs and a single scalar as
output.

Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.

rlQValueFunction objects support recurrent deep neural networks.

The learnable parameters of the critic are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.

tab — Q-value table


rlTable object

Q-value table, specified as an rlTable object containing an array with as many rows as the possible
observations and as many columns as the possible actions. The element (s,a) is the expected
cumulative long-term reward for taking action a from observed state s. The elements of this array are
the learnable parameters of the critic.

basisFcn — Custom basis function


function handle


Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the critic is the scalar c = W'*B, where W is a weight vector containing the learnable parameters,
and B is the column vector returned by the custom basis function.

Your basis function must have the following signature.


B = myBasisFunction(obs1,obs2,...,obsN,act)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo and act is an input with the same
data type and dimension as the environment action channel defined in actionInfo.

For an example on how to use a basis function to create a Q-value function critic with a mixed
continuous and discrete observation space, see “Create Mixed Observation Space Q-Value Function
Critic from Custom Basis Function” on page 3-273.
Example: @(obs1,obs2,act) [act(2)*obs1(1)^2; abs(obs2(5)+act(1))]

W0 — Initial value of the basis function weights


column vector

Initial value of the basis function weights W, specified as a column vector having the same length as
the vector returned by the basis function.
Name-Value Pair Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the


argument name and Value is the corresponding value. Name-value arguments must appear after
other arguments, but the order of the pairs does not matter.
Example: UseDevice="gpu"

ActionInputNames — Network input layers name corresponding to the environment action


channel
string array | cell array of character vectors

Network input layers name corresponding to the environment action channel, specified as a string
array or a cell array of character vectors. The function assigns the environment action channel
specified in actionInfo to the specified network input layer. Therefore, the specified network input
layer must have the same data type and dimensions as defined in actionInfo.

Note The function does not use the name or the description (if any) of the action channel specified in
actionInfo.

This name-value argument is supported only when the approximation model is a deep neural network.
Example: ActionInputNames="myNetOutput_Force"

ObservationInputNames — Network input layers names corresponding to the environment


observation channels
string array | cell array of character vectors

Network input layers names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. The function assigns, in sequential order, each


environment observation channel specified in observationInfo to each specified network input


layer. Therefore, the specified network input layers, ordered as indicated in this argument, must have
the same data type and dimensions as the observation specifications, as ordered in
observationInfo.

Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.

This name-value argument is supported only when the approximation model is a deep neural network.
Example: ObservationInputNames={"NetInput1_airspeed","NetInput2_altitude"}

Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


containing a mix of such objects. Each element in the array defines the properties of an environment
observation channel, such as its dimensions, data type, and name. Note that only the data type and
dimension of a channel are used by the software to create actors or critics, but not its (optional)
name.

rlQValueFunction sets the ObservationInfo property of critic to the input argument


observationInfo.

You can extract ObservationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually.

ActionInfo — Action specifications


rlFiniteSetSpec object | rlNumericSpec object

Action specifications, specified as an rlFiniteSetSpec or rlNumericSpec object. This object


defines the properties of the environment action channel, such as its dimensions, data type, and
name. Note that the function does not use the name of the action channel specified in actionInfo.

Note Only one action channel is allowed.

rlQValueFunction sets the ActionInfo property of critic to the input actionInfo.

You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".


The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations on a CPU.

To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"

Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning
agent
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlQAgent Q-learning reinforcement learning agent
rlSARSAAgent SARSA reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and
actions
getMaxQValue Obtain maximum estimated value over all possible actions from a Q-value
function critic with discrete action space, given environment observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic

Examples

Create Q-Value Function Critic from Deep Neural Network

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.


obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.

actInfo = rlNumericSpec([2 1]);

To approximate the Q-value function within the critic, use a deep neural network.

The network must have two inputs, one for the observation and one for the action. The observation
input must accept a four-element vector (the observation vector defined by obsInfo). The action
input must accept a two-element vector (the action vector defined by actInfo). The output of the
network must be a scalar, representing the expected cumulative long-term reward when the agent
starts from the given observation and takes the given action.

You can also obtain the number of observations from the obsInfo specification. Regardless of
whether the observation space is a column vector, row vector, or matrix,
prod(obsInfo.Dimension) is the total number of observation elements (four in this case). Similarly,
prod(actInfo.Dimension) is the total number of action elements (two in this case).

Create the network as an array of layer objects.

% Observation path layers


obsPath = [featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(5)
reluLayer
fullyConnectedLayer(5,Name="obsout")];

% Action path layers


actPath = [featureInputLayer(prod(actInfo.Dimension))
fullyConnectedLayer(5)
reluLayer
fullyConnectedLayer(5,Name="actout")];

% Common path to output layers


% Concatenate two layers along dimension one
comPath = [concatenationLayer(1,2,Name="cct")
fullyConnectedLayer(5)
reluLayer
fullyConnectedLayer(1, Name="output")];

% Add layers to network object


net = addLayers(layerGraph(obsPath),actPath);
net = addLayers(net,comPath);

% Connect layers
net = connectLayers(net,"obsout","cct/in1");
net = connectLayers(net,"actout","cct/in2");

% Plot network
plot(net)


% Convert the network to a dlnetwork object


net = dlnetwork(net);

% Summarize properties
summary(net)

Initialized: true

Number of learnables: 161

Inputs:
1 'input' 4 features
2 'input_1' 2 features

Create the critic with rlQValueFunction, using the network as well as the observations and action
specification objects. When using this syntax, the network input layers are automatically associated
with the components of the observation and action signals according to the dimension specifications
in obsInfo and actInfo.
critic = rlQValueFunction(net,obsInfo,actInfo)

critic =
rlQValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlNumericSpec]
UseDevice: "cpu"


To check your critic, use the getValue function to return the value of a random observation and
action, given the current network weights.

v = getValue(critic, ...
{rand(obsInfo.Dimension)}, ...
{rand(actInfo.Dimension)})

v = single
-1.1006

You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent).

Create Q-Value Function Critic from Deep Neural Network Specifying Layer Names

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.

obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.

actInfo = rlNumericSpec([2 1]);

To approximate the Q-value function within the critic, use a deep neural network .

The network must have two inputs, one for the observation and one for the action. The observation
input (here called netObsInput) must accept a four-element vector (the observation vector defined
by obsInfo). The action input (here called netActInput) must accept a two-element vector (the
action vector defined by actInfo). The output of the network must be a scalar, representing the
expected cumulative long-term reward when the agent starts from the given observation and takes
the given action.

You can also obtain the number of observations from the obsInfo specification object. Regardless of
whether the observation space is a column vector, row vector, or matrix,
prod(obsInfo.Dimension) is the total number of observation elements (four in this case). Similarly,
prod(actInfo.Dimension) is the total number of action elements (two in this case).

To create the neural network paths, use vectors of layer objects. Name the network input layers for
the observation and action netObsInput and netActInput, respectively.

% Observation path layers


obsPath = [featureInputLayer( ...
prod(obsInfo.Dimension), ...
Name="netObsInput")
fullyConnectedLayer(5)
reluLayer
fullyConnectedLayer(5,Name="obsout")];

% Action path layers


actPath = [featureInputLayer( ...


prod(actInfo.Dimension), ...
Name="netActInput")
fullyConnectedLayer(5)
reluLayer
fullyConnectedLayer(5,Name="actout")];

% Common path to output layers


% Concatenate two layers along dimension one
comPath = [concatenationLayer(1,2,Name="cct")
fullyConnectedLayer(5)
reluLayer
fullyConnectedLayer(1, Name="output")];

% Add layers to network object


net = addLayers(layerGraph(obsPath),actPath);
net = addLayers(net,comPath);

% Connect layers
net = connectLayers(net,"obsout","cct/in1");
net = connectLayers(net,"actout","cct/in2");

% Plot network
plot(net)

% Convert to dlnetwork object


net = dlnetwork(net);

% Summarize properties
summary(net);

Initialized: true

Number of learnables: 161

Inputs:
1 'netObsInput' 4 features
2 'netActInput' 2 features

Create the critic with rlQValueFunction, using the network, the observations and action
specification objects, and the names of the network input layers to be associated with the observation
and action from the environment.
critic = rlQValueFunction(net,...
obsInfo,actInfo, ...
ObservationInputNames="netObsInput",...
ActionInputNames="netActInput")

critic =
rlQValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlNumericSpec]
UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a random observation and
action, given the current network weights.
v = getValue(critic, ...
{rand(obsInfo.Dimension)}, ...
{rand(actInfo.Dimension)})

v = single
-1.1006

You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent).

Create Q-Value Function Critic from Table

Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set of four possible values.
obsInfo = rlFiniteSetSpec([7 5 3 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example define the
action space as a finite set with 2 possible values.
actInfo = rlFiniteSetSpec([4 8]);


Create a table to approximate the value function within the critic. rlTable creates a value table
object from the observation and action specifications objects.
qTable = rlTable(obsInfo,actInfo);

The table stores a value (representing the expected cumulative long term reward) for each possible
observation-action pair. Each row corresponds to an observation and each column corresponds to an
action. You can access the table using the Table property of the qTable object. The initial value of
each element is zero.
qTable.Table

ans = 4×2

0 0
0 0
0 0
0 0

You can initialize the table to any value, in this case an array containing the integers from 1 through 8.
qTable.Table=reshape(1:8,4,2)

qTable =
rlTable with properties:

Table: [4x2 double]

Create the critic using the table as well as the observations and action specification objects.
critic = rlQValueFunction(qTable,obsInfo,actInfo)

critic =
rlQValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlFiniteSetSpec]


ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a given observation and action,
using the current table entries.
v = getValue(critic,{5},{8})

v = 6

You can now use the critic (along with an actor) to create a discrete action space agent
relying on a Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).

Create Q-Value Function Critic from Custom Basis Function

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a


continuous three-dimensional space, so that a single observation is a column vector containing three
doubles.

obsInfo = rlNumericSpec([3 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.

actInfo = rlNumericSpec([2 1]);

Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations and
actions respectively defined by obsInfo and actInfo.

myBasisFcn = @(myobs,myact) [ ...


myobs(2)^2; ...
myobs(1)+exp(myact(1)); ...
abs(myact(2)); ...
myobs(3) ]

myBasisFcn = function_handle with value:


@(myobs,myact)[myobs(2)^2;myobs(1)+exp(myact(1));abs(myact(2));myobs(3)]

The output of the critic is the scalar W'*myBasisFcn(myobs,myact), where W is a weight column
vector which must have the same size as the custom basis function output. This output is the expected
cumulative long term reward when the agent starts from the given observation and takes the given
action. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = [1;4;4;2];

Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second and third arguments are, respectively, the
observation and action specification objects.

critic = rlQValueFunction({myBasisFcn,W0}, ...


obsInfo,actInfo)

critic =
rlQValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


ActionInfo: [1x1 rl.util.rlNumericSpec]
UseDevice: "cpu"

To check your critic, use getValue to return the value of a given observation-action pair, using the
current parameter vector.

v = getValue(critic,{[1 2 3]'},{[4 5]'})

v = 252.3926

You can now use the critic (along with an with an actor) to create an agent relying on a Q-value
function critic (such as rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent).


Create Mixed Observation Space Q-Value Function Critic from Custom Basis Function

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels, the first is a vector over a continuous two-dimensional space and the
second is a vector over a three-dimensional space that can assume only four values.

obsInfo = [rlNumericSpec([1 2])


rlFiniteSetSpec({[1 0 -1], ...
[-1 2 1], ...
[0.1 0.2 0.3], ...
[0 0 0]})];

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as discrete set
consisting of three possible actions, 1, 2, and 3.

actInfo = rlFiniteSetSpec({1,2,3});

Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations and
the action respectively defined by obsInfo and actInfo. Note that the selected action, as defined,
has only one element, while the first observation channel has two elements and the second has three.

myBasisFcn = @(obsA,obsB,act) [obsA(1)+obsB(2)+obsB(3)+act(1);


obsA(2)+obsB(1)+obsB(2)-act(1);
obsA(1)+obsB(2)+obsB(3)+act(1)^2;
obsA(1)+obsB(1)+obsB(2)-act(1)^2];

The output of the critic is the scalar W'*myBasisFcn(obsA,obsB,act), where W is a weight column
vector that must have the same size as the custom basis function output. This output is the expected
cumulative long term reward when the agent starts from the given observation and takes the action
specified as last input. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = ones(4,1);

Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second and third arguments are, respectively, the
observation and action specification objects.

critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo)

critic =
rlQValueFunction with properties:

ObservationInfo: [2x1 rl.util.RLDataSpec]


ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a given observation-action pair,
using the current parameter vector.


v = getValue(critic,{[-0.5 0.6],[1 0 -1]},{3})

v = -0.9000

Note that the critic does not enforce the set constraint for the discrete set elements.

v = getValue(critic,{[-0.5 0.6],[10 -10 -0.05]},{33})

v = -21.0000

You can now use the critic (along with an actor) to create an agent with a discrete action
space relying on a Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).

Version History
Introduced in R2022a

See Also
Functions
rlValueFunction | rlVectorQValueFunction | rlTable | getActionInfo |
getObservationInfo

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlQValueRepresentation
(Not recommended) Q-Value function critic representation for reinforcement learning agents

Note rlQValueRepresentation is not recommended. Use rlQValueFunction or


rlVectorQValueFunction instead. For more information, see “rlQValueRepresentation is not
recommended”.

Description
This object implements a Q-value function approximator to be used as a critic within a reinforcement
learning agent. A Q-value function is a function that maps an observation-action pair to a scalar value
representing the expected total long-term rewards that the agent is expected to accumulate when it
starts from the given observation and executes the given action. Q-value function critics therefore
need both observations and actions as inputs. After you create an rlQValueRepresentation critic,
use it to create an agent relying on a Q-value function critic, such as an rlQAgent, rlDQNAgent,
rlSARSAAgent, rlDDPGAgent, or rlTD3Agent. For more information on creating representations,
see “Create Policies and Value Functions”.

Creation
Syntax
critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName,'Action',actName)
critic = rlQValueRepresentation(tab,observationInfo,actionInfo)
critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)

critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName)
critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)

critic = rlQValueRepresentation( ___ ,options)

Description
Scalar Output Q-Value Critic

critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName,'Action',actName) creates the Q-value function critic. net is the deep neural
network used as an approximator, and must have both observations and action as inputs, and a single
scalar output. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively
to the inputs observationInfo and actionInfo, containing the observations and action
specifications. obsName must contain the names of the input layers of net that are associated with
the observation specifications. The action name actName must be the name of the input layer of net
that is associated with the action specifications.

critic = rlQValueRepresentation(tab,observationInfo,actionInfo) creates the Q-


value function based critic with discrete action and observation spaces from the Q-value table tab.


tab is an rlTable object containing a table with as many rows as the possible observations and as
many columns as the possible actions. This syntax sets the ObservationInfo and ActionInfo properties
of critic respectively to the inputs observationInfo and actionInfo, which must be
rlFiniteSetSpec objects containing the specifications for the discrete observations and action
spaces, respectively.

critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)
creates a Q-value function based critic using a custom basis function as underlying approximator.
The first input argument is a two-element cell array in which the first element contains the handle
basisFcn to a custom basis function, and the second element contains the initial weight vector W0.
Here the basis function must have both observations and action as inputs and W0 must be a column
vector. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively to the
inputs observationInfo and actionInfo.
Multi-Output Discrete Action Space Q-Value Critic

critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName) creates the multi-output Q-value function critic for a discrete action space. net is the
deep neural network used as an approximator, and must have only the observations as input and a
single output layer having as many elements as the number of possible discrete actions. This syntax
sets the ObservationInfo and ActionInfo properties of critic respectively to the inputs
observationInfo and actionInfo, containing the observations and action specifications. Here,
actionInfo must be an rlFiniteSetSpec object containing the specifications for the discrete
action space. The observation names obsName must be the names of the input layers of net.

critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)
creates the multi-output Q-value function critic for a discrete action space using a custom basis
function as underlying approximator. The first input argument is a two-element cell array in which the first
element contains the handle basisFcn to a custom basis function, and the second element contains
the initial weight matrix W0. Here the basis function must have only the observations as inputs, and
W0 must have as many columns as the number of possible actions. This syntax sets the
ObservationInfo and ActionInfo properties of critic respectively to the inputs observationInfo
and actionInfo.
Options

critic = rlQValueRepresentation( ___ ,options) creates the value function based critic
using the additional option set options, which is an rlRepresentationOptions object. This
syntax sets the Options property of critic to the options input argument. You can use this syntax
with any of the previous input-argument combinations.

Input Arguments

net — Deep neural network


array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object |
dlNetwork object

Deep neural network used as the underlying approximator within the critic, specified as one of the
following:

• Array of Layer objects


• layerGraph object
• DAGNetwork object


• SeriesNetwork object
• dlnetwork object

For single output critics, net must have both observations and actions as inputs, and a scalar output,
representing the expected cumulative long-term reward when the agent starts from the given
observation and takes the given action. For multi-output discrete action space critics, net must have
only the observations as input and a single output layer having as many elements as the number of
possible discrete actions. Each output element represents the expected cumulative long-term reward
when the agent starts from the given observation and takes the corresponding action. The learnable
parameters of the critic are the weights of the deep neural network.

The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names listed in obsName.

The network output layer must have the same data type and dimension as the signal defined in
ActionInfo. Its name must be the action name specified in actName.

rlQValueRepresentation objects support recurrent deep neural networks for multi-output


discrete action space critics.

For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.

obsName — Observation names


string | character vector | cell array of character vectors

Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the observation input layers in net.
Example: {'my_obs'}

actName — Action name


string | character vector | single-element cell array containing a character vector

Action name, specified as a single-element cell array that contains a string or character vector. It
must be the name of the action input layer of net.
Example: {'my_act'}

tab — Q-value table


rlTable object

Q-value table, specified as an rlTable object containing an array with as many rows as the possible
observations and as many columns as the possible actions. The element (s,a) is the expected
cumulative long-term reward for taking action a from observed state s. The elements of this array are
the learnable parameters of the critic.

basisFcn — Custom basis function


function handle

Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the critic is c = W'*B, where W is a weight vector or matrix containing the learnable parameters,
and B is the column vector returned by the custom basis function.


For a single-output Q-value critic, c is a scalar representing the expected cumulative long term
reward when the agent starts from the given observation and takes the given action. In this case,
your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN,act)

For a multiple-output Q-value critic with a discrete action space, c is a vector in which each element
is the expected cumulative long term reward when the agent starts from the given observation and
takes the action corresponding to the position of the considered element. In this case, your basis
function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are observations in the same order and with the same data type and dimensions
as the signals defined in observationInfo and act has the same data type and dimensions as the
action specifications in actionInfo.
Example: @(obs1,obs2,act) [act(2)*obs1(1)^2; abs(obs2(5)+act(1))]

W0 — Initial value of the basis function weights


matrix

Initial value of the basis function weights, W. For a single-output Q-value critic, W is a column vector
having the same length as the vector returned by the basis function. For a multiple-output Q-value
critic with a discrete action space, W is a matrix which must have as many rows as the length of the
basis function output, and as many columns as the number of possible actions.

Properties
Options — Representation options
rlRepresentationOptions object

Representation options, specified as an rlRepresentationOptions object. Available options


include the optimizer used for training and the learning rate.

ObservationInfo — Observation specifications


rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


containing a mix of such objects. These objects define properties such as the dimensions, data type,
and names of the observation signals.

rlQValueRepresentation sets the ObservationInfo property of critic to the input


observationInfo.

You can extract ObservationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually.

ActionInfo — Action specifications


rlFiniteSetSpec object | rlNumericSpec object

Action specifications, specified as an rlFiniteSetSpec or rlNumericSpec object. These objects


define properties such as the dimensions, data type and name of the action signals.

rlQValueRepresentation sets the ActionInfo property of critic to the input actionInfo.


You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.

Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlQAgent Q-learning reinforcement learning agent
rlSARSAAgent SARSA reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and actions
getMaxQValue Obtain maximum estimated value over all possible actions from a Q-value function
critic with discrete action space, given environment observations

Examples

Create Q-Value Function Critic from Deep Neural Network

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.

obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.

actInfo = rlNumericSpec([2 1]);

Create a deep neural network to approximate the Q-value function. The network must have two
inputs, one for the observation and one for the action. The observation input (here called myobs)
must accept a four-element vector (the observation vector defined by obsInfo). The action input
(here called myact) must accept a two-element vector (the action vector defined by actInfo). The
output of the network must be a scalar, representing the expected cumulative long-term reward when
the agent starts from the given observation and takes the given action.

% observation path layers


obsPath = [featureInputLayer(4, ...
'Normalization','none','Name','myobs')
fullyConnectedLayer(1,'Name','obsout')];

% action path layers


actPath = [featureInputLayer(2, ...
'Normalization','none','Name','myact')
fullyConnectedLayer(1,'Name','actout')];

% common path to output layers


comPath = [additionLayer(2,'Name', 'add') ...
fullyConnectedLayer(1, 'Name', 'output')];

% add layers to network object


net = addLayers(layerGraph(obsPath),actPath);
net = addLayers(net,comPath);

% connect layers
net = connectLayers(net,'obsout','add/in1');
net = connectLayers(net,'actout','add/in2');

Create the critic with rlQValueRepresentation, using the network, the observations and action
specification objects, as well as the names of the network input layers.

critic = rlQValueRepresentation(net,obsInfo,actInfo, ...


'Observation',{'myobs'},'Action',{'myact'})

critic =
rlQValueRepresentation with properties:

ActionInfo: [1x1 rl.util.rlNumericSpec]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a random observation and
action, using the current network weights.

v = getValue(critic,{rand(4,1)},{rand(2,1)})

v = single
0.1102

You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).

Create Multi-Output Q-Value Function Critic from Deep Neural Network

This example shows how to create a multi-output Q-value function critic for a discrete action space
using a deep neural network approximator.

This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.

obsInfo = rlNumericSpec([4 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).

actInfo = rlFiniteSetSpec([7 5 3]);


Create a deep neural network approximator to approximate the Q-value function within the critic.
The input of the network (here called myobs) must accept a four-element vector, as defined by
obsInfo. The output must be a single output layer having as many elements as the number of
possible discrete actions (three in this case, as defined by actInfo).

net = [featureInputLayer(4,...
'Normalization','none','Name','myobs')
fullyConnectedLayer(3,'Name','value')];

Create the critic using the network, the observations specification object, and the name of the
network input layer.

critic = rlQValueRepresentation(net,obsInfo,actInfo,...
'Observation',{'myobs'})

critic =
rlQValueRepresentation with properties:

ActionInfo: [1x1 rl.util.rlFiniteSetSpec]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the values of a random observation, using
the current network weights. There is one value for each of the three possible actions.

v = getValue(critic,{rand(4,1)})

v = 3x1 single column vector

0.7232
0.8177
-0.2212

You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).

Create Q-Value Function Critic from Table

Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set of 4 possible values.

obsInfo = rlFiniteSetSpec([7 5 3 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example define the
action space as a finite set with 2 possible values.

actInfo = rlFiniteSetSpec([4 8]);

Create a table to approximate the value function within the critic. rlTable creates a value table
object from the observation and action specifications objects.


qTable = rlTable(obsInfo,actInfo);

The table stores a value (representing the expected cumulative long term reward) for each possible
observation-action pair. Each row corresponds to an observation and each column corresponds to an
action. You can access the table using the Table property of the qTable object. The initial value of
each element is zero.
qTable.Table

ans = 4×2

0 0
0 0
0 0
0 0

You can initialize the table to any value, in this case, an array containing the integers from 1 through
8.
qTable.Table=reshape(1:8,4,2)

qTable =
rlTable with properties:

Table: [4x2 double]

Create the critic using the table as well as the observations and action specification objects.
critic = rlQValueRepresentation(qTable,obsInfo,actInfo)

critic =
rlQValueRepresentation with properties:

ActionInfo: [1x1 rl.util.rlFiniteSetSpec]


ObservationInfo: [1x1 rl.util.rlFiniteSetSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a given observation and action,
using the current table entries.
v = getValue(critic,{5},{8})

v = 6

You can now use the critic (along with an actor) to create a discrete action space agent
relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).

Create Q-Value Function Critic from Custom Basis Function

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles.


obsInfo = rlNumericSpec([3 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing 2 doubles.

actInfo = rlNumericSpec([2 1]);

Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations and
actions respectively defined by obsInfo and actInfo.

myBasisFcn = @(myobs,myact) [...


myobs(2)^2; ...
myobs(1)+exp(myact(1)); ...
abs(myact(2)); ...
myobs(3)]

myBasisFcn = function_handle with value:


@(myobs,myact)[myobs(2)^2;myobs(1)+exp(myact(1));abs(myact(2));myobs(3)]

The output of the critic is the scalar W'*myBasisFcn(myobs,myact), where W is a weight column
vector which must have the same size as the custom basis function output. This output is the expected
cumulative long term reward when the agent starts from the given observation and takes the given
action. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = [1;4;4;2];

Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second and third arguments are, respectively, the
observation and action specification objects.

critic = rlQValueRepresentation({myBasisFcn,W0},...
obsInfo,actInfo)

critic =
rlQValueRepresentation with properties:

ActionInfo: [1×1 rl.util.rlNumericSpec]


ObservationInfo: [1×1 rl.util.rlNumericSpec]
Options: [1×1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a given observation-action pair,
using the current parameter vector.

v = getValue(critic,{[1 2 3]'},{[4 5]'})

v =
1×1 dlarray

252.3926


You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).

Create Multi-Output Q-Value Function Critic from Custom Basis Function

This example shows how to create a multi-output Q-value function critic for a discrete action space
using a custom basis function approximator.

This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous two-dimensional space, so that a single observation is a column vector containing two
doubles.
obsInfo = rlNumericSpec([2 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of 3 possible values (named 7, 5, and 3 in this case).
actInfo = rlFiniteSetSpec([7 5 3]);

Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.
myBasisFcn = @(myobs) [myobs(2)^2; ...
myobs(1); ...
exp(myobs(2)); ...
abs(myobs(1))]

myBasisFcn = function_handle with value:


@(myobs)[myobs(2)^2;myobs(1);exp(myobs(2));abs(myobs(1))]

The output of the critic is the vector c = W'*myBasisFcn(myobs), where W is a weight matrix
which must have as many rows as the length of the basis function output, and as many columns as the
number of possible actions.

Each element of c is the expected cumulative long term reward when the agent starts from the given
observation and takes the action corresponding to the position of the considered element. The
elements of W are the learnable parameters.

Define an initial parameter matrix.


W0 = rand(4,3);

Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.


critic = rlQValueRepresentation({myBasisFcn,W0},...
obsInfo,actInfo)

critic =
rlQValueRepresentation with properties:

ActionInfo: [1x1 rl.util.rlFiniteSetSpec]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the values of a random observation, using
the current parameter matrix. Note that there is one value for each of the three possible actions.

v = getValue(critic,{rand(2,1)})

v =
3x1 dlarray

2.1395
1.2183
2.3342

You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).

Create Q-Value Function Critic from Recurrent Neural Network

Create an environment and obtain observation and action information.

env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);

Create a recurrent deep neural network for your critic. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.

Create a recurrent neural network for a multi-output Q-value function representation.

criticNetwork = [
sequenceInputLayer(numObs,...
'Normalization','none','Name','state')
fullyConnectedLayer(50, 'Name', 'CriticStateFC1')
reluLayer('Name','CriticRelu1')
lstmLayer(20,'OutputMode','sequence',...
'Name','CriticLSTM');
fullyConnectedLayer(20,'Name','CriticStateFC2')
reluLayer('Name','CriticRelu2')
fullyConnectedLayer(numDiscreteAct,...
'Name','output')];

Create a representation for your critic using the recurrent neural network.


criticOptions = rlRepresentationOptions(...
'LearnRate',1e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,...
obsInfo,actInfo,...
'Observation','state',criticOptions);

Version History
Introduced in R2020a

rlQValueRepresentation is not recommended


Not recommended starting in R2022a

rlQValueRepresentation is not recommended. Use either rlQValueFunction or


rlVectorQValueFunction instead.

The following table shows some typical uses of rlQValueRepresentation to create neural
network-based critics, and how to update your code with one of the new Q-value approximator
objects instead.

Network-Based Q-Value Representation (not recommended):
myCritic = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames), with net having both observations and actions as inputs and a single scalar output.

Network-Based Q-Value Approximator (recommended):
myCritic = rlQValueFunction(net,obsInfo,actInfo,'ObservationInputNames',obsNames,'ActionInputNames',actNames). Use this syntax to create a single-output state-action value function object for a critic that takes both observation and action as inputs.

Network-Based Q-Value Representation (not recommended):
myCritic = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',obsNames), with net having only the observations as inputs and a single output layer having as many elements as the number of possible discrete actions.

Network-Based Q-Value Approximator (recommended):
myCritic = rlVectorQValueFunction(net,obsInfo,actInfo,'ObservationInputNames',obsNames). Use this syntax to create a multiple-output state-action value function object for a critic with a discrete action space. This critic takes observations as inputs, and outputs a vector in which each element is the value of one of the possible actions.
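
As an illustrative sketch of the first case above, assuming net, obsInfo, actInfo, and the input layer names "obs" and "act" are already defined, a typical migration looks like the following.

% Not recommended:
% critic = rlQValueRepresentation(net,obsInfo,actInfo, ...
%     'Observation',{'obs'},'Action',{'act'});

% Recommended:
critic = rlQValueFunction(net,obsInfo,actInfo, ...
    ObservationInputNames="obs",ActionInputNames="act");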

The following table shows some typical uses of rlQValueRepresentation to create table-based
critics with discrete observation and action spaces, and how to update your code with one of the new
Q-value approximator objects instead.

Table-Based Q-Value Representation (not recommended):
rep = rlQValueRepresentation(tab,obsInfo,actInfo), where the table tab contains a vector with as many elements as the number of possible observations plus the number of possible actions.

Table-Based Q-Value Approximator (recommended):
rep = rlQValueFunction(tab,obsInfo,actInfo). Use this syntax to create a single-output state-action value function object for a critic that takes both observations and actions as input.

Table-Based Q-Value Representation (not recommended):
rep = rlQValueRepresentation(tab,obsInfo,actInfo), where the table tab contains a Q-value table with as many rows as the number of possible observations and as many columns as the number of possible actions.

Table-Based Q-Value Approximator (recommended):
rep = rlVectorQValueFunction(tab,obsInfo,actInfo). Use this syntax to create a multiple-output state-action value function object for a critic with a discrete action space. This critic takes observations as inputs, and outputs a vector in which each element is the value of one of the possible actions. It is good practice to use critics with vector outputs when possible.

The following examples show some typical uses of rlQValueRepresentation to create critics which use a (linear in the learnable parameters) custom basis function, and how to update your code with one of the new Q-value approximator objects instead. In these function calls, the first input argument is a two-element cell array containing both the handle to the custom basis function and the initial weight vector or matrix.

Not recommended: rep = rlQValueRepresentation({basisFcn,W0},obsInfo,actInfo), where the basis function has both observations and action as inputs and W0 is a column vector.

Recommended: rep = rlQValueFunction({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a single-output state-action value function object for a critic that takes both observation and action as inputs.

Not recommended: rep = rlQValueRepresentation({basisFcn,W0},obsInfo,actInfo), where the basis function has only the observations as inputs and W0 is a matrix with as many columns as the number of possible actions.

Recommended: rep = rlVectorQValueFunction({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a multiple-output state-action value function object for a critic with a discrete action space. This critic takes observations as inputs, and outputs a vector in which each element is the value of one of the possible actions. It is good practice to use critics with vector outputs when possible.
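
For example, a minimal migration sketch for a network-based critic with a discrete action space might look as follows. Here net, obsInfo, and actInfo are assumed to already exist, and the observation input layer is assumed to be named 'state'.

% Old (not recommended):
% critic = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',{'state'});

% New (recommended):
critic = rlVectorQValueFunction(net,obsInfo,actInfo, ...
    'ObservationInputNames',{'state'});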

See Also
Functions
rlQValueFunction | rlRepresentationOptions | getActionInfo | getObservationInfo

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlReplayMemory
Replay memory experience buffer

Description
An off-policy reinforcement learning agent stores experiences in a circular experience buffer. During
training, the agent samples mini-batches of experiences from the buffer and uses these mini-batches
to update its actor and critic function approximators.

By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object
as their experience buffer. Agents uniformly sample data from this buffer. To perform nonuniform
prioritized sampling, use an rlPrioritizedReplayMemory object.

When you create a custom off-policy reinforcement learning agent, you can create an experience
buffer by using an rlReplayMemory object.

Creation
Syntax
buffer = rlReplayMemory(obsInfo,actInfo)
buffer = rlReplayMemory(obsInfo,actInfo,maxLength)

Description

buffer = rlReplayMemory(obsInfo,actInfo) creates a replay memory experience buffer that


is compatible with the observation and action specifications in obsInfo and actInfo, respectively.

buffer = rlReplayMemory(obsInfo,actInfo,maxLength) sets the maximum length of the


buffer by setting the MaxLength property.

Input Arguments

obsInfo — Observation specifications


specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data types, and names of the
observation signals.

You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

actInfo — Action specifications


specification object | array of specification objects

Action specifications, specified as a reinforcement learning specification object defining properties


such as dimensions, data types, and names of the action signals.


You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.

Properties
MaxLength — Maximum buffer length
10000 (default) | positive integer

This property is read-only.

Maximum buffer length, specified as a positive integer.

Length — Number of experiences in buffer


0 (default) | nonnegative integer

This property is read-only.

Number of experiences in buffer, specified as a nonnegative integer.

Object Functions
append Append experiences to replay memory buffer
sample Sample experiences from replay memory buffer
resize Resize replay memory experience buffer
allExperiences Return all experiences in replay memory buffer
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer

Examples

Create Experience Buffer

Define observation specifications for the environment. For this example, assume that the environment
has a single observation channel with three continuous signals in specified ranges.

obsInfo = rlNumericSpec([3 1],...


LowerLimit=0,...
UpperLimit=[1;5;10]);

Define action specifications for the environment. For this example, assume that the environment has
a single action channel with two continuous signals in specified ranges.

actInfo = rlNumericSpec([2 1],...


LowerLimit=0,...
UpperLimit=[5;10]);

Create an experience buffer with a maximum length of 20,000.

buffer = rlReplayMemory(obsInfo,actInfo,20000);


Append a single experience to the buffer using a structure. Each experience contains the following
elements: current observation, action, next observation, reward, and is-done.

For this example, create an experience with random observation, action, and reward values. Indicate
that this experience is not a terminal condition by setting the IsDone value to 0.

exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;

Append the experience to the buffer.

append(buffer,exp);

You can also append a batch of experiences to the experience buffer using a structure array. For this
example, append a sequence of 100 random experiences, with the final experience representing a
terminal condition.

for i = 1:100
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;

append(buffer,expBatch);

After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.

miniBatch = sample(buffer,50);

You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive
experiences with a discount factor of 0.95.

horizonSample = sample(buffer,1,...
NStepHorizon=10,...
DiscountFactor=0.95);

The returned sample includes the following information.

• Observation and Action are the observation and action from the first experience in the
horizon.
• NextObservation and IsDone are the next observation and termination signal from the final
experience in the horizon.
• Reward is the cumulative reward across the horizon using the specified discount factor.

You can also sample a sequence of consecutive experiences. In this case, the structure fields contain
arrays with values for all sampled experiences.

sequenceSample = sample(buffer,1,...
SequenceLength=20);


Create Experience Buffer with Multiple Observation Channels

Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.
obsContinuous = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[1;5]);
obsDiscrete = rlFiniteSetSpec([1 2 3]);
obsInfo = [obsContinuous obsDiscrete];

Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.
actInfo = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[5;10]);

Create an experience buffer with a maximum length of 5,000.


buffer = rlReplayMemory(obsInfo,actInfo,5000);

Append a sequence of 50 random experiences to the buffer.


for i = 1:50
exp(i).Observation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
exp(i).NextObservation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Reward = 10*rand(1);
exp(i).IsDone = 0;
end

append(buffer,exp);

After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.
miniBatch = sample(buffer,10);

Resize Experience Buffer in Reinforcement Learning Agent

Create an environment for training the agent. For this example, load a predefined environment.
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

Extract the observation and action specifications from the agent.


obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a DQN agent from the environment specifications.


agent = rlDQNAgent(obsInfo,actInfo);

By default, the agent uses an experience buffer with a maximum size of 10,000.

agent.ExperienceBuffer

ans =
rlReplayMemory with properties:

MaxLength: 10000
Length: 0

Increase the maximum size of the experience buffer to 20,000.

resize(agent.ExperienceBuffer,20000)

View the updated experience buffer.

agent.ExperienceBuffer

ans =
rlReplayMemory with properties:

MaxLength: 20000
Length: 0
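
To sample nonuniformly instead, you can replace the agent experience buffer with an rlPrioritizedReplayMemory object. The following line is a sketch based on the specifications extracted above.

agent.ExperienceBuffer = rlPrioritizedReplayMemory(obsInfo,actInfo,20000);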

Version History
Introduced in R2022a

See Also
rlPrioritizedReplayMemory


rlRepresentationOptions
(Not recommended) Options set for reinforcement learning agent representations (critics and actors)

Note rlRepresentationOptions is not recommended. Use an rlOptimizerOptions object


within an agent options object instead. For more information, see “rlRepresentationOptions is not
recommended”.

Description
Use an rlRepresentationOptions object to specify an options set for critics
(rlValueRepresentation, rlQValueRepresentation) and actors
(rlDeterministicActorRepresentation, rlStochasticActorRepresentation).

Creation

Syntax
repOpts = rlRepresentationOptions
repOpts = rlRepresentationOptions(Name,Value)

Description

repOpts = rlRepresentationOptions creates a default option set to use as a last argument


when creating a reinforcement learning actor or critic. You can modify the object properties using dot
notation.

repOpts = rlRepresentationOptions(Name,Value) creates an options set with the specified


properties using one or more name-value pair arguments.

Properties
LearnRate — Learning rate for the representation
0.01 (default) | positive scalar

Learning rate for the representation, specified as a positive scalar. If the learning rate is too low, then
training takes a long time. If the learning rate is too high, then training might reach a suboptimal
result or diverge.
Example: 'LearnRate',0.025

Optimizer — Optimizer for representation


"adam" (default) | "sgdm" | "rmsprop"

Optimizer for training the network of the representation, specified as one of the following values.


• "adam" — Use the Adam optimizer. You can specify the decay rates of the gradient and squared
gradient moving averages using the GradientDecayFactor and
SquaredGradientDecayFactor fields of the OptimizerParameters option.
• "sgdm" — Use the stochastic gradient descent with momentum (SGDM) optimizer. You can specify
the momentum value using the Momentum field of the OptimizerParameters option.
• "rmsprop" — Use the RMSProp optimizer. You can specify the decay rate of the squared gradient
moving average using the SquaredGradientDecayFactor fields of the OptimizerParameters
option.

For more information about these optimizers, see “Stochastic Gradient Descent” in the Algorithms
section of trainingOptions in Deep Learning Toolbox.
Example: 'Optimizer',"sgdm"

OptimizerParameters — Applicable parameters for optimizer


OptimizerParameters object

Applicable parameters for the optimizer, specified as an OptimizerParameters object with the
following parameters.

• Momentum — Contribution of previous step, specified as a scalar from 0 to 1. A value of 0 means no contribution from the previous step. A value of 1 means maximal contribution. This parameter applies only when Optimizer is "sgdm". In that case, the default value is 0.9. This default value works well for most problems.
• Epsilon — Denominator offset, specified as a positive scalar. The optimizer adds this offset to the denominator in the network parameter updates to avoid division by zero. This parameter applies only when Optimizer is "adam" or "rmsprop". In that case, the default value is 1e-8. This default value works well for most problems.
• GradientDecayFactor — Decay rate of gradient moving average, specified as a positive scalar from 0 to 1. This parameter applies only when Optimizer is "adam". In that case, the default value is 0.9. This default value works well for most problems.
• SquaredGradientDecayFactor — Decay rate of squared gradient moving average, specified as a positive scalar from 0 to 1. This parameter applies only when Optimizer is "adam" or "rmsprop". In that case, the default value is 0.999. This default value works well for most problems.


When a particular property of OptimizerParameters is not applicable to the optimizer type


specified in the Optimizer option, that property is set to "Not applicable".

To change the default values, create an rlRepresentationOptions set and use dot notation to
access and change the properties of OptimizerParameters.
repOpts = rlRepresentationOptions;
repOpts.OptimizerParameters.GradientDecayFactor = 0.95;

GradientThreshold — Threshold value for gradient


Inf (default) | positive scalar

Threshold value for the representation gradient, specified as Inf or a positive scalar. If the gradient
exceeds this value, the gradient is clipped as specified by the GradientThresholdMethod option.
Clipping the gradient limits how much the network parameters change in a training iteration.
Example: 'GradientThreshold',1

GradientThresholdMethod — Gradient threshold method


"l2norm" (default) | "global-l2norm" | "absolute-value"

Gradient threshold method used to clip gradient values that exceed the gradient threshold, specified
as one of the following values.

• "l2norm" — If the L2 norm of the gradient of a learnable parameter is larger than


GradientThreshold, then scale the gradient so that the L2 norm equals GradientThreshold.
• "global-l2norm" — If the global L2 norm, L, is larger than GradientThreshold, then scale all
gradients by a factor of GradientThreshold/L. The global L2 norm considers all learnable
parameters.
• "absolute-value" — If the absolute value of an individual partial derivative in the gradient of a
learnable parameter is larger than GradientThreshold, then scale the partial derivative to have
magnitude equal to GradientThreshold and retain the sign of the partial derivative.

For more information, see “Gradient Clipping” in the Algorithms section of trainingOptions in
Deep Learning Toolbox.
Example: 'GradientThresholdMethod',"absolute-value"

L2RegularizationFactor — Factor for L2 regularization


0.0001 (default) | nonnegative scalar

Factor for L2 regularization (weight decay), specified as a nonnegative scalar. For more information,
see “L2 Regularization” in the Algorithms section of trainingOptions in Deep Learning Toolbox.

To avoid overfitting when using a representation with many parameters, consider increasing the
L2RegularizationFactor option.
Example: 'L2RegularizationFactor',0.0005

UseDevice — Computation device for training


"cpu" (default) | "gpu"

Computation device used to perform deep neural network operations such as gradient computation,
parameter update and prediction during training. It is specified as either "cpu" or "gpu".


The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating an agent on a GPU involves device-specific numerical round off errors.
These errors can produce different results compared to performing the same operations on a CPU.

Note that if you want to use parallel processing to speed up training, you do not need to set
UseDevice. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: 'UseDevice',"gpu"
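
For example, the following sketch enables parallel training through the training options rather than through the representation options; agent and env are assumed to already exist in the workspace.

trainOpts = rlTrainingOptions('UseParallel',true);
trainingStats = train(agent,env,trainOpts);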

Object Functions
rlValueRepresentation (Not recommended) Value function critic representation for
reinforcement learning agents
rlQValueRepresentation (Not recommended) Q-Value function critic representation for
reinforcement learning agents
rlDeterministicActorRepresentation (Not recommended) Deterministic actor representation for
reinforcement learning agents
rlStochasticActorRepresentation (Not recommended) Stochastic actor representation for
reinforcement learning agents

Examples

Configure Options for Creating Representation

Create an options set for creating a critic or actor representation for a reinforcement learning agent.
Set the learning rate for the representation to 0.05, and set the gradient threshold to 1. You can set
the options using Name,Value pairs when you create the options set. Any options that you do not
explicitly set have their default values.
repOpts = rlRepresentationOptions('LearnRate',5e-2,...
'GradientThreshold',1)

repOpts =
rlRepresentationOptions with properties:

LearnRate: 0.0500
GradientThreshold: 1
GradientThresholdMethod: "l2norm"
L2RegularizationFactor: 1.0000e-04
UseDevice: "cpu"
Optimizer: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]

Alternatively, create a default options set and use dot notation to change some of the values.


repOpts = rlRepresentationOptions;
repOpts.LearnRate = 5e-2;
repOpts.GradientThreshold = 1

repOpts =
rlRepresentationOptions with properties:

LearnRate: 0.0500
GradientThreshold: 1
GradientThresholdMethod: "l2norm"
L2RegularizationFactor: 1.0000e-04
UseDevice: "cpu"
Optimizer: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]

If you want to change the properties of the OptimizerParameters option, use dot notation to
access them.

repOpts.OptimizerParameters.Epsilon = 1e-7;
repOpts.OptimizerParameters

ans =
OptimizerParameters with properties:

Momentum: "Not applicable"


Epsilon: 1.0000e-07
GradientDecayFactor: 0.9000
SquaredGradientDecayFactor: 0.9990

Version History
Introduced in R2019a

rlRepresentationOptions is not recommended


Not recommended starting in R2022a

rlRepresentationOptions objects are no longer recommended. To specify optimization options


for actors and critics, use rlOptimizerOptions objects instead.

Specifically, you can create an agent options object and set its CriticOptimizerOptions and
ActorOptimizerOptions properties to suitable rlOptimizerOptions objects. Then you pass the
agent options object to the function that creates the agent. This workflow is shown in the following
table.


rlRepresentationOptions (not recommended):

crtOpts = rlRepresentationOptions(...
    'GradientThreshold',1);

critic = rlValueRepresentation(...
    net,obsInfo,'Observation',{'obs'},crtOpts);

rlOptimizerOptions (recommended):

criticOpts = rlOptimizerOptions(...
    'GradientThreshold',1);

agentOpts = rlACAgentOptions(...
    'CriticOptimizerOptions',criticOpts);

agent = rlACAgent(actor,critic,agentOpts)

Alternatively, you can create the agent and then use dot notation to access the optimization options
for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;.

See Also
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlSACAgent
Soft actor-critic reinforcement learning agent

Description
The soft actor-critic (SAC) algorithm is a model-free, online, off-policy, actor-critic reinforcement
learning method. The SAC algorithm computes an optimal policy that maximizes both the long-term
expected reward and the entropy of the policy. The policy entropy is a measure of policy uncertainty
given the state. A higher entropy value promotes more exploration. Maximizing both the reward and
the entropy balances exploration and exploitation of the environment. The action space can only be
continuous.

For more information, see “Soft Actor-Critic Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
agent = rlSACAgent(observationInfo,actionInfo)
agent = rlSACAgent(observationInfo,actionInfo,initOptions)

agent = rlSACAgent(actor,critics)

agent = rlSACAgent( ___ ,agentOptions)

Description

Create Agent from Observation and Action Specifications

agent = rlSACAgent(observationInfo,actionInfo) creates a SAC agent for an environment


with the given observation and action specifications, using default initialization options. The actor
and critics in the agent use default deep neural networks built using the observation specification
observationInfo and action specification actionInfo. The ObservationInfo and ActionInfo
properties of agent are set to the observationInfo and actionInfo input arguments,
respectively.

agent = rlSACAgent(observationInfo,actionInfo,initOptions) creates a SAC agent with


deep neural networks configured using the specified initialization options (initOptions).

Create Agent from Actor and Critic

agent = rlSACAgent(actor,critics) creates a SAC agent with the specified actor and critic
networks and default agent options.


Specify Agent Options

agent = rlSACAgent( ___ ,agentOptions) sets the AgentOptions property for any of the
previous syntaxes.

Input Arguments

initOptions — Agent initialization options


rlAgentInitializationOptions object

Agent initialization options, specified as an rlAgentInitializationOptions object.

actor — Actor
rlContinuousGaussianActor object

Actor that implements the policy, specified as an rlContinuousGaussianActor function


approximator object. For more information on creating actor approximators, see “Create Policies and
Value Functions”.

critics — Critic
rlQValueFunction object | two-element row vector of rlQValueFunction objects

Critic, specified as one of the following:

• rlQValueFunction object — Create a SAC agent with a single Q-value function.


• Two-element row vector of rlQValueFunction objects — Create a SAC agent with two critic
value functions. The two critics must be unique rlQValueFunction objects with the same
observation and action specifications. The critics can either have different structures or the same
structure but with different initial parameters.

For a SAC agent, each critic must be a single-output rlQValueFunction object that takes both the
action and observations as inputs.

For more information on creating critics, see “Create Policies and Value Functions”.

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data type, and names of the observation
signals.

If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.

You can extract observationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specification


rlNumericSpec object


Action specification for a continuous action space, specified as an rlNumericSpec object defining
properties such as dimensions, data type and name of the action signals.

If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

AgentOptions — Agent options


rlSACAgentOptions object

Agent options, specified as an rlSACAgentOptions object.

If you create a SAC agent with default actor and critic that use recurrent neural networks, the default
value of AgentOptions.SequenceLength is 32.

ExperienceBuffer — Experience buffer


rlReplayMemory object

Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer. Here:

• S is the current observation of the environment.


• A is the action taken by the agent.
• R is the reward for taking action A.
• S' is the next observation after taking action A.
• D is the is-done signal after taking action A.

UseExplorationPolicy — Option to use exploration policy


true (default) | false

Option to use exploration policy when selecting actions, specified as one of the following logical values.

• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.

Note This option affects only simulation and deployment; it does not affect training.
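
For example, the following sketch simulates an agent with its greedy (deterministic) policy; agent and env are assumed to already exist.

agent.UseExplorationPolicy = false;
experience = sim(env,agent);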

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.


Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object

Examples

Create SAC Agent from Observation and Action Specifications

Create environment and obtain observation and action specifications. For this example, load the
environment used in the example “Train DDPG Agent to Control Double Integrator System”. The
observation from the environment is a vector containing the position and velocity of a mass. The
action is a scalar representing a force, applied to the mass, ranging continuously from -2 to 2
Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.

rng(0)

Create a SAC agent from the environment observation and action specifications.

agent = rlSACAgent(obsInfo,actInfo);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension)})

ans = 1x1 cell array


{[0.0546]}


You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
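
For example, a quick sketch using the agent created above:

actor = getActor(agent);
actorNet = getModel(actor);
critics = getCritic(agent);        % SAC agents use two Q-value critics
criticNet1 = getModel(critics(1));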

Create SAC Agent Using Initialization Options

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force, applied to the mass,
ranging continuously from -2 to 2 Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons.

initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);

The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.

rng(0)

Create a SAC agent from the environment observation and action specifications using the
initialization options.

agent = rlSACAgent(obsInfo,actInfo,initOpts);

Extract the deep neural network from the actor.

actorNet = getModel(getActor(agent));

Extract the deep neural networks from the two critics. Note that getModel(critics) only returns
the first critic network.

critics = getCritic(agent);
criticNet1 = getModel(critics(1));
criticNet2 = getModel(critics(2));

Display the layers of the first critic network, and verify that each hidden fully connected layer has 128
neurons.

criticNet1.Layers

ans =
9x1 Layer array with layers:

1 'concat' Concatenation Concatenation of 2 inputs along dimension 1


2 'relu_body' ReLU ReLU
3 'fc_body' Fully Connected 128 fully connected layer
4 'body_output' ReLU ReLU
5 'input_1' Feature Input 2 features


6 'fc_1' Fully Connected 128 fully connected layer


7 'input_2' Feature Input 1 features
8 'fc_2' Fully Connected 128 fully connected layer
9 'output' Fully Connected 1 fully connected layer

Plot the networks of the actor and of the second critic, and display the number of weights.

plot(layerGraph(actorNet))

summary(actorNet)

Initialized: true

Number of learnables: 17.1k

Inputs:
1 'input_1' 2 features

plot(layerGraph(criticNet2))


summary(criticNet2)

Initialized: true

Number of learnables: 33.6k

Inputs:
1 'input_1' 2 features
2 'input_2' 1 features

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension)})

ans = 1x1 cell array


{[-0.9867]}

You can now test and train the agent within the environment.

Create SAC Agent from Actor and Critics

Create an environment and obtain observation and action specifications. For this example, load the
environment used in the example “Train DDPG Agent to Control Double Integrator System”. The
observation from the environment is a vector containing the position and velocity of a mass. The


action is a scalar representing a force, applied to the mass, ranging continuously from -2 to 2
Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

A SAC agent uses two Q-value function critics. To approximate each Q-value function, use a neural
network. The network for a single-output Q-value function critic must have two input layers, one for
the observation and the other for the action, and return a scalar value representing the expected
cumulative long-term reward following from the given observation and action.

Define each network path as an array of layer objects, and get the dimensions of the observation and action spaces from the environment specification objects.
% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="obsPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="obsPathOut")
];

% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="actPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="actPathOut")
];

% Common path
commonPath = [
concatenationLayer(1,2,Name="concat")
reluLayer
fullyConnectedLayer(1)
];

% Add layers to layergraph object


criticNet = layerGraph;
criticNet = addLayers(criticNet,obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);

% Connect layers
criticNet = connectLayers(criticNet,"obsPathOut","concat/in1");
criticNet = connectLayers(criticNet,"actPathOut","concat/in2");

To initialize the network weights differently for the two critics, create two different dlnetwork objects. You must do this because the agent constructor function does not accept two identical critics.
criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);

Display the number of weights.


summary(criticNet1)


Initialized: true

Number of learnables: 1.2k

Inputs:
1 'obsPathIn' 2 features
2 'actPathIn' 1 features

Create the two critics using rlQValueFunction, using the two networks with different weights.
Alternatively, if you use exactly the same network with the same weights, you must explicitly initialize
the network each time (to make sure weights are initialized differently) before passing it to
rlQValueFunction. To do so, use initialize.
critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo, ...
ActionInputNames="actPathIn",ObservationInputNames="obsPathIn");

critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo, ...


ActionInputNames="actPathIn",ObservationInputNames="obsPathIn");

Check the critics with a random observation and action input.


getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
-0.1330

getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
-0.1526

To approximate the policy within the actor, use a deep neural network. Since SAC agents use a
continuous Gaussian actor, the network must take the observation signal as input and return both a
mean value and a standard deviation value for each action. Therefore it must have two output layers
(one for the mean values, the other for the standard deviation values), each having as many elements
as the dimension of the action space.

Do not add a tanhLayer or scalingLayer in the mean output path. The SAC agent internally
transforms the unbounded Gaussian distribution to the bounded distribution to compute the
probability density function and entropy properly.

Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input and output
layers, so you can later explicitly associate them with the appropriate channel.
% Define common input path
commonPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netObsIn")
fullyConnectedLayer(400)
reluLayer(Name="CommonRelu")];

% Define path for mean value


meanPath = [
fullyConnectedLayer(300,Name="meanIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension),Name="MeanOut")
];


% Define path for standard deviation


stdPath = [
fullyConnectedLayer(300,Name="stdIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
softplusLayer(Name="StandardDeviationOut")];

% Add layers to layerGraph object


actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);

% Connect layers
actorNet = connectLayers(actorNet,"CommonRelu","meanIn/in");
actorNet = connectLayers(actorNet,"CommonRelu","stdIn/in");

% Convert to dlnetwork and display the number of weights.


actorNet = dlnetwork(actorNet);
summary(actorNet)

Initialized: true

Number of learnables: 242.4k

Inputs:
1 'netObsIn' 2 features

Create the actor using actorNet, the observation and action specification objects, and the names of
the input and output layers.

actor = rlContinuousGaussianActor(actorNet, obsInfo, actInfo, ...


ActionMeanOutputNames="MeanOut",...
ActionStandardDeviationOutputNames="StandardDeviationOut",...
ObservationInputNames="netObsIn");

Check your actor with a random input observation.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-0.2589]}

Specify training options for the critics.

criticOptions = rlOptimizerOptions( ...


Optimizer="adam", ...
LearnRate=1e-3,...
GradientThreshold=1, ...
L2RegularizationFactor=2e-4);

Specify training options for the actor.

actorOptions = rlOptimizerOptions( ...


Optimizer="adam", ...
LearnRate=1e-3,...
GradientThreshold=1, ...
L2RegularizationFactor=1e-5);


Specify agent options, including training options for actor and critics.

agentOptions = rlSACAgentOptions;
agentOptions.SampleTime = env.Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.MiniBatchSize = 32;

agentOptions.CriticOptimizerOptions = criticOptions;
agentOptions.ActorOptimizerOptions = actorOptions;

Create SAC agent using actor, critics, and options.

agent = rlSACAgent(actor,[critic1 critic2],agentOptions);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension)})

ans = 1x1 cell array


{[0.2504]}

You can now test and train the agent within the environment.
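
For example, a training sketch might look as follows. The training options shown here are illustrative rather than tuned for this environment.

trainOpts = rlTrainingOptions( ...
    MaxEpisodes=1000, ...
    MaxStepsPerEpisode=200, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=-66);
trainingStats = train(agent,env,trainOpts);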

Create SAC Agent using Recurrent Neural Networks

For this example, load the environment used in the example “Train DDPG Agent to Control Double
Integrator System”. The observation from the environment is a vector containing the position and
velocity of a mass. The action is a scalar representing a force, applied to the mass, ranging
continuously from -2 to 2 Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

A SAC agent uses two Q-value function critics. To approximate each Q-value function, use a neural
network. The network for a single-output Q-value function critic must have two input layers, one for
the observation and the other for the action, and return a scalar value representing the expected
cumulative long-term reward following from the given observation and action.

Define each network path as an array of layer objects, and get the dimensions of the observation and action spaces from the environment specification objects. To create a recurrent neural network, use
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.

% Define observation path


obsPath = [
sequenceInputLayer(prod(obsInfo.Dimension),Name="obsIn")
fullyConnectedLayer(400)
reluLayer
fullyConnectedLayer(300,Name = "obsOut")
];


% Define action path


actPath = [
sequenceInputLayer(prod(actInfo.Dimension),Name="actIn")
fullyConnectedLayer(300,Name="actOut")
];

% Define common path


commonPath = [
concatenationLayer(1,2,Name="cat")
lstmLayer(16)
reluLayer
fullyConnectedLayer(1)
];

% Add layers to layergraph object


criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);

% Connect paths
criticNet = connectLayers(criticNet,"obsOut","cat/in1");
criticNet = connectLayers(criticNet,"actOut","cat/in2");

To initialize the network weights differently for the two critics, create two different dlnetwork objects. You must do this because the agent constructor function does not accept two identical critics.

criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);

Display the number of weights.

summary(criticNet1)

Initialized: true

Number of learnables: 161.6k

Inputs:
1 'obsIn' Sequence input with 2 dimensions
2 'actIn' Sequence input with 1 dimensions

Create the critics using rlQValueFunction. Use the same network structure for both critics. The
SAC agent initializes the two networks using different default parameters.

critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo);
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo);

Check the critics with a random observation and action input.

getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
-0.0020

getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
0.0510


To approximate the policy within the actor, use a deep neural network. Since the critic has a
recurrent network, the actor must have a recurrent network too. The network must have two output
layers (one for the mean values, the other for the standard deviation values), each having as many
elements as the dimension of the action space.

Do not add a tanhLayer or scalingLayer in the mean output path. The SAC agent internally
transforms the unbounded Gaussian distribution to the bounded distribution to compute the
probability density function and entropy properly.

Define each network path as an array of layer objects and specify a name for the input and output
layers, so you can later explicitly associate them with the appropriate channel.

% Define common path


commonPath = [
sequenceInputLayer(prod(obsInfo.Dimension),Name="obsIn")
fullyConnectedLayer(400)
lstmLayer(8)
reluLayer(Name="CommonOut")];

meanPath = [
fullyConnectedLayer(300,Name="MeanIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension),Name="Mean")
];

stdPath = [
fullyConnectedLayer(300,Name="StdIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
softplusLayer(Name="StandardDeviation")];

actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);

actorNet = connectLayers(actorNet,"CommonOut","MeanIn/in");
actorNet = connectLayers(actorNet,"CommonOut","StdIn/in");

% Convert to dlnetwork and display the number of weights.


actorNet = dlnetwork(actorNet);
summary(actorNet)

Initialized: true

Number of learnables: 20.2k

Inputs:
1 'obsIn' Sequence input with 2 dimensions

Create the actor using actorNet.

actor = rlContinuousGaussianActor(actorNet, obsInfo, actInfo, ...


ActionMeanOutputNames="Mean",...
ActionStandardDeviationOutputNames="StandardDeviation",...
ObservationInputNames="obsIn");

Check your actor with a random input observation.


getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-0.3476]}

Specify training options for the critics.

criticOptions = rlOptimizerOptions( ...


Optimizer = "adam", LearnRate = 1e-3,...
GradientThreshold = 1, L2RegularizationFactor = 2e-4);

Specify training options for the actor.

actorOptions = rlOptimizerOptions( ...


Optimizer = "adam", LearnRate = 1e-3,...
GradientThreshold = 1, L2RegularizationFactor = 1e-5);

Specify agent options. To use a recurrent neural network, you must specify a SequenceLength
greater than 1.

agentOptions = rlSACAgentOptions;
agentOptions.SampleTime = env.Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.SequenceLength = 32;
agentOptions.MiniBatchSize = 32;

Create SAC agent using actor, critics, and options.

agent = rlSACAgent(actor,[critic1 critic2],agentOptions);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0.7990]}

You can now test and train the agent within the environment.

Version History
Introduced in R2020b

See Also
rlAgentInitializationOptions | rlSACAgentOptions | rlQValueFunction |
rlContinuousGaussianActor | initialize | Deep Network Designer

Topics
“Soft Actor-Critic Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”


rlSACAgentOptions
Options for SAC agent

Description
Use an rlSACAgentOptions object to specify options for soft actor-critic (SAC) agents. To create a
SAC agent, use rlSACAgent.

For more information, see “Soft Actor-Critic Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation
Syntax
opt = rlSACAgentOptions
opt = rlSACAgentOptions(Name,Value)

Description

opt = rlSACAgentOptions creates an options object for use as an argument when creating a SAC
agent using all default options. You can modify the object properties using dot notation.

opt = rlSACAgentOptions(Name,Value) sets option properties using name-value


pairs. For example, rlSACAgentOptions('DiscountFactor',0.95) creates an option set with a
discount factor of 0.95. You can specify multiple name-value pairs. Enclose each property name in
quotes.

Properties
EntropyWeightOptions — Entropy tuning options
EntropyWeightOptions object

Entropy tuning options, specified as an EntropyWeightOptions object with the following


properties.

EntropyWeight — Initial entropy component weight


1 (default) | positive scalar

Initial entropy component weight, specified as a positive scalar.

LearnRate — Optimizer learning rate


3e-4 (default) | nonnegative scalar

Optimizer learning rate, specified as a nonnegative scalar. If LearnRate is zero, the EntropyWeight
value is fixed during training and the TargetEntropy value is ignored.


TargetEntropy — Target entropy value


[] (default) | scalar

Target entropy value for tuning entropy weight, specified as a scalar. A higher target entropy value
encourages more exploration.

If you do not specify TargetEntropy, the agent uses –A as the target value, where A is the number
of actions.
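
For example, the following sketch sets the target entropy to the negative of the action dimension; here actInfo is assumed to be the action specification of your environment.

opt = rlSACAgentOptions;
opt.EntropyWeightOptions.TargetEntropy = -prod(actInfo.Dimension);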

Algorithm — Algorithm to tune entropy


"adam" (default) | "sgdm" | "rmsprop"

Algorithm to tune entropy, specified as one of the following strings.

• "adam" — Use the Adam optimizer. You can specify the decay rates of the gradient and squared
gradient moving averages using the GradientDecayFactor and
SquaredGradientDecayFactor fields of the OptimizerParameters option.
• "sgdm" — Use the stochastic gradient descent with momentum (SGDM) optimizer. You can specify
the momentum value using the Momentum field of the OptimizerParameters option.
• "rmsprop" — Use the RMSProp optimizer. You can specify the decay rate of the squared gradient
moving average using the SquaredGradientDecayFactor fields of the OptimizerParameters
option.

For more information about these optimizers, see “Stochastic Gradient Descent” in Deep Learning
Toolbox.

GradientThreshold — Threshold value for gradient


Inf (default) | positive scalar

Threshold value for the entropy gradient, specified as Inf or a positive scalar. If the gradient exceeds
this value, the gradient is clipped.

OptimizerParameters — Applicable parameters for optimizer


OptimizerParameters object

Applicable parameters for the optimizer, specified as an OptimizerParameters object with the
following parameters. The default parameter values work well for most problems.

• Momentum — Contribution of previous step, specified as a scalar from 0 to 1. A value of 0 means no contribution from the previous step. A value of 1 means maximal contribution. This parameter applies only when Optimizer is "sgdm". Default: 0.9
• Epsilon — Denominator offset, specified as a positive scalar. The optimizer adds this offset to the denominator in the network parameter updates to avoid division by zero. This parameter applies only when Optimizer is "adam" or "rmsprop". Default: 1e-8
• GradientDecayFactor — Decay rate of gradient moving average, specified as a positive scalar from 0 to 1. This parameter applies only when Optimizer is "adam". Default: 0.9
• SquaredGradientDecayFactor — Decay rate of squared gradient moving average, specified as a positive scalar from 0 to 1. This parameter applies only when Optimizer is "adam" or "rmsprop". Default: 0.999

When a particular property of OptimizerParameters is not applicable to the optimizer type


specified in the Algorithm option, that property is set to "Not applicable".

To change the default values, access the properties of OptimizerParameters using dot notation.
opt = rlSACAgentOptions;
opt.EntropyWeightOptions.OptimizerParameters.GradientDecayFactor = 0.95;

PolicyUpdateFrequency — Number of steps between actor policy updates


1 (default) | positive integer

Number of steps between actor policy updates, specified as a positive integer. For more information,
see “Training Algorithm”.

CriticUpdateFrequency — Number of steps between critic updates


1 (default) | positive integer

Number of steps between critic updates, specified as a positive integer. For more information, see
“Training Algorithm”.

NumWarmStartSteps — Number of actions to take before updating actor and critic


positive integer

Number of actions to take before updating actor and critics, specified as a positive integer. By
default, the NumWarmStartSteps value is equal to the MiniBatchSize value.

NumGradientStepsPerUpdate — Number of gradient steps when updating actor and critics


1 (default) | positive integer


Number of gradient steps to take when updating actor and critics, specified as a positive integer.

ActorOptimizerOptions — Actor optimizer options


rlOptimizerOptions object

Actor optimizer options, specified as an rlOptimizerOptions object. It allows you to specify


training parameters of the actor approximator such as learning rate, gradient threshold, as well as
the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and
rlOptimizer.

CriticOptimizerOptions — Critic optimizer options


rlOptimizerOptions object

Critic optimizer options, specified as an rlOptimizerOptions object. It allows you to specify


training parameters of the critic approximator such as learning rate, gradient threshold, as well as
the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and
rlOptimizer.

TargetSmoothFactor — Smoothing factor for target critic updates


1e-3 (default) | positive scalar less than or equal to 1

Smoothing factor for target critic updates, specified as a positive scalar less than or equal to 1. For
more information, see “Target Update Methods”.

TargetUpdateFrequency — Number of steps between target critic updates


1 (default) | positive integer

Number of steps between target critic updates, specified as a positive integer. For more information,
see “Target Update Methods”.

ResetExperienceBufferBeforeTraining — Option for clearing the experience buffer


true (default) | false

Option for clearing the experience buffer before training, specified as a logical value.

SequenceLength — Maximum batch-training trajectory length when using RNN


1 (default) | positive integer

Maximum batch-training trajectory length when using a recurrent neural network, specified as a
positive integer. This value must be greater than 1 when using a recurrent neural network and 1
otherwise.
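
For example, a sketch for an agent whose actor and critics use recurrent networks:

opt = rlSACAgentOptions;
opt.SequenceLength = 16;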

MiniBatchSize — Size of random experience mini-batch


64 (default) | positive integer

Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the actor and critics. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.

NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer

Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see [1], Chapter 7.


Note that if parallel training is enabled (that is if an rlTrainingOptions option object in which the
UseParallel property is set to true is passed to train) then NumStepsToLookAhead must be set
to 1, otherwise an error is generated. This guarantees that experiences are stored contiguously.
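
For example, the following sketch keeps the default one-step look-ahead while enabling parallel training; agent and env are assumed to already exist.

agentOpts = rlSACAgentOptions('NumStepsToLookAhead',1);
trainOpts = rlTrainingOptions('UseParallel',true);
trainingStats = train(agent,env,trainOpts);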

ExperienceBufferLength — Experience buffer size


10000 (default) | positive integer

Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.

SampleTime — Sample time of agent


1 (default) | positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

DiscountFactor — Discount factor


0.99 (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.

Object Functions
rlSACAgent Soft actor-critic reinforcement learning agent

Examples

Create SAC Agent Options Object

Create a SAC agent options object, specifying the discount factor.

opt = rlSACAgentOptions('DiscountFactor',0.95)

opt =
rlSACAgentOptions with properties:

EntropyWeightOptions: [1x1 rl.option.EntropyWeightOptions]


PolicyUpdateFrequency: 1
CriticUpdateFrequency: 1
NumWarmStartSteps: 64
NumGradientStepsPerUpdate: 1
ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]


CriticOptimizerOptions: [1x2 rl.option.rlOptimizerOptions]


TargetSmoothFactor: 1.0000e-03
TargetUpdateFrequency: 1
ResetExperienceBufferBeforeTraining: 1
SequenceLength: 1
MiniBatchSize: 64
NumStepsToLookAhead: 1
ExperienceBufferLength: 10000
SampleTime: 1
DiscountFactor: 0.9500
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;

For SAC agents, configure the entropy weight optimizer using the options in
EntropyWeightOptions. For example, set the target entropy value to –5.
opt.EntropyWeightOptions.TargetEntropy = -5;

Version History
Introduced in R2020b

Simulation and deployment: UseDeterministicExploitation will be removed


Warns starting in R2022a

The property UseDeterministicExploitation of the rlSACAgentOptions object will be


removed in a future release. Use the UseExplorationPolicy property of rlSACAgent instead.

Previously, you set UseDeterministicExploitation as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.
agent.AgentOptions.UseDeterministicExploitation = true;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.
agent.AgentOptions.UseDeterministicExploitation = false;

Starting in R2022a, set UseExplorationPolicy as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.
agent.UseExplorationPolicy = false;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.
agent.UseExplorationPolicy = true;

Similarly to UseDeterministicExploitation, UseExplorationPolicy affects only simulation


and deployment; it does not affect training.


References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.

See Also
rlSACAgent

Topics
“Soft Actor-Critic Agents”


rlSARSAAgent
SARSA reinforcement learning agent

Description
The SARSA algorithm is a model-free, online, on-policy reinforcement learning method. A SARSA
agent is a value-based reinforcement learning agent which trains a critic to estimate the return or
future rewards.

For more information on SARSA agents, see “SARSA Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
agent = rlSARSAAgent(critic,agentOptions)

Description

agent = rlSARSAAgent(critic,agentOptions) creates a SARSA agent with the specified critic


network and sets the AgentOptions property.

Input Arguments

critic — Critic
rlQValueFunction object

Critic, specified as an rlQValueFunction object. For more information on creating critics, see
“Create Policies and Value Functions”.

Properties
AgentOptions — Agent options
rlSARSAAgentOptions object

Agent options, specified as an rlSARSAAgentOptions object.

UseExplorationPolicy — Option to use exploration policy


false (default) | true

Option to use exploration policy when selecting actions, specified as one of the following logical
values.

• false — Use the agent greedy policy when selecting actions.


• true — Use the agent exploration policy when selecting actions.

ObservationInfo — Observation specifications


specification object

This property is read-only.

Observation specifications, specified as a reinforcement learning specification object defining


properties such as dimensions, data type, and name of the observation signal.

The value of ObservationInfo matches the corresponding value specified in critic.

ActionInfo — Action specification


rlFiniteSetSpec object

This property is read-only.

Action specification, specified as an rlFiniteSetSpec object.

The value of ActionInfo matches the corresponding value specified in critic.

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The initial value of SampleTime matches the value specified in
AgentOptions.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object

Examples


Create a SARSA Agent

Create or load an environment interface. For this example load the Basic Grid World environment
interface also used in the example “Train Reinforcement Learning Agent in Basic Grid World”.

env = rlPredefinedEnv("BasicGridWorld");

Get observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create a table approximation model derived from the environment observation and action
specifications.

qTable = rlTable(obsInfo,actInfo);

Create the critic using qTable. SARSA agents use an rlQValueFunction object to implement the
critic.

critic = rlQValueFunction(qTable,obsInfo,actInfo);

Create a SARSA agent using the specified critic and an epsilon value of 0.05.

opt = rlSARSAAgentOptions;
opt.EpsilonGreedyExploration.Epsilon = 0.05;

agent = rlSARSAAgent(critic,opt)

agent =
rlSARSAAgent with properties:

AgentOptions: [1x1 rl.option.rlSARSAAgentOptions]


UseExplorationPolicy: 0
ObservationInfo: [1x1 rl.util.rlFiniteSetSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 1

To check your agent, use getAction to return the action from a random observation.

act = getAction(agent,{randi(numel(obsInfo.Elements))});
act{1}

ans = 1

You can now test and train the agent against the environment.
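As a next step, the following minimal training sketch uses illustrative (not prescriptive) option values.

trainOpts = rlTrainingOptions(MaxEpisodes=200,MaxStepsPerEpisode=50); % illustrative values
trainingStats = train(agent,env,trainOpts);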

Version History
Introduced in R2019a

See Also
rlSARSAAgentOptions


Topics
“SARSA Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”


rlSARSAAgentOptions
Options for SARSA agent

Description
Use an rlSARSAAgentOptions object to specify options for creating SARSA agents. To create a
SARSA agent, use rlSARSAAgent.

For more information on SARSA agents, see “SARSA Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
opt = rlSARSAAgentOptions
opt = rlSARSAAgentOptions(Name,Value)

Description

opt = rlSARSAAgentOptions creates an rlSARSAAgentOptions object for use as an argument


when creating a SARSA agent using all default settings. You can modify the object properties using
dot notation.

opt = rlSARSAAgentOptions(Name,Value) sets option properties on page 3-324 using name-


value pairs. For example, rlSARSAAgentOptions('DiscountFactor',0.95) creates an option
set with a discount factor of 0.95. You can specify multiple name-value pairs. Enclose each property
name in quotes.

Properties
EpsilonGreedyExploration — Options for epsilon-greedy exploration
EpsilonGreedyExploration object

Options for epsilon-greedy exploration, specified as an EpsilonGreedyExploration object with the


following properties.


• Epsilon — Probability threshold to either randomly select an action or select the action that
  maximizes the state-action value function. A larger value of Epsilon means that the agent
  randomly explores the action space at a higher rate. Default value: 1
• EpsilonMin — Minimum value of Epsilon. Default value: 0.01
• EpsilonDecay — Decay rate. Default value: 0.0050

At the end of each training time step, if Epsilon is greater than EpsilonMin, then it is updated
using the following formula.

Epsilon = Epsilon*(1-EpsilonDecay)

If your agent converges on local optima too quickly, you can promote agent exploration by increasing
Epsilon.
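As a standalone illustration (not agent code), the following sketch applies the decay rule with the default values listed above and counts how many training steps pass before Epsilon stops decaying.

% Standalone sketch using the default Epsilon, EpsilonMin, and EpsilonDecay values.
Epsilon = 1; EpsilonMin = 0.01; EpsilonDecay = 0.0050;
nSteps = 0;
while Epsilon > EpsilonMin
    Epsilon = Epsilon*(1-EpsilonDecay);  % decay applied at the end of each training step
    nSteps = nSteps + 1;
end
nSteps  % roughly 900 steps before Epsilon is no longer greater than EpsilonMin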

To specify exploration options, use dot notation after creating the rlSARSAAgentOptions object
opt. For example, set the epsilon value to 0.9.
opt.EpsilonGreedyExploration.Epsilon = 0.9;

CriticOptimizerOptions — Critic optimizer options


rlOptimizerOptions object

Critic optimizer options, specified as an rlOptimizerOptions object. It allows you to specify


training parameters of the critic approximator such as learning rate, gradient threshold, as well as
the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and
rlOptimizer.
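For example, the following minimal sketch (with assumed, illustrative values) configures the critic optimizer and passes it to the agent options.

criticOpts = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1); % assumed values
opt = rlSARSAAgentOptions(CriticOptimizerOptions=criticOpts);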

SampleTime — Sample time of agent


1 (default) | positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

DiscountFactor — Discount factor


0.99 (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.


Object Functions
rlSARSAAgent SARSA reinforcement learning agent

Examples

Create a SARSA Agent Options Object

This example shows how to create a SARSA agent option object.

Create an rlSARSAAgentOptions object that specifies the agent sample time.

opt = rlSARSAAgentOptions('SampleTime',0.5)

opt =
rlSARSAAgentOptions with properties:

EpsilonGreedyExploration: [1x1 rl.option.EpsilonGreedyExploration]


CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 0.5000
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent discount factor to 0.95.

opt.DiscountFactor = 0.95;

Version History
Introduced in R2019a

See Also
Topics
“SARSA Agents”


rlSimulationOptions
Options for simulating a reinforcement learning agent within an environment

Description
Use an rlSimulationOptions object to specify simulation options for simulating a reinforcement
learning agent within an environment. To perform the simulation, use sim.

For more information on agents training and simulation, see “Train Reinforcement Learning Agents”.

Creation
Syntax
simOpts = rlSimulationOptions
opt = rlSimulationOptions(Name,Value)

Description

simOpts = rlSimulationOptions returns the default options for simulating a reinforcement


learning environment against an agent. Use simulation options to specify parameters about the
simulation such as the maximum number of steps to run per simulation and the number of
simulations to run. After configuring the options, use simOpts as an input argument for sim.

opt = rlSimulationOptions(Name,Value) creates a simulation options set with the specified


“Properties” on page 3-327 using one or more name-value pair arguments.

Properties
MaxSteps — Number of steps to run the simulation
500 (default) | positive integer

Number of steps to run the simulation, specified as the comma-separated pair consisting of
'MaxSteps' and a positive integer. In general, you define episode termination conditions in the
environment. This value is the maximum number of steps to run in the simulation if those termination
conditions are not met.
Example: 'MaxSteps',1000

NumSimulations — Number of simulations


1 (default) | positive integer

Number of simulations to run, specified as the comma-separated pair consisting of


'NumSimulations' and a positive integer. At the start of each simulation, sim resets the
environment. You specify what happens on environment reset when you create the environment. For
instance, resetting the environment at the start of each episode can include randomizing initial state
values, if you configure your environment to do so. In that case, running multiple simulations allows
you to validate performance of a trained agent over a range of initial conditions.


Example: 'NumSimulations',10

StopOnError — Stop simulation when error occurs


"on" (default) | "off"

Stop simulation when an error occurs, specified as "off" or "on". When this option is "off", errors
are captured and returned in the SimulationInfo output of sim, and simulation continues.
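For example, the following minimal sketch (env and agent are assumed to already exist) keeps simulating after an error occurs.

% Sketch; env and agent are assumed to already exist.
simOpts = rlSimulationOptions(StopOnError="off",NumSimulations=5);
experience = sim(env,agent,simOpts);
% Any captured error is returned in the SimulationInfo field of the output
% instead of stopping the remaining simulations.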

UseParallel — Flag for using parallel simulation


false (default) | true

Flag for using parallel simulation, specified as a logical. Setting this option to true configures the
simulation to use parallel processing to simulate the environment, thereby enabling usage of multiple
cores, processors, computer clusters or cloud resources to speed up simulation. To specify options for
parallel simulation, use the ParallelizationOptions property.

Note that if you want to speed up deep neural network calculations (such as gradient computation,
parameter update and prediction) using a local GPU you do not need to set UseParallel to true.
Instead, when creating your actor or critic representation, use an rlRepresentationOptions
object in which the UseDevice option is set to "gpu".

Using parallel computing or the GPU requires Parallel Computing Toolbox software. Using computer
clusters or cloud resources additionally requires MATLAB Parallel Server™.

For more information about training using multicore processors and GPUs, see “Train Agents Using
Parallel Computing and GPUs”.
Example: 'UseParallel',true

ParallelizationOptions — Options to control parallel simulation


ParallelTraining object

Parallelization options to control parallel simulation, specified as a ParallelTraining object. For


more information about training using parallel computing, see “Train Reinforcement Learning
Agents”.

The ParallelTraining object has the following properties, which you can modify using dot
notation after creating the rlSimulationOptions object.

WorkerRandomSeeds — Randomizer initialization for workers


–1 (default) | –2 | vector

Randomizer initialization for workers, specified as one the following:

• –1 — Assign a unique random seed to each worker. The value of the seed is the worker ID.
• –2 — Do not assign a random seed to the workers.
• Vector — Manually specify the random seed for each worker. The number of elements in the vector
must match the number of workers.

TransferBaseWorkspaceVariables — Send model and workspace variables to parallel


workers
"on" (default) | "off"


Send model and workspace variables to parallel workers, specified as "on" or "off". When the
option is "on", the host sends variables used in models and defined in the base MATLAB workspace
to the workers.

AttachedFiles — Additional files to attach to the parallel pool


[] (default) | string | string array

Additional files to attach to the parallel pool, specified as a string or string array.

SetupFcn — Function to run before simulation starts


[] (default) | function handle

Function to run before simulation starts, specified as a handle to a function having no input
arguments. This function is run once per worker before simulation begins. Write this function to
perform any processing that you need prior to simulation.

CleanupFcn — Function to run after simulation ends


[] (default) | function handle

Function to run after simulation ends, specified as a handle to a function having no input arguments.
You can write this function to clean up the workspace or perform other processing after simulation
terminates.
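For example, the following minimal sketch (the attached file name is a placeholder) configures a parallel simulation using dot notation.

% Sketch; requires Parallel Computing Toolbox. The MAT file name is a placeholder.
simOpts = rlSimulationOptions(UseParallel=true,NumSimulations=8);
simOpts.ParallelizationOptions.WorkerRandomSeeds = -1;              % unique seed per worker
simOpts.ParallelizationOptions.AttachedFiles = "myEnvData.mat";     % placeholder file
simOpts.ParallelizationOptions.SetupFcn = @() disp("worker ready"); % runs once per worker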

Object Functions
sim Simulate trained reinforcement learning agents within specified environment

Examples

Configure Options for Simulation

Create an options set for simulating a reinforcement learning environment. Set the number of steps
to simulate to 1000, and configure the options to run three simulations.

You can set the options using Name,Value pairs when you create the options set. Any options that you
do not explicitly set have their default values.

simOpts = rlSimulationOptions(...
'MaxSteps',1000,...
'NumSimulations',3)

simOpts =
rlSimulationOptions with properties:

MaxSteps: 1000
NumSimulations: 3
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelSimulation]

Alternatively, create a default options set and use dot notation to change some of the values.

simOpts = rlSimulationOptions;
simOpts.MaxSteps = 1000;


simOpts.NumSimulations = 3;

simOpts

simOpts =
rlSimulationOptions with properties:

MaxSteps: 1000
NumSimulations: 3
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelSimulation]

Version History
Introduced in R2019a

See Also
Topics
“Reinforcement Learning Agents”


rlStochasticActorPolicy
Policy object to generate stochastic actions for custom training loops and application deployment

Description
This object implements a stochastic policy, which returns stochastic actions given an input
observation, according to a probability distribution. You can create an rlStochasticActorPolicy
object from an rlDiscreteCategoricalActor or rlContinuousGaussianActor, or extract it
from an rlPGAgent, rlACAgent, rlPPOAgent, rlTRPOAgent, or rlSACAgent. You can then train
the policy object using a custom training loop or deploy it for your application using
generatePolicyBlock or generatePolicyFunction. If UseMaxLikelihoodAction is set to true,
the policy is deterministic and therefore does not explore. For more information on policies
and value functions, see “Create Policies and Value Functions”.

Creation

Syntax
policy = rlStochasticActorPolicy(actor)

Description

policy = rlStochasticActorPolicy(actor) creates the stochastic policy object policy from


the continuous Gaussian or discrete categorical actor actor. It also sets the Actor property of
policy to the input argument actor.

Properties
Actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object

Actor, specified as an rlContinuousGaussianActor or rlDiscreteCategoricalActor object.

UseMaxLikelihoodAction — Option to enable maximum likelihood action


false (default) | true

Option to enable the maximum likelihood action, specified as a logical value: either false (default,
the action is sampled from the probability distribution, which helps exploration) or true (the
maximum likelihood action is always used). When the maximum likelihood action is always used,
the policy is deterministic and therefore does not explore.
Example: false

ObservationInfo — Observation specifications


rlFiniteSetSpec object | rlNumericSpec object | array


Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


of such objects. These objects define properties such as the dimensions, data types, and names of the
observation signals.

ActionInfo — Action specifications


rlFiniteSetSpec object | rlNumericSpec object

Action specifications, specified as an rlFiniteSetSpec or rlNumericSpec object. This object


defines the properties of the environment action channel, such as its dimensions, data type, and
name. Note that the name of the action channel specified in actionInfo (if any) is not used.

Note Only one action channel is allowed.

SampleTime — Sample time of policy


positive scalar | -1 (default)

Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.

Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2

Object Functions
generatePolicyBlock Generate Simulink block that evaluates policy of an agent or policy object
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object

Examples

Create Stochastic Actor Policy from Discrete Categorical Actor

Create observation and action specification objects. For this example, define a continuous four-
dimensional observation space and a discrete action space having two possible actions.

obsInfo = rlNumericSpec([4 1]);


actInfo = rlFiniteSetSpec([-1 1]);


Alternatively, use getObservationInfo and getActionInfo to extract the specification objects
from an environment.

Create a discrete categorical actor. This actor must accept an observation as input and return an
output vector in which each element represents the probability of taking the corresponding action.

To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation space and the number
of possible actions from the environment specification objects.
layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];

Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)

Initialized: true

Number of learnables: 114

Inputs:
1 'input' 4 features

Create the actor using model, and the observation and action specifications.
actor = rlDiscreteCategoricalActor(model,obsInfo,actInfo)

actor =
rlDiscreteCategoricalActor with properties:

Distribution: [1x1 rl.distribution.rlDiscreteGaussianDistribution]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
UseDevice: "cpu"

To return the probability distribution of the possible actions as a function of a random observation,
and given the current network weights, use evaluate.
prb = evaluate(actor,{rand(obsInfo.Dimension)});
prb{1}

ans = 2x1 single column vector

0.5850
0.4150

Create a policy object from actor.


policy = rlStochasticActorPolicy(actor)

policy =
rlStochasticActorPolicy with properties:


Actor: [1x1 rl.function.rlDiscreteCategoricalActor]


UseMaxLikelihoodAction: 0
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: -1

You can access the policy options using dot notation. For example, set the option to always use the
maximum likelihood action, thereby making the policy deterministic.

policy.UseMaxLikelihoodAction = true

policy =
rlStochasticActorPolicy with properties:

Actor: [1x1 rl.function.rlDiscreteCategoricalActor]


UseMaxLikelihoodAction: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: -1

Check the policy with a random observation input.

act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}

ans = -1

You can now train the policy with a custom training loop and then deploy it to your application.
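As a minimal deployment sketch, you can also generate a standalone policy evaluation function (suitable for code generation) directly from the policy object; the names of the generated files depend on your setup.

generatePolicyFunction(policy);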

Version History
Introduced in R2022a

See Also
Functions
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlAdditiveNoisePolicy | rlStochasticActorPolicy | rlDiscreteCategoricalActor |
rlContinuousGaussianActor | rlPGAgent | rlACAgent | rlSACAgent | rlPPOAgent |
rlTRPOAgent | generatePolicyBlock | generatePolicyFunction

Blocks
RL Agent

Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”


rlStochasticActorRepresentation
(Not recommended) Stochastic actor representation for reinforcement learning agents

Note rlStochasticActorRepresentation is not recommended. Use either


rlDiscreteCategoricalActor or rlContinuousGaussianActor instead. For more information,
see “rlStochasticActorRepresentation is not recommended”.

Description
This object implements a function approximator to be used as a stochastic actor within a
reinforcement learning agent. A stochastic actor takes the observations as inputs and returns a
random action, thereby implementing a stochastic policy with a specific probability distribution. After
you create an rlStochasticActorRepresentation object, use it to create a suitable agent, such
as an rlACAgent or rlPGAgent agent. For more information on creating representations, see
“Create Policies and Value Functions”.

Creation
Syntax
discActor = rlStochasticActorRepresentation(net,observationInfo,
discActionInfo,'Observation',obsName)
discActor = rlStochasticActorRepresentation({basisFcn,W0},observationInfo,
actionInfo)
discActor = rlStochasticActorRepresentation( ___ ,options)

contActor = rlStochasticActorRepresentation(net,observationInfo,
contActionInfo,'Observation',obsName)
contActor = rlStochasticActorRepresentation( ___ ,options)

Description
Discrete Action Space Stochastic Actor

discActor = rlStochasticActorRepresentation(net,observationInfo,
discActionInfo,'Observation',obsName) creates a stochastic actor with a discrete action
space, using the deep neural network net as function approximator. Here, the output layer of net
must have as many elements as the number of possible discrete actions. This syntax sets the
ObservationInfo and ActionInfo properties of discActor to the inputs observationInfo and
discActionInfo, respectively. obsName must contain the names of the input layers of net.

discActor = rlStochasticActorRepresentation({basisFcn,W0},observationInfo,
actionInfo) creates a discrete space stochastic actor using a custom basis function as underlying
approximator. The first input argument is a two-elements cell in which the first element contains the
handle basisFcn to a custom basis function, and the second element contains the initial weight
matrix W0. This syntax sets the ObservationInfo and ActionInfo properties of discActor to the inputs
observationInfo and actionInfo, respectively.


discActor = rlStochasticActorRepresentation( ___ ,options) creates the discrete action


space, stochastic actor discActor using the additional options set options, which is an
rlRepresentationOptions object. This syntax sets the Options property of discActor to the
options input argument. You can use this syntax with any of the previous input-argument
combinations.
Continuous Action Space Gaussian Actor

contActor = rlStochasticActorRepresentation(net,observationInfo,
contActionInfo,'Observation',obsName) creates a Gaussian stochastic actor with a
continuous action space using the deep neural network net as function approximator. Here, the
output layer of net must have twice as many elements as the number of dimensions of the continuous
action space. This syntax sets the ObservationInfo and ActionInfo properties of contActor to the
inputs observationInfo and contActionInfo respectively. obsName must contain the names of
the input layers of net.

Note contActor does not enforce constraints set by the action specification, therefore, when using
this actor, you must enforce action space constraints within the environment.

contActor = rlStochasticActorRepresentation( ___ ,options) creates the continuous


action space, Gaussian actor contActor using the additional options option set, which is an
rlRepresentationOptions object. This syntax sets the Options property of contActor to the
options input argument. You can use this syntax with any of the previous input-argument
combinations.

Input Arguments

net — Deep neural network


array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object |
dlNetwork object

Deep neural network used as the underlying approximator within the actor, specified as one of the
following:

• Array of Layer objects


• layerGraph object
• DAGNetwork object
• SeriesNetwork object
• dlnetwork object

For a discrete action space stochastic actor, net must have the observations as input and a single
output layer having as many elements as the number of possible discrete actions. Each element
represents the probability (which must be nonnegative) of executing the corresponding action.

For a continuous action space stochastic actor, net must have the observations as input and a single
output layer having twice as many elements as the number of dimensions of the continuous action
space. The elements of the output vector represent all the mean values followed by all the standard
deviations (which must be nonnegative) of the Gaussian distributions for the dimensions of the action
space.

Note The fact that standard deviations must be nonnegative while mean values must fall within the
output range means that the network must have two separate paths. The first path must produce an


estimation for the mean values, so any output nonlinearity must be scaled so that its output falls in
the desired range. The second path must produce an estimation for the standard deviations, so you
must use a softplus or ReLU layer to enforce nonnegativity.

The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names specified in obsName. The network output layer must have the same data type and
dimension as the signal defined in ActionInfo.

rlStochasticActorRepresentation objects support recurrent deep neural networks.

For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.

obsName — Observation names


string | character vector | cell array of character vectors

Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the input layers in net.
Example: {'my_obs'}

basisFcn — Custom basis function


function handle

Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the actor is the vector a = softmax(W'*B), where W is a weight matrix and B is the column
vector returned by the custom basis function. Each element of a represents the probability of taking
the corresponding action. The learnable parameters of the actor are the elements of W.

When creating a stochastic actor representation, your basis function must have the following
signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here obs1 to obsN are observations in the same order and with the same data type and dimensions
as the signals defined in observationInfo.
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]

W0 — Initial value of the basis function weights


column vector

Initial value of the basis function weights, W, specified as a matrix. It must have as many rows as the
length of the basis function output, and as many columns as the number of possible actions.

Properties
Options — Representation options
rlRepresentationOptions object

Representation options, specified as an rlRepresentationOptions object. Available options


include the optimizer used for training and the learning rate.


Note TRPO agents use only the Options.UseDevice representation options and ignore the other
training and learning rate options.

ObservationInfo — Observation specifications


rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object


or an array containing a mix of such objects. These objects define properties such as dimensions, data
type, and names of the observation signals.

rlStochasticActorRepresentation sets the ObservationInfo property of contActor or


discActor to the input observationInfo.

You can extract observationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specifications


rlFiniteSetSpec object | rlNumericSpec object

Action specifications, specified as an rlFiniteSetSpec or rlNumericSpec object. These objects


define properties such as the dimensions, data type and name of the action signals.

For a discrete action space actor, rlStochasticActorRepresentation sets ActionInfo to the


input discActionInfo, which must be an rlFiniteSetSpec object.

For a continuous action space actor, rlStochasticActorRepresentation sets ActionInfo to the


input contActionInfo, which must be an rlNumericSpec object.

You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually.

For custom basis function representations, the action signal must be a scalar, a column vector, or a
discrete action.

Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment observations

Examples

Create Discrete Stochastic Actor from Deep Neural Network

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);


Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as consisting of
three values, -10, 0, and 10.

actInfo = rlFiniteSetSpec([-10 0 10]);

Create a deep neural network approximator for the actor. The input of the network (here called
state) must accept a four-element vector (the observation vector just defined by obsInfo), and its
output (here called actionProb) must be a three-element vector. Each element of the output vector
must be between 0 and 1 since it represents the probability of executing each of the three possible
actions (as defined by actInfo). Using softmax as the output layer enforces this requirement.

net = [ featureInputLayer(4,'Normalization','none',...
'Name','state')
fullyConnectedLayer(3,'Name','fc')
softmaxLayer('Name','actionProb') ];

Create the actor with rlStochasticActorRepresentation, using the network, the observations
and action specification objects, as well as the names of the network input layer.

actor = rlStochasticActorRepresentation(net,obsInfo,actInfo,...
'Observation','state')

actor =
rlStochasticActorRepresentation with properties:

ActionInfo: [1x1 rl.util.rlFiniteSetSpec]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To validate your actor, use getAction to return a random action from the observation vector [1 1 1
1], using the current network weights.

act = getAction(actor,{[1 1 1 1]});


act{1}

ans = 10

You can now use the actor to create a suitable agent, such as an rlACAgent, or rlPGAgent agent.

Create Continuous Stochastic Actor from Deep Neural Network

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous six-dimensional space, so that a single observation is a column vector containing 6
doubles.

obsInfo = rlNumericSpec([6 1]);

Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing 2 doubles both between
-10 and 10.


actInfo = rlNumericSpec([2 1],'LowerLimit',-10,'UpperLimit',10);

Create a deep neural network approximator for the actor. The observation input (here called myobs)
must accept a six-dimensional vector (the observation vector just defined by obsInfo). The output
(here called myact) must be a four-dimensional vector (twice the number of dimensions defined by
actInfo). The elements of the output vector represent, in sequence, all the means and all the
standard deviations of every action.

The fact that standard deviations must be non-negative while mean values must fall within the output
range means that the network must have two separate paths. The first path is for the mean values,
and any output nonlinearity must be scaled so that it can produce values in the output range. The
second path is for the standard deviations, and you can use a softplus or relu layer to enforce non-
negativity.

% input path layers (6 by 1 input and a 2 by 1 output)


inPath = [ imageInputLayer([6 1 1], ...
'Normalization','none',...
'Name','myobs')
fullyConnectedLayer(2,'Name','infc') ];

% path layers for mean value


% (2 by 1 input and 2 by 1 output)
% using scalingLayer to scale the range
meanPath = [ tanhLayer('Name','tanh'); % range: (-1,1)
scalingLayer('Name','scale',...
'Scale',actInfo.UpperLimit) ]; % range: (-10,10)

% path layers for standard deviations


% (2 by 1 input and output)
% using softplus layer to make it non negative
sdevPath = softplusLayer('Name', 'splus');

% concatenate two inputs (along dimension #3)


% to form a single (4 by 1) output layer
outLayer = concatenationLayer(3,2,'Name','mean&sdev');

% add layers to network object


net = layerGraph(inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = addLayers(net,outLayer);

% connect layers: the mean value path output MUST


% be connected to the FIRST input of the concatenationLayer

% connect output of inPath to meanPath input


net = connectLayers(net,'infc','tanh/in');
% connect output of inPath to sdevPath input
net = connectLayers(net,'infc','splus/in');
% connect output of meanPath to mean&sdev input #1
net = connectLayers(net,'scale','mean&sdev/in1');
% connect output of sdevPath to mean&sdev input #2
net = connectLayers(net,'splus','mean&sdev/in2');

Set some training options for the actor.

actorOpts = rlRepresentationOptions('LearnRate',8e-3,'GradientThreshold',1);


Create the actor with rlStochasticActorRepresentation, using the network, the observations
and action specification objects, the names of the network input layer and the options object.

actor = rlStochasticActorRepresentation(net, obsInfo, actInfo, 'Observation','myobs',actorOpts)

actor =
rlStochasticActorRepresentation with properties:

ActionInfo: [1x1 rl.util.rlNumericSpec]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To check your actor, use getAction to return a random action from the observation vector
ones(6,1), using the current network weights.

act = getAction(actor,{ones(6,1)});
act{1}

ans = 2x1 single column vector

-0.0763
9.6860

You can now use the actor to create a suitable agent (such as an rlACAgent, rlPGAgent, or
rlPPOAgent agent).

Create Stochastic Actor from Custom Basis Function

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous two-dimensional space, so that a single observation is a column vector containing two
doubles.

obsInfo = rlNumericSpec([2 1]);

The stochastic actor based on a custom basis function does not support continuous action spaces.
Therefore, create a discrete action space specification object (or alternatively use getActionInfo to
extract the specification object from an environment with a discrete action space). For this example,
define the action space as a finite set consisting of 3 possible values (named 7, 5, and 3 in this case).

actInfo = rlFiniteSetSpec([7 5 3]);

Create a custom basis function. Each element is a function of the observations defined by obsInfo.

myBasisFcn = @(myobs) [myobs(2)^2;


myobs(1);
exp(myobs(2));
abs(myobs(1))]

myBasisFcn = function_handle with value:


@(myobs)[myobs(2)^2;myobs(1);exp(myobs(2));abs(myobs(1))]


The output of the actor is the action, among the ones defined in actInfo, corresponding to the
element of softmax(W'*myBasisFcn(myobs)) which has the highest value. W is a weight matrix,
containing the learnable parameters, which must have as many rows as the length of the basis
function output, and as many columns as the number of possible actions.

Define an initial parameter matrix.

W0 = rand(4,3);

Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.

actor = rlStochasticActorRepresentation({myBasisFcn,W0},obsInfo,actInfo)

actor =
rlStochasticActorRepresentation with properties:

ActionInfo: [1x1 rl.util.rlFiniteSetSpec]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To check your actor use the getAction function to return one of the three possible actions,
depending on a given random observation and on the current parameter matrix.

v = getAction(actor,{rand(2,1)})

v = 1x1 cell array


{[3]}

You can now use the actor (along with a critic) to create a suitable discrete action space agent.

Create Stochastic Actor with Recurrent Neural Network

For this example, you create a stochastic actor with a discrete action space using a recurrent neural
network. You can also use a recurrent neural network for a continuous stochastic actor using the
same method.

Create an environment and obtain observation and action information.

env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);

Create a recurrent deep neural network for the actor. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.

actorNetwork = [
sequenceInputLayer(numObs,'Normalization','none','Name','state')
fullyConnectedLayer(8,'Name','fc')
reluLayer('Name','relu')


lstmLayer(8,'OutputMode','sequence','Name','lstm')
fullyConnectedLayer(numDiscreteAct,'Name','output')
softmaxLayer('Name','actionProb')];

Create a stochastic actor representation for the network.


actorOptions = rlRepresentationOptions('LearnRate',1e-3,...
'GradientThreshold',1);
actor = rlStochasticActorRepresentation(actorNetwork,obsInfo,actInfo,...
'Observation','state', actorOptions);

Version History
Introduced in R2020a

rlStochasticActorRepresentation is not recommended


Not recommended starting in R2022a

rlStochasticActorRepresentation is not recommended. Use either


rlDiscreteCategoricalActor or rlContinuousGaussianActor instead.

The following examples show some typical uses of rlStochasticActorRepresentation to create
neural network-based actors, and how to update your code with one of the new stochastic actor
approximator objects instead. The first pair of entries builds an actor with a discrete action space,
and the second pair builds an actor with a continuous action space.

Not recommended (network-based representation, discrete action space):
myActor = rlStochasticActorRepresentation(net,obsInfo,actInfo,'Observation',obsNames),
with actInfo defining a discrete action space and net having observations as inputs and a single
output layer with as many elements as the number of possible discrete actions.

Recommended (approximator object, discrete action space):
myActor = rlDiscreteCategoricalActor(net,obsInfo,actInfo,'ObservationInputNames',obsNames).
Use this syntax to create a stochastic actor object with a discrete action space. This actor samples
its action from a categorical (also known as Multinoulli) distribution.

Not recommended (network-based representation, continuous action space):
myActor = rlStochasticActorRepresentation(net,obsInfo,actInfo,'Observation',obsNames),
with actInfo defining a continuous action space and net having observations as inputs and a single
output layer with twice as many elements as the number of dimensions of the continuous action
space (representing, in sequence, all the means and all the standard deviations of every action
dimension).

Recommended (approximator object, continuous action space):
myActor = rlContinuousGaussianActor(net,obsInfo,actInfo,'ObservationInputNames',obsNames,'ActionMeanOutputNames',actMeanNames,'ActionStandardDeviationOutputNames',actStdNames).
Use this syntax to create a stochastic actor object with a continuous action space. This actor samples
its action from a Gaussian distribution, and you must provide the names of the network outputs
representing the mean and standard deviations for the action.

The following examples show a typical use of rlStochasticActorRepresentation to create a
(discrete action space) actor which uses a (linear in the learnable parameters) custom basis function,
and how to update your code with rlDiscreteCategoricalActor instead. In these function calls,
the first input argument is a two-element cell array containing both the handle to the custom basis
function and the initial weight vector or matrix.


Not recommended (custom basis function representation):
rep = rlStochasticActorRepresentation({basisFcn,W0},obsInfo,actInfo), where the basis
function has observations as inputs and actions as outputs, W0 is a matrix with as many columns as
the number of possible actions, and actInfo defines a discrete action space.

Recommended (custom basis function approximator):
rep = rlDiscreteCategoricalActor({basisFcn,W0},obsInfo,actInfo). Use this syntax to
create a stochastic actor object with a discrete action space which returns an action sampled from a
categorical (also known as Multinoulli) distribution.

See Also
Functions
rlDiscreteCategoricalActor | rlContinuousGaussianActor | rlRepresentationOptions
| getActionInfo | getObservationInfo

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlTable
Value table or Q table

Description
Value tables and Q tables are one way to represent critic networks for reinforcement learning. Value
tables store rewards for a finite set of observations. Q tables store rewards for corresponding finite
observation-action pairs.

To create a value function approximator using an rlTable object, use an rlValueFunction,


rlQValueFunction, or rlVectorQValueFunction object.
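For example, the following minimal sketch (assuming discrete observation and action specification objects obsInfo and actInfo are available, for instance from getObservationInfo and getActionInfo) wraps a Q table in a critic approximator.

qTable = rlTable(obsInfo,actInfo);
critic = rlQValueFunction(qTable,obsInfo,actInfo);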

Creation

Syntax
T = rlTable(obsinfo)
T = rlTable(obsinfo,actinfo)

Description

T = rlTable(obsinfo) creates a value table for the given discrete observations.

T = rlTable(obsinfo,actinfo) creates a Q table for the given discrete observations and


actions.

Input Arguments

obsinfo — Observation specification


rlFiniteSetSpec object

Observation specification, specified as an rlFiniteSetSpec object.

actinfo — Action specification


rlFiniteSetSpec object

Action specification, specified as an rlFiniteSetSpec object.

Properties
Table — Reward table
array

Reward table, returned as an array. When Table is a:

• Value table, it contains NO rows, where NO is the number of finite observation values.
• Q table, it contains NO rows and NA columns, where NA is the number of possible finite actions.


Object Functions
rlValueFunction Value function approximator object for reinforcement learning agents
rlQValueFunction Q-Value function approximator object for reinforcement learning agents
rlVectorQValueFunction Vector Q-value function approximator for reinforcement learning agents

Examples

Create a Value Table

This example shows how to use rlTable to create a value table. You can use such a table to
represent the critic of an actor-critic agent with a finite observation space.

Create an environment interface, and obtain its observation specifications.

env = rlPredefinedEnv("BasicGridWorld");
obsInfo = getObservationInfo(env)

obsInfo =
rlFiniteSetSpec with properties:

Elements: [25x1 double]


Name: "MDP Observations"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"

Create the value table using the observation specification.

vTable = rlTable(obsInfo)

vTable =
rlTable with properties:

Table: [25x1 double]

Create a Q Table

This example shows how to use rlTable to create a Q table. Such a table could be used to represent
the actor or critic of an agent with finite observation and action spaces.

Create an environment interface, and obtain its observation and action specifications.

env=rlMDPEnv(createMDP(8,["up";"down"]));
obsInfo = getObservationInfo(env)

obsInfo =
rlFiniteSetSpec with properties:

Elements: [8x1 double]


Name: "MDP Observations"
Description: [0x0 string]


Dimension: [1 1]
DataType: "double"

actInfo = getActionInfo(env)

actInfo =
rlFiniteSetSpec with properties:

Elements: [2x1 double]


Name: "MDP Actions"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"

Create the Q table using the observation and action specifications.

qTable = rlTable(obsInfo,actInfo)

qTable =
rlTable with properties:

Table: [8x2 double]

Version History
Introduced in R2019a

See Also
Topics
“Create Policies and Value Functions”


rlTD3Agent
Twin-delayed deep deterministic policy gradient reinforcement learning agent

Description
The twin-delayed deep deterministic policy gradient (TD3) algorithm is an actor-critic, model-free,
online, off-policy reinforcement learning method which computes an optimal policy that maximizes
the long-term reward. The action space can only be continuous.

Use rlTD3Agent to create one of the following types of agents.

• Twin-delayed deep deterministic policy gradient (TD3) agent with two Q-value functions. This
agent prevents overestimation of the value function by learning two Q value functions and using
the minimum values for policy updates.
• Delayed deep deterministic policy gradient (delayed DDPG) agent with a single Q value function.
This agent is a DDPG agent with target policy smoothing and delayed policy and target updates.

For more information, see “Twin-Delayed Deep Deterministic Policy Gradient Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.

Creation

Syntax
agent = rlTD3Agent(observationInfo,actionInfo)
agent = rlTD3Agent(observationInfo,actionInfo,initOpts)

agent = rlTD3Agent(actor,critics,agentOptions)

agent = rlTD3Agent( ___ ,agentOptions)

Description
Create Agent from Observation and Action Specifications

agent = rlTD3Agent(observationInfo,actionInfo) creates a TD3 agent for an environment


with the given observation and action specifications, using default initialization options. The actor
and critics in the agent use default deep neural networks built from the observation specification
observationInfo and the action specification actionInfo. The ObservationInfo and
ActionInfo properties of agent are set to the observationInfo and actionInfo input
arguments, respectively.

agent = rlTD3Agent(observationInfo,actionInfo,initOpts) creates a TD3 agent for an
environment with the given observation and action specifications. The
agent uses default networks configured using options specified in the initOpts object. For more
information on the initialization options, see rlAgentInitializationOptions.


Create Agent from Actor and Critic

agent = rlTD3Agent(actor,critics,agentOptions) creates an agent with the specified actor


and critic. To create a:

• TD3 agent, specify a two-element row vector of critics.


• Delayed DDPG agent, specify a single critic.

Specify Agent Options

agent = rlTD3Agent( ___ ,agentOptions) creates a TD3 agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.

Input Arguments

initOpts — Agent initialization options


rlAgentInitializationOptions object

Agent initialization options, specified as an rlAgentInitializationOptions object.

actor — Actor
rlContinuousDeterministicActor object

Actor, specified as an rlContinuousDeterministicActor object. For more information on creating


actors, see “Create Policies and Value Functions”.

critics — Critic network


rlQValueFunction object | two-element row vector of rlQValueFunction objects

Critic, specified as one of the following:

• rlQValueFunction object — Create a delayed DDPG agent with a single Q value function. This
agent is a DDPG agent with target policy smoothing and delayed policy and target updates.
• Two-element row vector of rlQValueFunction objects — Create a TD3 agent with two critic
value functions. The two critic networks must be unique rlQValueFunction objects with the
same observation and action specifications. The critics can either have different structures or the
same structure but with different initial parameters.

For more information on creating critics, see “Create Policies and Value Functions”.
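For example, the following minimal sketch (assuming actor is an rlContinuousDeterministicActor object and critic1 and critic2 are rlQValueFunction objects built on the same observation and action specifications) creates both agent variants.

opt = rlTD3AgentOptions;                             % default agent options
td3Agent = rlTD3Agent(actor,[critic1 critic2],opt);  % two critics: TD3 agent
delayedDDPGAgent = rlTD3Agent(actor,critic1,opt);    % one critic: delayed DDPG agent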

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of


specification objects defining properties such as dimensions, data type, and names of the observation
signals.

If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.


You can extract observationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.

ActionInfo — Action specification


specification object

Action specifications, specified as a reinforcement learning specification object defining properties


such as dimensions, data type, and names of the action signals.

Since a TD3 agent operates in a continuous action space, you must specify actionInfo as an
rlNumericSpec object.

If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlNumericSpec.

AgentOptions — Agent options


rlTD3AgentOptions object

Agent options, specified as an rlTD3AgentOptions object.

If you create a TD3 agent with default actor and critic that use recurrent neural networks, the default
value of AgentOptions.SequenceLength is 32.

UseExplorationPolicy — Option to use exploration policy


false (default) | true

Option to use exploration policy when selecting actions, specified as one of the following logical
values.

• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.

ExperienceBuffer — Experience buffer


rlReplayMemory object

Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer. Here:

• S is the current observation of the environment.


• A is the action taken by the agent.
• R is the reward for taking action A.
• S' is the next observation after taking action A.
• D is the is-done signal after taking action A.

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.


Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object

Examples

Create TD3 Agent from Observation and Action Specifications

Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force, applied to the mass,
ranging continuously from -2 to 2 Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.

rng(0)

Create a TD3 agent from the environment observation and action specifications.

agent = rlTD3Agent(obsInfo,actInfo);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0.0087]}


You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.

Create TD3 Agent Using Initialization Options

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.

% load predefined environment


env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");

% obtain observation and action specifications


obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).

initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);

The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.

% rng(0)

Create a TD3 agent from the environment observation and action specifications.

agent = rlTD3Agent(obsInfo,actInfo,initOpts);

Extract the deep neural networks from the actor.

actorNet = getModel(getActor(agent));

Extract the deep neural networks from the two critics. Note that getModel(critics) only returns
the first critic network.

critics = getCritic(agent);
criticNet1 = getModel(critics(1));
criticNet2 = getModel(critics(2));

Display the layers of the first critic network, and verify that each hidden fully connected layer has 128
neurons.

criticNet1.Layers

ans =
  13x1 Layer array with layers:

     1   'concat'         Concatenation     Concatenation of 3 inputs along dimension 1
     2   'relu_body'      ReLU              ReLU
     3   'fc_body'        Fully Connected   128 fully connected layer
     4   'body_output'    ReLU              ReLU
     5   'input_1'        Image Input       50x50x1 images
     6   'conv_1'         2-D Convolution   64 3x3x1 convolutions with stride [1 1] and padding
     7   'relu_input_1'   ReLU              ReLU
     8   'fc_1'           Fully Connected   128 fully connected layer
     9   'input_2'        Feature Input     1 features
    10   'fc_2'           Fully Connected   128 fully connected layer
    11   'input_3'        Feature Input     1 features
    12   'fc_3'           Fully Connected   128 fully connected layer
    13   'output'         Fully Connected   1 fully connected layer

Plot the networks of the actor and of the second critic, and display the number of weights.

plot(layerGraph(actorNet))

summary(actorNet)

Initialized: true

Number of learnables: 18.9M

Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features

plot(layerGraph(criticNet2))


summary(criticNet2)

Initialized: true

Number of learnables: 18.9M

Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features
3 'input_3' 1 features

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[0.0675]}

You can now test and train the agent within the environment.

Create TD3 Agent from Actor and Critic

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to


Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force ranging continuously from
-2 to 2 Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

A TD3 agent uses two Q-value function critics. To approximate each Q-value function, use a neural
network. The network for a single-output Q-value function critic must have two input layers, one for
the observation and the other for the action, and return a scalar value representing the expected
cumulative long-term reward following from the given observation and action.

Define each network path as an array of layer objects, and the dimensions of the observation and
action spaces from the environment specification objects.

% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="obsPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="obsPathOut")
];

% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="actPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="actPathOut")
];

% Common path
commonPath = [
concatenationLayer(1,2,Name="concat")
reluLayer
fullyConnectedLayer(1)
];

% Add layers to layergraph object


criticNet = layerGraph;
criticNet = addLayers(criticNet,obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);

% Connect layers
criticNet = connectLayers(criticNet,"obsPathOut","concat/in1");
criticNet = connectLayers(criticNet,"actPathOut","concat/in2");

To initialize the network weights differently for the two critics, create two different dlnetwork objects. You must do this because the agent constructor function does not accept two identical critics.

criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);

Display the number of weights.


summary(criticNet1)

Initialized: true

Number of learnables: 1.2k

Inputs:
1 'obsPathIn' 2 features
2 'actPathIn' 1 features

Create the two critics using rlQValueFunction and the two networks with different weights. Alternatively, if you use exactly the same network with the same weights, you must explicitly initialize the network each time (to make sure the weights are initialized differently) before passing it to rlQValueFunction. To do so, use initialize.

critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo);
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo);

Check the critics with a random observation and action input.


getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
-0.1330

getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
-0.1526

Create a neural network to be used as the approximation model within the actor. For TD3 agents, the actor executes a deterministic policy, which is implemented by a continuous deterministic actor. In this case, the network must take the observation signal as input and return an action. Therefore, the output layer must have as many elements as the dimension of the action space.

Define the network as an array of layer objects, and get the dimensions of the observation and action spaces from the environment specification objects.

actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(400)
reluLayer
fullyConnectedLayer(300)
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
tanhLayer
];

Convert to dlnetwork and display the number of weights.


actorNet = dlnetwork(actorNet);
summary(actorNet)

Initialized: true

Number of learnables: 121.8k

Inputs:
1 'input' 2 features


Create the actor using actorNet. TD3 agents use an rlContinuousDeterministicActor object
to implement the actor.

actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);

Check the actor with a random observation input.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0.0221]}

Specify training options for the critics.

criticOptions = rlOptimizerOptions( ...


Optimizer="adam", ...
LearnRate=1e-3,...
GradientThreshold=1, ...
L2RegularizationFactor=2e-4);

Specify training options for the actor.

actorOptions = rlOptimizerOptions( ...


Optimizer="adam", ...
LearnRate=1e-3,...
GradientThreshold=1, ...
L2RegularizationFactor=1e-5);

Specify agent options, including training options for actor and critics.

agentOptions = rlTD3AgentOptions;
agentOptions.DiscountFactor = 0.99;
agentOptions.TargetSmoothFactor = 5e-3;
agentOptions.TargetPolicySmoothModel.Variance = 0.2;
agentOptions.TargetPolicySmoothModel.LowerLimit = -0.5;
agentOptions.TargetPolicySmoothModel.UpperLimit = 0.5;
agentOptions.CriticOptimizerOptions = criticOptions;
agentOptions.ActorOptimizerOptions = actorOptions;

Create TD3 agent using actor, critics, and options.

agent = rlTD3Agent(actor,[critic1 critic2],agentOptions);

You can also create an rlTD3Agent object with a single critic. In this case, the object represents a
DDPG agent with target policy smoothing and delayed policy and target updates.

delayedDDPGAgent = rlTD3Agent(actor,critic1,agentOptions);

To check your agents, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0.0351]}

getAction(delayedDDPGAgent,{rand(obsInfo.Dimension)})


ans = 1x1 cell array


{[0.0296]}

You can now test and train either agent within the environment.

Create TD3 Agent with Recurrent Neural Networks

For this example, load the environment used in the example “Train DDPG Agent to Control Double
Integrator System”. The observation from the environment is a vector containing the position and
velocity of a mass. The action is a scalar representing a force ranging continuously from -2 to 2
Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");

Obtain observation and action specifications.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

A TD3 agent uses two Q-value function critics. To approximate each Q-value function, use a deep
recurrent neural network. The network for a single-output Q-value function critic must have two
input layers, one for the observation and the other for the action, and return a scalar value
representing the expected cumulative long-term reward following from the given observation and
action.

Define each network path as an array of layer objects, and the dimensions of the observation and
action spaces from the environment specification objects. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.

% Define observation path


obsPath = [
sequenceInputLayer(prod(obsInfo.Dimension),Name="obsIn")
fullyConnectedLayer(400)
reluLayer
fullyConnectedLayer(300,Name="obsOut")
];

% Define action path


actPath = [
sequenceInputLayer(prod(actInfo.Dimension),Name="actIn")
fullyConnectedLayer(300,Name="actOut")
];

% Define common path


commonPath = [
concatenationLayer(1,2,Name="cat")
reluLayer
lstmLayer(16);
fullyConnectedLayer(1)
];

% Add layers to layerGraph object


criticNet = layerGraph;
criticNet = addLayers(criticNet,obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);

% Connect paths
criticNet = connectLayers(criticNet,"obsOut","cat/in1");
criticNet = connectLayers(criticNet,"actOut","cat/in2");

To initialize the network weights differently for the two critics, create two different dlnetwork objects. You must do this because the agent constructor function does not accept two identical critics.

criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);

Display the number of weights.

summary(criticNet1)

Initialized: true

Number of learnables: 161.6k

Inputs:
1 'obsIn' Sequence input with 2 dimensions (CTB)
2 'actIn' Sequence input with 1 dimensions (CTB)

Create the critics using rlQValueFunction. Use the same network structure for both critics. Because criticNet1 and criticNet2 are separate dlnetwork objects, their weights are initialized differently.

critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo);
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo);

Check the critics with a random observation and action input.

getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
-0.0060

getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
0.0481

Since the critics use recurrent networks, the actor must also use a recurrent network as its approximation model. For TD3 agents, the actor executes a deterministic policy, which is implemented by a continuous deterministic actor. In this case, the network must take the observation signal as input and return an action. Therefore, the output layer must have as many elements as the dimension of the action space.

Define the network as an array of layer objects, and get the dimensions of the observation and action spaces from the environment specification objects.

actorNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(400)


lstmLayer(8)
reluLayer
fullyConnectedLayer(300,'Name','ActorFC2')
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
tanhLayer
];

Convert to dlnetwork and display the number of weights.

actorNet = dlnetwork(actorNet);
summary(actorNet)

Initialized: true

Number of learnables: 17.2k

Inputs:
1 'sequenceinput' Sequence input with 2 dimensions (CTB)

Create the actor using actorNet. TD3 agents use an rlContinuousDeterministicActor object
to implement the actor.

actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);

Check the actor with a random observation input.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1×1 cell array


{[0.0014]}

Specify training options for the critics.

criticOptions = rlOptimizerOptions( ...


Optimizer="adam", ...
LearnRate=1e-3,...
GradientThreshold=1, ...
L2RegularizationFactor=2e-4);

Specify training options for the actor.

actorOptions = rlOptimizerOptions( ...


Optimizer="adam", ...
LearnRate=1e-3,...
GradientThreshold=1, ...
L2RegularizationFactor=1e-5);

Specify agent options, including training options for actor and critics. To use a TD3 agent with
recurrent neural networks, you must specify a SequenceLength greater than 1.

agentOptions = rlTD3AgentOptions;
agentOptions.DiscountFactor = 0.99;
agentOptions.SequenceLength = 32;
agentOptions.TargetSmoothFactor = 5e-3;
agentOptions.TargetPolicySmoothModel.Variance = 0.2;
agentOptions.TargetPolicySmoothModel.LowerLimit = -0.5;
agentOptions.TargetPolicySmoothModel.UpperLimit = 0.5;


agentOptions.CriticOptimizerOptions = criticOptions;
agentOptions.ActorOptimizerOptions = actorOptions;

Create TD3 agent using actor, critics, and options.

agent = rlTD3Agent(actor,[critic1 critic2],agentOptions);

You can also create an rlTD3Agent object with a single critic. In this case, the object represents a
DDPG agent with target policy smoothing and delayed policy and target updates.

delayedDDPGAgent = rlTD3Agent(actor,critic1,agentOptions);

To check your agents, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1×1 cell array


{[0.0018]}

getAction(delayedDDPGAgent,{rand(obsInfo.Dimension)})

ans = 1×1 cell array


{[0.0015]}

You can now test and train either agent within the environment.

Version History
Introduced in R2020a

See Also
rlAgentInitializationOptions | rlTD3AgentOptions | rlQValueFunction |
rlContinuousDeterministicActor | Deep Network Designer

Topics
“Twin-Delayed Deep Deterministic Policy Gradient Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
“Train Biped Robot to Walk Using Reinforcement Learning Agents”


rlTD3AgentOptions
Options for TD3 agent

Description
Use an rlTD3AgentOptions object to specify options for twin-delayed deep deterministic policy gradient (TD3) agents. To create a TD3 agent, use rlTD3Agent.

For more information, see “Twin-Delayed Deep Deterministic Policy Gradient Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
opt = rlTD3AgentOptions
opt = rlTD3AgentOptions(Name,Value)

Description

opt = rlTD3AgentOptions creates an options object for use as an argument when creating a TD3
agent using all default options. You can modify the object properties using dot notation.

opt = rlTD3AgentOptions(Name,Value) sets option properties on page 3-362 using name-value


pairs. For example, rlTD3AgentOptions('DiscountFactor',0.95) creates an option set with a
discount factor of 0.95. You can specify multiple name-value pairs. Enclose each property name in
quotes.

Properties
ExplorationModel — Exploration noise model options
GaussianActionNoise object (default) | OrnsteinUhlenbeckActionNoise object

Noise model options, specified as a GaussianActionNoise object or an OrnsteinUhlenbeckActionNoise object. For more information on noise models, see “Noise Models” on page 3-365.

For an agent with multiple actions, if the actions have different ranges and units, it is likely that each
action requires different noise model parameters. If the actions have similar ranges and units, you
can set the noise parameters for all actions to the same value.

For example, for an agent with two actions, set the standard deviation of each action to a different
value while using the same decay rate for both standard deviations.


opt = rlTD3AgentOptions;
opt.ExplorationModel.StandardDeviation = [0.1 0.2];
opt.ExplorationModel.StandardDeviationDecayRate = 1e-4;

To use Ornstein-Uhlenbeck action noise, first create a default OrnsteinUhlenbeckActionNoise


object. Then, specify any nondefault model properties using dot notation.

opt = rlTD3AgentOptions;
opt.ExplorationModel = rl.option.OrnsteinUhlenbeckActionNoise;
opt.ExplorationModel.StandardDeviation = 0.05;

TargetPolicySmoothModel — Target smoothing noise model options


GaussianActionNoise object

Target smoothing noise model options, specified as a GaussianActionNoise object. Adding this noise to the target action keeps the policy from exploiting actions with incorrectly high Q-value estimates. For more information on noise models, see “Noise Models” on page 3-365.

For an agent with multiple actions, if the actions have different ranges and units, it is likely that each
action requires different smoothing noise model parameters. If the actions have similar ranges and
units, you can set the noise parameters for all actions to the same value.

For example, for an agent with two actions, set the standard deviation of each action to a different
value while using the same decay rate for both standard deviations.

opt = rlTD3AgentOptions;
opt.TargetPolicySmoothModel.StandardDeviation = [0.1 0.2];
opt.TargetPolicySmoothModel.StandardDeviationDecayRate = 1e-4;

PolicyUpdateFrequency — Number of steps between policy updates


2 (default) | positive integer

Number of steps between policy updates, specified as a positive integer.

ActorOptimizerOptions — Actor optimizer options


rlOptimizerOptions object

Actor optimizer options, specified as an rlOptimizerOptions object. These options let you specify training parameters for the actor approximator, such as the learning rate and gradient threshold, as well as the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and rlOptimizer.

CriticOptimizerOptions — Critic optimizer options


rlOptimizerOptions object

Critic optimizer options, specified as an rlOptimizerOptions object. These options let you specify training parameters for the critic approximators, such as the learning rate and gradient threshold, as well as the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions and rlOptimizer.
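
For example, the following sketch sets nondefault learning rates for the actor and the critics (the values are illustrative).

opt = rlTD3AgentOptions;
opt.ActorOptimizerOptions = rlOptimizerOptions(LearnRate=1e-4,GradientThreshold=1);
opt.CriticOptimizerOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);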

TargetSmoothFactor — Smoothing factor for target actor and critic updates


0.005 (default) | positive scalar less than or equal to 1

Smoothing factor for target actor and critic updates, specified as a positive scalar less than or equal
to 1. For more information, see “Target Update Methods”.


TargetUpdateFrequency — Number of steps between target actor and critic updates


2 (default) | positive integer

Number of steps between target actor and critic updates, specified as a positive integer. For more
information, see “Target Update Methods”.
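
For example, the following sketch configures slower and less frequent target updates (illustrative values).

opt = rlTD3AgentOptions;
opt.TargetSmoothFactor = 1e-3;
opt.TargetUpdateFrequency = 4;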

ResetExperienceBufferBeforeTraining — Option for clearing the experience buffer


true (default) | false

Option for clearing the experience buffer before training, specified as a logical value.

SequenceLength — Maximum batch-training trajectory length when using RNN


1 (default) | positive integer

Maximum batch-training trajectory length when using a recurrent neural network, specified as a
positive integer. This value must be greater than 1 when using a recurrent neural network and 1
otherwise.

MiniBatchSize — Size of random experience mini-batch


64 (default) | positive integer

Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the critic properties. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.

NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer

Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see [1], Chapter 7.

Note that if parallel training is enabled (that is, if you pass to train an rlTrainingOptions object in which the UseParallel property is set to true), then NumStepsToLookAhead must be set to 1; otherwise, an error is generated. This guarantees that experiences are stored contiguously.

ExperienceBufferLength — Experience buffer size


10000 (default) | positive integer

Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.
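
For example, the following sketch combines a larger buffer with a larger mini-batch (illustrative values).

opt = rlTD3AgentOptions;
opt.ExperienceBufferLength = 1e6;
opt.MiniBatchSize = 256;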

SampleTime — Sample time of agent


1 (default) | positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience


returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

DiscountFactor — Discount factor


0.99 (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.

Object Functions
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent

Examples

Create TD3 Agent Options Object

This example shows how to create a TD3 agent option object.

Create an rlTD3AgentOptions object that specifies the mini-batch size.

opt = rlTD3AgentOptions('MiniBatchSize',48)

opt =
rlTD3AgentOptions with properties:

ExplorationModel: [1x1 rl.option.GaussianActionNoise]


TargetPolicySmoothModel: [1x1 rl.option.GaussianActionNoise]
PolicyUpdateFrequency: 2
ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
CriticOptimizerOptions: [1x2 rl.option.rlOptimizerOptions]
TargetSmoothFactor: 0.0050
TargetUpdateFrequency: 2
ResetExperienceBufferBeforeTraining: 1
SequenceLength: 1
MiniBatchSize: 48
NumStepsToLookAhead: 1
ExperienceBufferLength: 10000
SampleTime: 1
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent sample time to 0.5.

opt.SampleTime = 0.5;

Algorithms
Noise Models
Gaussian Action Noise

A GaussianActionNoise object has the following numeric value properties.


Property                     Description                                    Default Value         Default Value
                                                                            (ExplorationModel)    (TargetPolicySmoothModel)
Mean                         Noise mean value                               0                     0
StandardDeviationDecayRate   Decay rate of the standard deviation           0                     0
StandardDeviation            Initial value of noise standard deviation      sqrt(0.1)             sqrt(0.2)
StandardDeviationMin         Minimum standard deviation, which must be      0.01                  0.01
                             less than StandardDeviation
LowerLimit                   Noise sample lower limit                       -Inf                  -0.5
UpperLimit                   Noise sample upper limit                       Inf                   0.5

At each time step k, the Gaussian noise v is sampled as shown in the following code.

w = Mean + randn(ActionSize).*StandardDeviation(k);
v(k+1) = min(max(w,LowerLimit),UpperLimit);

At each sample time step, the standard deviation decays as shown in the following code.

decayedStandardDeviation = StandardDeviation(k).*(1 - StandardDeviationDecayRate);
StandardDeviation(k+1) = max(decayedStandardDeviation,StandardDeviationMin);

Ornstein-Uhlenbeck Action Noise

An OrnsteinUhlenbeckActionNoise object has the following numeric value properties.

Property                     Description                                                 Default Value
InitialAction                Initial value of action                                     0
Mean                         Noise mean value                                            0
MeanAttractionConstant       Constant specifying how quickly the noise model output      0.15
                             is attracted to the mean
StandardDeviationDecayRate   Decay rate of the standard deviation                        0
StandardDeviation            Initial value of noise standard deviation                   0.3
StandardDeviationMin         Minimum standard deviation                                  0

At each sample time step k, the noise value v(k) is updated using the following formula, where Ts is
the agent sample time, and the initial value v(1) is defined by the InitialAction parameter.

v(k+1) = v(k) + MeanAttractionConstant.*(Mean - v(k)).*Ts
       + StandardDeviation(k).*randn(size(Mean)).*sqrt(Ts)


At each sample time step, the standard deviation decays as shown in the following code.

decayedStandardDeviation = StandardDeviation(k).*(1 - StandardDeviationDecayRate);
StandardDeviation(k+1) = max(decayedStandardDeviation,StandardDeviationMin);

You can calculate how many samples it will take for the standard deviation to be halved using this
simple formula.

halflife = log(0.5)/log(1-StandardDeviationDecayRate);
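
For example, with a decay rate of 1e-4 the standard deviation halves after roughly 6931 agent steps.

halflife = log(0.5)/log(1-1e-4)   % approximately 6.93e3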

For continuous action signals, it is important to set the noise standard deviation appropriately to
encourage exploration. It is common to set StandardDeviation*sqrt(Ts) to a value between 1%
and 10% of your action range.

If your agent converges on local optima too quickly, promote agent exploration by increasing the
amount of noise; that is, by increasing the standard deviation. Also, to increase exploration, you can
reduce the StandardDeviationDecayRate.
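
As a sketch, assuming a hypothetical action range of 4 (an action between -2 and 2) and an agent sample time of 0.05 seconds, targeting about 5% of the range gives the following.

Ts = 0.05;
actionRange = 4;
sigma = 0.05*actionRange/sqrt(Ts);   % approximately 0.89
opt = rlTD3AgentOptions(SampleTime=Ts);
opt.ExplorationModel.StandardDeviation = sigma;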

Version History
Introduced in R2020a

Properties defining noise probability distribution in the GaussianActionNoise object have changed
Behavior changed in R2021a

The properties defining the probability distribution of the Gaussian action noise model have changed.
This noise model is used by TD3 agents for exploration and target policy smoothing.

• The Variance property has been replaced by the StandardDeviation property.


• The VarianceDecayRate property has been replaced by the StandardDeviationDecayRate
property.
• The VarianceMin property has been replaced by the StandardDeviationMin property.

When a GaussianActionNoise noise object saved from a previous MATLAB release is loaded, the value of VarianceDecayRate is copied to StandardDeviationDecayRate, while the square roots of the values of Variance and VarianceMin are copied to StandardDeviation and StandardDeviationMin, respectively.

The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the Gaussian action noise model, use the new
property names instead.
Update Code

This table shows how to update your code to use the new property names for rlTD3AgentOptions
object td3opt.

Not Recommended                                     Recommended
td3opt.ExplorationModel.Variance = 0.5;             td3opt.ExplorationModel.StandardDeviation = sqrt(0.5);
td3opt.ExplorationModel.VarianceDecayRate = 0.1;    td3opt.ExplorationModel.StandardDeviationDecayRate = 0.1;
td3opt.ExplorationModel.VarianceMin = 0.1;          td3opt.ExplorationModel.StandardDeviationMin = sqrt(0.1);

Property names defining noise probability distribution in the OrnsteinUhlenbeckActionNoise object have changed
Behavior changed in R2021a

The properties defining the probability distribution of the Ornstein-Uhlenbeck (OU) noise model have
been renamed. TD3 agents use OU noise for exploration.

• The Variance property has been renamed StandardDeviation.


• The VarianceDecayRate property has been renamed StandardDeviationDecayRate.
• The VarianceMin property has been renamed StandardDeviationMin.

The default values of these properties remain the same. When an OrnsteinUhlenbeckActionNoise noise object saved from a previous MATLAB release is loaded, the values of Variance, VarianceDecayRate, and VarianceMin are copied to StandardDeviation, StandardDeviationDecayRate, and StandardDeviationMin, respectively.

The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the OU noise model, use the new property
names instead.
Update Code

This table shows how to update your code to use the new property names for rlTD3AgentOptions
object td3opt.

Not Recommended                                            Recommended
td3opt.ExplorationModel.Variance = 0.5;                    td3opt.ExplorationModel.StandardDeviation = sqrt(0.5);
td3opt.ExplorationModel.VarianceDecayRate = 0.1;           td3opt.ExplorationModel.StandardDeviationDecayRate = 0.1;
td3opt.ExplorationModel.VarianceMin = 0.1;                 td3opt.ExplorationModel.StandardDeviationMin = sqrt(0.1);
td3opt.TargetPolicySmoothModel.Variance = 0.5;             td3opt.TargetPolicySmoothModel.StandardDeviation = sqrt(0.5);
td3opt.TargetPolicySmoothModel.VarianceDecayRate = 0.1;    td3opt.TargetPolicySmoothModel.StandardDeviationDecayRate = 0.1;
td3opt.TargetPolicySmoothModel.VarianceMin = 0.1;          td3opt.TargetPolicySmoothModel.StandardDeviationMin = sqrt(0.1);

References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.


See Also
Topics
“Twin-Delayed Deep Deterministic Policy Gradient Agents”


rlTrainingOptions
Options for training reinforcement learning agents

Description
Use an rlTrainingOptions object to specify training options for an agent. To train an agent, use
train.

For more information on training agents, see “Train Reinforcement Learning Agents”.

Creation
Syntax
trainOpts = rlTrainingOptions
opt = rlTrainingOptions(Name,Value)

Description

trainOpts = rlTrainingOptions returns the default options for training a reinforcement


learning agent. Use training options to specify parameters for the training session, such as the
maximum number of episodes to train, criteria for stopping training, criteria for saving agents, and
options for using parallel computing. After configuring the options, use trainOpts as an input
argument for train.

opt = rlTrainingOptions(Name,Value) creates a training option set and sets object


“Properties” on page 3-370 using one or more name-value pair arguments.

Properties
MaxEpisodes — Maximum number of episodes to train the agent
500 (default) | positive integer

Maximum number of episodes to train the agent, specified as a positive integer. Regardless of other
criteria for termination, training terminates after MaxEpisodes.
Example: 'MaxEpisodes',1000

MaxStepsPerEpisode — Maximum number of steps to run per episode


500 (default) | positive integer

Maximum number of steps to run per episode, specified as a positive integer. In general, you define
episode termination conditions in the environment. This value is the maximum number of steps to run
in the episode if other termination conditions are not met.
Example: 'MaxStepsPerEpisode',1000

ScoreAveragingWindowLength — Window length for averaging


5 (default) | positive integer scalar | positive integer vector


Window length for averaging the scores, rewards, and number of steps for each agent, specified as a
scalar or vector.

If the training environment contains a single agent, specify ScoreAveragingWindowLength as a


scalar.

If the training environment is a multi-agent Simulink environment, specify a scalar to apply the same
window length to all agents.

To use a different window length for each agent, specify ScoreAveragingWindowLength as a vector. In this case, the order of the elements in the vector corresponds to the order of the agents used during environment creation.

For options expressed in terms of averages, ScoreAveragingWindowLength is the number of


episodes included in the average. For instance, if StopTrainingCriteria is "AverageReward",
and StopTrainingValue is 500 for a given agent, then for that agent, training terminates when the
average reward over the number of episodes specified in ScoreAveragingWindowLength equals or
exceeds 500. For the other agents, training continues until:

• All agents reach their stop criteria.


• The number of episodes reaches MaxEpisodes.
• You stop training by clicking the Stop Training button in Episode Manager or pressing Ctrl-C at
the MATLAB command line.

Example: 'ScoreAveragingWindowLength',10

StopTrainingCriteria — Training termination condition


"AverageSteps" (default) | "AverageReward" | "EpisodeCount" | ...

Training termination condition, specified as one of the following strings:

• "AverageSteps" — Stop training when the running average number of steps per episode equals
or exceeds the critical value specified by the option StopTrainingValue. The average is
computed using the window 'ScoreAveragingWindowLength'.
• "AverageReward" — Stop training when the running average reward equals or exceeds the
critical value.
• "EpisodeReward" — Stop training when the reward in the current episode equals or exceeds the
critical value.
• "GlobalStepCount" — Stop training when the total number of steps in all episodes (the total
number of times the agent is invoked) equals or exceeds the critical value.
• "EpisodeCount" — Stop training when the number of training episodes equals or exceeds the
critical value.

Example: 'StopTrainingCriteria',"AverageReward"

StopTrainingValue — Critical value of training termination condition


500 (default) | scalar | vector

Critical value of the training termination condition, specified as a scalar or a vector.

If the training environment contains a single agent, specify StopTrainingValue as a scalar.

If the training environment is a multi-agent Simulink environment, specify a scalar to apply the same
termination criterion to all agents. To use a different termination criterion for each agent, specify


StopTrainingValue as a vector. In this case, the order of the elements in the vector corresponds to
the order of the agents used during environment creation.

For a given agent, training ends when the termination condition specified by the
StopTrainingCriteria option equals or exceeds this value. For the other agents, the training
continues until:

• All agents reach their stop criteria.


• The number of episodes reaches MaxEpisodes.
• You stop training by clicking the Stop Training button in Episode Manager or pressing Ctrl-C at
the MATLAB command line.

For instance, if StopTrainingCriteria is "AverageReward", and StopTrainingValue is 100


for a given agent, then for that agent, training terminates when the average reward over the number
of episodes specified in ScoreAveragingWindowLength equals or exceeds 100.
Example: 'StopTrainingValue',100

SaveAgentCriteria — Condition for saving agents during training


"none" (default) | "EpisodeReward" | "AverageReward" | "EpisodeCount" | ...

Condition for saving agents during training, specified as one of the following strings:

• "none" — Do not save any agents during training.


• "EpisodeReward" — Save the agent when the reward in the current episode equals or exceeds
the critical value.
• "AverageSteps" — Save the agent when the running average number of steps per episode
equals or exceeds the critical value specified by the option StopTrainingValue. The average is
computed using the window 'ScoreAveragingWindowLength'.
• "AverageReward" — Save the agent when the running average reward over all episodes equals
or exceeds the critical value.
• "GlobalStepCount" — Save the agent when the total number of steps in all episodes (the total
number of times the agent is invoked) equals or exceeds the critical value.
• "EpisodeCount" — Save the agent when the number of training episodes equals or exceeds the
critical value.

Set this option to store candidate agents that perform well according to the criteria you specify. When
you set this option to a value other than "none", the software sets the SaveAgentValue option to
500. You can change that value to specify the condition for saving the agent.

For instance, suppose you want to store for further testing any agent that yields an episode reward
that equals or exceeds 100. To do so, set SaveAgentCriteria to "EpisodeReward" and set the
SaveAgentValue option to 100. When an episode reward equals or exceeds 100, train saves the
corresponding agent in a MAT file in the folder specified by the SaveAgentDirectory option. The
MAT file is called AgentK.mat, where K is the number of the corresponding episode. The agent is
stored within that MAT file as saved_agent.
Example: 'SaveAgentCriteria',"EpisodeReward"

SaveAgentValue — Critical value of condition for saving agents


"none" (default) | 500 | scalar | vector

Critical value of the condition for saving agents, specified as a scalar or a vector.


If the training environment contains a single agent, specify SaveAgentValue as a scalar.

If the training environment is a multi-agent Simulink environment, specify a scalar to apply the same
saving criterion to each agent. To save the agents when one meets a particular criterion, specify
SaveAgentValue as a vector. In this case, the order of the elements in the vector corresponds to the
order of the agents used when creating the environment. When a criterion for saving an agent is met, all agents are saved in the same MAT file.

When you specify a condition for saving candidate agents using SaveAgentCriteria, the software
sets this value to 500. Change the value to specify the condition for saving the agent. See the
SaveAgentCriteria option for more details.
Example: 'SaveAgentValue',100

SaveAgentDirectory — Folder for saved agents


"savedAgents" (default) | string | character vector

Folder for saved agents, specified as a string or character vector. The folder name can contain a full
or relative path. When an episode occurs that satisfies the condition specified by the
SaveAgentCriteria and SaveAgentValue options, the software saves the agents in a MAT file in
this folder. If the folder does not exist, train creates it. When SaveAgentCriteria is "none", this
option is ignored and train does not create a folder.
Example: 'SaveAgentDirectory', pwd + "\run1\Agents"

UseParallel — Flag for using parallel training


false (default) | true

Flag for using parallel training, specified as a logical. Setting this option to true configures
training to use parallel processing to simulate the environment, thereby enabling usage of multiple
cores, processors, computer clusters or cloud resources to speed up training. To specify options for
parallel training, use the ParallelizationOptions property.

When UseParallel is true, then for DQN, DDPG, TD3, and SAC agents the NumStepsToLookAhead property of the corresponding agent option object must be set to 1; otherwise an error is generated. This guarantees that experiences are stored contiguously. When AC agents are trained in parallel, a warning is generated if the StepsUntilDataIsSent property of the ParallelizationOptions object is set to a different value than the NumStepsToLookAhead property of the AC agent option object.

Note that if you want to speed up deep neural network calculations (such as gradient computation,
parameter update and prediction) using a local GPU, you do not need to set UseParallel to true.
Instead, when creating your actor or critic representation, use an rlRepresentationOptions
object in which the UseDevice option is set to "gpu". Using parallel computing or the GPU requires
Parallel Computing Toolbox software. Using computer clusters or cloud resources additionally
requires MATLAB Parallel Server. For more information about training using multicore processors
and GPUs, see “Train Agents Using Parallel Computing and GPUs”.
Example: 'UseParallel',true

ParallelizationOptions — Options to control parallel training


ParallelTraining object

Parallelization options to control parallel training, specified as a ParallelTraining object. For


more information about training using parallel computing, see “Train Reinforcement Learning
Agents”.


The ParallelTraining object has the following properties, which you can modify using dot
notation after creating the rlTrainingOptions object.

Mode — Parallel computing mode


"sync" (default) | "async"

Parallel computing mode, specified as one of the following:

• "sync" — Use parpool to run synchronous training on the available workers. In this case,
workers pause execution until all workers are finished. The host updates the actor and critic
parameters based on the results from all the workers and sends the updated parameters to all
workers. Note that synchronous training is required for gradient-based parallelization; that is, when DataToSendFromWorkers is set to "gradients", Mode must be set to "sync".
• "async" — Use parpool to run asynchronous training on the available workers. In this case,
workers send their data back to the host as soon as they finish and receive updated parameters
from the host. The workers then continue with their task.

WorkerRandomSeeds — Randomizer initialization for workers


–1 (default) | –2 | vector

Randomizer initialization for workers, specified as one of the following:

• –1 — Assign a unique random seed to each worker. The value of the seed is the worker ID.
• –2 — Do not assign a random seed to the workers.
• Vector — Manually specify the random seed for each worker. The number of elements in the
vector must match the number of workers.
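
For example, to run parallel training while assigning fixed seeds to four workers (hypothetical values):

trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.WorkerRandomSeeds = [1 2 3 4];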

TransferBaseWorkspaceVariables — Option to send model and workspace variables to


parallel workers
"on" (default) | "off"

Option to send model and workspace variables to parallel workers, specified as "on" or "off". When
the option is "on", the host sends variables used in models and defined in the base MATLAB
workspace to the workers.

AttachedFiles — Additional files to attach to the parallel pool


[] (default) | string | string array

Additional files to attach to the parallel pool, specified as a string or string array.

SetupFcn — Function to run before training starts


[] (default) | function handle

Function to run before training starts, specified as a handle to a function having no input arguments.
This function is run once per worker before training begins. Write this function to perform any
processing that you need prior to training.

CleanupFcn — Function to run after training ends


[] (default) | function handle

Function to run after training ends, specified as a handle to a function having no input arguments.
You can write this function to clean up the workspace or perform other processing after training
terminates.
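
For example, the following sketch attaches hypothetical handles that run once per worker before and after training.

trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.SetupFcn = @() addpath("myHelpers");      % hypothetical helper folder
trainOpts.ParallelizationOptions.CleanupFcn = @() disp("worker finished");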


Verbose — Display training progress on the command line


false (0) (default) | true (1)

Display training progress on the command line, specified as the logical values false (0) or true (1).
Set to true to write information from each training episode to the MATLAB command line during
training.

StopOnError — Option to stop training when error occurs


"on" (default) | "off"

Option to stop training when an error occurs during an episode, specified as "on" or "off". When
this option is "off", errors are captured and returned in the SimulationInfo output of train, and
training continues to the next episode.

Plots — Option to display training progress with Episode Manager


"training-progress" (default) | "none"

Option to display training progress with Episode Manager, specified as "training-progress" or


"none". By default, calling train opens the Reinforcement Learning Episode Manager, which
graphically and numerically displays information about the training progress, such as the reward for
each episode, average reward, number of episodes, and total number of steps. (For more information,
see train.) To turn off this display, set this option to "none".

Object Functions
train Train reinforcement learning agents within a specified environment

Examples

Configure Options for Training

Create an options set for training a reinforcement learning agent. Set the maximum number of
episodes and the maximum number of steps per episode to 1000. Configure the options to stop
training when the average reward equals or exceeds 480, and turn on both the command-line display
and Reinforcement Learning Episode Manager for displaying training results. You can set the options
using name-value pair arguments when you create the options set. Any options that you do not
explicitly set have their default values.

trainOpts = rlTrainingOptions(...
'MaxEpisodes',1000,...
'MaxStepsPerEpisode',1000,...
'StopTrainingCriteria',"AverageReward",...
'StopTrainingValue',480,...
'Verbose',true,...
'Plots',"training-progress")

trainOpts =
rlTrainingOptions with properties:

MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 5
StopTrainingCriteria: "AverageReward"
StopTrainingValue: 480


SaveAgentCriteria: "none"
SaveAgentValue: "none"
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelTraining]

Alternatively, create a default options set and use dot notation to change some of the values.
trainOpts = rlTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 1000;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 480;
trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";

trainOpts

trainOpts =
rlTrainingOptions with properties:

MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 5
StopTrainingCriteria: "AverageReward"
StopTrainingValue: 480
SaveAgentCriteria: "none"
SaveAgentValue: "none"
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelTraining]

You can now use trainOpts as an input argument to the train command.

Configure Parallel Computing Options for Training

To turn on parallel computing for training a reinforcement learning agent, set the UseParallel
training option to true.
trainOpts = rlTrainingOptions(UseParallel=true);

To configure your parallel training, configure the fields of the


trainOpts.ParallelizationOptions. For example, specify the asynchronous training mode:
trainOpts.ParallelizationOptions.Mode = "async";
trainOpts.ParallelizationOptions

ans =
ParallelTraining with properties:


Mode: "async"
WorkerRandomSeeds: -1
TransferBaseWorkspaceVariables: "on"
AttachedFiles: []
SetupFcn: []
CleanupFcn: []

You can now use trainOpts as an input argument to the train command to perform training with
parallel computing.

Configure Options for A3C Training

To train an agent using the asynchronous advantage actor-critic (A3C) method, you must set the
agent and parallel training options appropriately.

When creating the AC agent, set the NumStepsToLookAhead value to be greater than 1. Common
values are 64 and 128.

agentOpts = rlACAgentOptions(NumStepsToLookAhead=64);

Use agentOpts when creating your agent. Alternatively, create your agent first and then modify its
options, including the actor and critic options later using dot notation.

Configure the training algorithm to use asynchronous parallel training.

trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.Mode = "async";

Configure the workers to return gradient data to the host. Also, set the number of steps before the
workers send data back to the host to match the number of steps to look ahead.

trainOpts.ParallelizationOptions.DataToSendFromWorkers = ...
"gradients";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = ...
agentOpts.NumStepsToLookAhead;

Use trainOpts when training your agent.

For an example on asynchronous advantage actor-critic agent training, see “Train AC Agent to
Balance Cart-Pole System Using Parallel Computing”.

Version History
Introduced in R2019a

Training Parallelization Options: DataToSendFromWorkers and StepsUntilDataIsSent


properties are no longer active
Warns starting in R2022a


The property DataToSendFromWorkers of the ParallelizationOptions object is no longer


active and will be removed in a future release. The data sent from the workers to the learner is now
automatically determined based on agent type.

The property StepsUntilDataIsSent of the ParallelizationOptions object is no longer active and will be removed in a future release. Data is now sent from the workers to the learner at the end of each episode.

rlTrainingOptions is not recommended for multi-agent training
Not recommended starting in R2022a

rlTrainingOptions is not recommended for concurrently training agents in multi-agent environments. Use rlMultiAgentTrainingOptions instead.

rlMultiAgentTrainingOptions is specifically built for multi-agent reinforcement learning, and


allows you to group agents according to a common learning strategy and specify whether their
learning is centralized (that is all agents in a group share experiences) or decentralized (agents do
not share experiences), whereas rlTrainingOptions only allows for decentralized learning.

See Also
train | rlMultiAgentTrainingOptions

Topics
“Train Reinforcement Learning Agents”


rlTRPOAgent
Trust region policy optimization reinforcement learning agent

Description
Trust region policy optimization (TRPO) is a model-free, online, on-policy, policy gradient
reinforcement learning method. This algorithm prevents significant performance drops compared to
standard policy gradient methods by keeping the updated policy within a trust region close to the
current policy. The action space can be either discrete or continuous.

For more information on TRPO agents, see “Trust Region Policy Optimization Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.

Creation

Syntax
agent = rlTRPOAgent(observationInfo,actionInfo)
agent = rlTRPOAgent(observationInfo,actionInfo,initOpts)

agent = rlTRPOAgent(actor,critic)

agent = rlTRPOAgent( ___ ,agentOptions)

Description

Create Agent from Observation and Action Specifications

agent = rlTRPOAgent(observationInfo,actionInfo) creates a trust region policy


optimization (TRPO) agent for an environment with the given observation and action specifications,
using default initialization options. The actor and critic in the agent use default deep neural networks
built from the observation specification observationInfo and the action specification actionInfo.
The ObservationInfo and ActionInfo properties of agent are set to the observationInfo and
actionInfo input arguments, respectively.

agent = rlTRPOAgent(observationInfo,actionInfo,initOpts) creates a TRPO agent for


an environment with the given observation and action specifications. The agent uses default networks
configured using options specified in the initOpts object. TRPO agents do not support recurrent
neural networks. For more information on the initialization options, see
rlAgentInitializationOptions.

Create Agent from Actor and Critic

agent = rlTRPOAgent(actor,critic) creates a TRPO agent with the specified actor and critic,
using the default options for the agent.


Specify Agent Options

agent = rlTRPOAgent( ___ ,agentOptions) creates a TRPO agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.

Input Arguments

initOpts — Agent initialization options


rlAgentInitializationOptions object

Agent initialization options, specified as an rlAgentInitializationOptions object.

TRPO agents do not support recurrent neural networks. Therefore, initOpts.UseRNN must be false.

actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object

Actor that implements the policy, specified as an rlDiscreteCategoricalActor or rlContinuousGaussianActor function approximator object. For more information on creating actor approximators, see “Create Policies and Value Functions”.

critic — Critic
rlValueFunction object

Critic that estimates the discounted long-term reward, specified as an rlValueFunction object. For
more information on creating critic approximators, see “Create Policies and Value Functions”.
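
As a minimal sketch, the following builds a value-function critic and a discrete stochastic actor for a hypothetical environment with a four-element observation vector and three possible actions, and then creates the agent.

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

criticNet = dlnetwork([
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(1)
    ]);
critic = rlValueFunction(criticNet,obsInfo);

actorNet = dlnetwork([
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(numel(actInfo.Elements))
    ]);
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);

agent = rlTRPOAgent(actor,critic);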

Properties
ObservationInfo — Observation specifications
specification object | array of specification objects

Observation specifications, specified as a reinforcement learning specification object or an array of specification objects defining properties such as dimensions, data type, and names of the observation signals.

If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.

You can extract observationInfo from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

ActionInfo — Action specifications


specification object

Action specifications, specified as a reinforcement learning specification object defining properties such as dimensions, data type, and names of the action signals.

For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.

For a continuous action space, you must specify actionInfo as an rlNumericSpec object.


If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

AgentOptions — Agent options


rlTRPOAgentOptions object

Agent options, specified as an rlTRPOAgentOptions object.

UseExplorationPolicy — Option to use exploration policy


true (default) | false

Option to use exploration policy when selecting actions, specified as one of the following logical values.

• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.

Note This option affects only simulation and deployment; it does not affect training.
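For example, a minimal sketch of simulating with the greedy (deterministic) policy, assuming agent and
env already exist:

% Disable exploration so sim uses the maximum-likelihood action.
agent.UseExplorationPolicy = false;
simOpts = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOpts);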

SampleTime — Sample time of agent


positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent


generatePolicyFunction Generate function that evaluates policy of an agent or policy object

Examples

Create Discrete TRPO Agent from Observation and Action Specifications

Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).

% load predefined environment


env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");

Obtain the observation and action specifications for this environment.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.

% rng(0)

Create a TRPO agent from the environment observation and action specifications.

agent = rlTRPOAgent(obsInfo,actInfo);

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[-2]}

You can now test and train the agent within the environment.

Create Continuous TRPO Agent Using Initialization Options

Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.

env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");

Obtain observation and action specifications for this environment.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);


Create an agent initialization options object, specifying that each hidden fully connected layer in the
network must have 128 neurons.

initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);

The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.

% rng(0)

Create a TRPO agent from the environment observation and action specifications using the specified
initialization options.

agent = rlTRPOAgent(obsInfo,actInfo,initOpts);

Extract the deep neural networks from both the agent actor and critic.

actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));

You can verify that the networks have 128 units in their hidden fully connected layers. For example,
display the layers of the critic network.

criticNet.Layers

ans =
11x1 Layer array with layers:

1 'concat' Concatenation Concatenation of 2 inputs along dimension 1
2 'relu_body' ReLU ReLU
3 'fc_body' Fully Connected 128 fully connected layer
4 'body_output' ReLU ReLU
5 'input_1' Image Input 50x50x1 images
6 'conv_1' 2-D Convolution 64 3x3x1 convolutions with stride [1 1] and padding
7 'relu_input_1' ReLU ReLU
8 'fc_1' Fully Connected 128 fully connected layer
9 'input_2' Feature Input 1 features
10 'fc_2' Fully Connected 128 fully connected layer
11 'output' Fully Connected 1 fully connected layer

To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})

ans = 1x1 cell array


{[0.9228]}

You can now test and train the agent within the environment.

Create Trust Region Policy Optimization Agent

Create an environment interface, and obtain its observation and action specifications.


env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

For TRPO agents, the critic estimates a value function; therefore, it must take the observation signal
as input and return a scalar value. Create a deep neural network to be used as the approximation model
within the critic. Define the network as an array of layer objects.

criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)
];

Convert to a dlnetwork object and display the number of parameters.

criticNet = dlnetwork(criticNet);
summary(criticNet)

Initialized: true

Number of learnables: 601

Inputs:
1 'input' 4 features

Create the critic using criticNet. TRPO agents use an rlValueFunction object to implement the
critic.

critic = rlValueFunction(criticNet,obsInfo);

Check the critic with a random observation input.

getValue(critic,{rand(obsInfo.Dimension)})

ans = single
-0.2479

To approximate the policy within the actor, use a neural network. For TRPO agents, the actor executes
a stochastic policy, which for discrete action spaces is implemented by a discrete categorical actor. In
this case, the approximator must take the observation signal as input and return a probability for each
action. Therefore, the output layer must have as many elements as the number of possible actions.

Define the network as an array of layer objects, getting the dimension of the observation space and
the number of possible actions from the environment specification objects.

actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(200)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];

Convert to a dlnetwork object and display the number of parameters.

actorNet = dlnetwork(actorNet);
summary(actorNet)


Initialized: true

Number of learnables: 1.4k

Inputs:
1 'input' 4 features

Create the actor using actorNet. TRPO agents use an rlDiscreteCategoricalActor object to
implement the actor for discrete action spaces.

actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);

Check the actor with a random observation input.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-10]}

Create a TRPO agent using the actor and the critic.

agent = rlTRPOAgent(actor,critic)

agent =
rlTRPOAgent with properties:

AgentOptions: [1x1 rl.option.rlTRPOAgentOptions]


UseExplorationPolicy: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 1

Specify agent options, including training options for the actor and the critic.

agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;

agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;

Check your agent with a random observation input.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-10]}

You can now test and train the agent against the environment.

Create Continuous TRPO Agent

Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”. The observation from the
environment is a vector containing the position and velocity of a mass. The action is a scalar
representing a force applied to the mass, ranging continuously from -2 to 2 Newton.

env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)

obsInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"

actInfo = getActionInfo(env)

actInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"

In this example, the action is a scalar value representing a force ranging from -2 to 2 newtons. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation when defining the actor network. Store these limits in the action specification so you can
easily access them later.

actInfo.LowerLimit=-2;
actInfo.UpperLimit=2;

The actor and critic networks are initialized randomly. You can ensure reproducibility by fixing the
seed of the random generator.

rng(0)

Create a deep neural network to be used as the approximation model within the critic. For TRPO agents,
the critic estimates a value function; therefore, it must take the observation signal as input and return
a scalar value.

criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)];

Convert to a dlnetwork object and display the number of parameters.

criticNet = dlnetwork(criticNet);
summary(criticNet)


Initialized: true

Number of learnables: 401

Inputs:
1 'input' 2 features

Create the critic using criticNet. TRPO agents use an rlValueFunction object to implement the
critic.

critic = rlValueFunction(criticNet,obsInfo);

Check the critic with a random observation input.

getValue(critic,{rand(obsInfo.Dimension)})

ans = single
-0.0899

To approximate the policy within the actor, use a neural network. For TRPO agents, the actor
executes a stochastic policy, which for continuous action spaces is implemented by a continuous
Gaussian actor. In this case, the network must take the observation signal as input and return both a
mean value and a standard deviation value for each action. Therefore, it must have two output layers
(one for the mean values, the other for the standard deviation values), each having as many elements
as the dimension of the action space.

Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.

Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces, and the action range limits from the environment specification objects. Specify a name
for the input and output layers, so you can later explicitly associate them with the appropriate
environment channel.

% Define common input path layer


commonPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="comPathIn")
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1,Name="comPathOut") ];

% Define mean value path


meanPath = [
fullyConnectedLayer(32,Name="meanPathIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension));
tanhLayer;
scalingLayer(Name="meanPathOut",Scale=actInfo.UpperLimit) ];

% Define standard deviation path


sdevPath = [
fullyConnectedLayer(32,Name="stdPathIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension));
softplusLayer(Name="stdPathOut") ];


% Add layers to layerGraph object


actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,sdevPath);

% Connect paths
actorNet = connectLayers(actorNet,"comPathOut","meanPathIn/in");
actorNet = connectLayers(actorNet,"comPathOut",'stdPathIn/in');

% Plot network
plot(actorNet)

% Convert to dlnetwork and display number of weights


actorNet = dlnetwork(actorNet);
summary(actorNet)

Initialized: true

Number of learnables: 595

Inputs:
1 'comPathIn' 2 features

Create the actor using actorNet. TRPO agents use an rlContinuousGaussianActor object to
implement the actor for continuous action spaces.


actor = rlContinuousGaussianActor(actorNet, obsInfo, actInfo, ...


'ActionMeanOutputNames',"meanPathOut",...
'ActionStandardDeviationOutputNames',"stdPathOut",...
'ObservationInputNames',"comPathIn");

Check the actor with a random observation input.

getAction(actor,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[0.2732]}

Create a TRPO agent using the actor and the critic.

agent = rlTRPOAgent(actor,critic)

agent =
rlTRPOAgent with properties:

AgentOptions: [1x1 rl.option.rlTRPOAgentOptions]


UseExplorationPolicy: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlNumericSpec]
SampleTime: 1

Specify agent options, including training options for the actor and the critic.

agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;

agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;

Check your agent with a random observation input.

getAction(agent,{rand(obsInfo.Dimension)})

ans = 1x1 cell array


{[-0.5005]}

You can now test and train the agent within the environment.

Tips
• For continuous action spaces, this agent does not enforce the constraints set by the action
specification. In this case, you must enforce action space constraints within the environment (see
the sketch after this list).
• While tuning the learning rate of the actor network is necessary for PPO agents, it is not
necessary for TRPO agents.
• For high-dimensional observations, such as for images, it is recommended to use PPO, SAC, or
TD3 agents.
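The following minimal sketch shows one way to enforce action constraints inside a custom function
environment step function (the function name, the -2 to 2 limits, and the placeholder dynamics are
hypothetical and for illustration only):

function [nextObs,reward,isDone,loggedSignals] = myStepFcn(action,loggedSignals)
    % Saturate the action to the valid range before applying it,
    % because the TRPO agent does not enforce action limits itself.
    action = max(min(action,2),-2);
    % ... compute the environment dynamics using the saturated action ...
    nextObs = loggedSignals.State;  % placeholder next observation
    reward = 0;                     % placeholder reward
    isDone = false;                 % placeholder termination flag
end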


Version History
Introduced in R2021b

See Also
rlTRPOAgentOptions | rlValueFunction | rlDiscreteCategoricalActor |
rlContinuousGaussianActor | Deep Network Designer

Topics
“Trust Region Policy Optimization Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”


rlTRPOAgentOptions
Options for TRPO agent

Description
Use an rlTRPOAgentOptions object to specify options for trust region policy optimization (TRPO)
agents. To create a TRPO agent, use rlTRPOAgent.

For more information on TRPO agents, see “Trust Region Policy Optimization Agents”.

For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.

Creation

Syntax
opt = rlTRPOAgentOptions
opt = rlTRPOAgentOptions(Name,Value)

Description

opt = rlTRPOAgentOptions creates an rlTRPOAgentOptions object for use as an argument when
creating a TRPO agent using all default settings. You can modify the object properties using dot
notation.

opt = rlTRPOAgentOptions(Name,Value) sets option properties on page 3-391 using name-


value arguments. For example, rlTRPOAgentOptions('DiscountFactor',0.95) creates an
option set with a discount factor of 0.95. You can specify multiple name-value arguments. Enclose
each property name in quotes.

Properties
ExperienceHorizon — Number of steps the agent interacts with the environment before
learning
512 (default) | positive integer

Number of steps the agent interacts with the environment before learning from its experience,
specified as a positive integer.

The ExperienceHorizon value must be greater than or equal to the MiniBatchSize value.

MiniBatchSize — Mini-batch size


128 (default) | positive integer

Mini-batch size used for each learning epoch, specified as a positive integer. When the agent uses a
recurrent neural network, MiniBatchSize is treated as the training trajectory length.


The MiniBatchSize value must be less than or equal to the ExperienceHorizon value.
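For example, a minimal sketch that sets both options consistently (the values are illustrative):

opt = rlTRPOAgentOptions('ExperienceHorizon',1024,'MiniBatchSize',256);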

EntropyLossWeight — Entropy loss weight


0.01 (default) | scalar value between 0 and 1

Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.

When gradients are computed during training, an additional gradient component is computed for
minimizing the entropy loss. For more information, see “Entropy Loss”.

NumEpoch — Number of epochs


1 (default) | positive integer

Number of epochs for which the actor and critic networks learn from the current experience set,
specified as a positive integer.

AdvantageEstimateMethod — Method for estimating advantage values


"gae" (default) | "finite-horizon"

Method for estimating advantage values, specified as one of the following:

• "gae" — Generalized advantage estimator


• "finite-horizon" — Finite horizon estimation

For more information on these methods, see the training algorithm information in “Proximal Policy
Optimization Agents”.

GAEFactor — Smoothing factor for generalized advantage estimator


0.95 (default) | scalar value between 0 and 1

Smoothing factor for generalized advantage estimator, specified as a scalar value between 0 and 1,
inclusive. This option applies only when the AdvantageEstimateMethod option is "gae".

ConjugateGradientDamping — Conjugate gradient damping factor


1e-4 (default) | nonnegative scalar

Conjugate gradient damping factor for numerical stability, specified as a nonnegative scalar.

KLDivergenceLimit — Upper limit for KL divergence


0.01 (default) | positive scalar

Upper limit for the Kullback-Leibler (KL) divergence between the old policy and the current policy,
specified as a positive scalar.
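For example, a minimal sketch of tightening the trust region for more conservative policy updates (the
value is illustrative):

opt = rlTRPOAgentOptions('KLDivergenceLimit',0.005);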

NumIterationsConjugateGradient — Maximum number of iterations for conjugate
gradient descent
10 (default) | positive integer

Maximum number of iterations for conjugate gradient descent, specified as a positive integer.

NumIterationsLineSearch — Number of iterations for line search


10 (default) | positive integer

Number of iterations for line search, specified as a positive integer.


Typically, the default value works well for most cases.

ConjugateGradientResidualTolerance — Conjugate gradient residual tolerance factor


1e-8 (default) | positive scalar

Conjugate gradient residual tolerance, specified as a positive scalar. Once the residual for the
conjugate gradient algorithm is below this tolerance, the algorithm stops.

Typically, the default value works well for most cases.

NormalizedAdvantageMethod — Method for normalizing advantage function


"none" (default) | "current | "moving"

Method for normalizing advantage function values, specified as one of the following:

• "none" — Do not normalize advantage values


• "current" — Normalize the advantage function using the mean and standard deviation for the
current mini-batch of experiences.
• "moving" — Normalize the advantage function using the mean and standard deviation for a
moving window of recent experiences. To specify the window size, set the
AdvantageNormalizingWindow option.

In some environments, you can improve agent performance by normalizing the advantage function
during training. The agent normalizes the advantage function by subtracting the mean advantage
value and scaling by the standard deviation.

AdvantageNormalizingWindow — Window size for normalizing advantage function


1e6 (default) | positive integer

Window size for normalizing advantage function values, specified as a positive integer. Use this
option when the NormalizedAdvantageMethod option is "moving".
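For example, a minimal sketch of enabling moving-window advantage normalization (the window size is
illustrative):

opt = rlTRPOAgentOptions;
opt.NormalizedAdvantageMethod = "moving";
opt.AdvantageNormalizingWindow = 50000;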

CriticOptimizerOptions — Critic optimizer options


rlOptimizerOptions object

Critic optimizer options, specified as an rlOptimizerOptions object. It allows you to specify
training parameters of the critic approximator, such as the learning rate and gradient threshold, as
well as the optimizer algorithm and its parameters. For more information, see rlOptimizerOptions
and rlOptimizer.
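For example, a minimal sketch of configuring the critic optimizer (the values are illustrative):

opt = rlTRPOAgentOptions;
opt.CriticOptimizerOptions = rlOptimizerOptions( ...
    LearnRate=1e-3,GradientThreshold=1);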

SampleTime — Sample time of agent


1 (default) | positive scalar | -1

Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.

Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.

Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.


DiscountFactor — Discount factor


0.99 (default) | positive scalar less than or equal to 1

Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.

Object Functions
rlTRPOAgent Trust region policy optimization reinforcement learning agent

Examples

Create TRPO Agent Options Object

Create a TRPO agent options object, specifying the discount factor.

opt = rlTRPOAgentOptions('DiscountFactor',0.9)

opt =
rlTRPOAgentOptions with properties:

ExperienceHorizon: 512
MiniBatchSize: 128
EntropyLossWeight: 0.0100
NumEpoch: 3
AdvantageEstimateMethod: "gae"
GAEFactor: 0.9500
ConjugateGradientDamping: 0.1000
KLDivergenceLimit: 0.0100
NumIterationsConjugateGradient: 10
NumIterationsLineSearch: 10
ConjugateGradientResidualTolerance: 1.0000e-08
NormalizedAdvantageMethod: "none"
AdvantageNormalizingWindow: 1000000
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 1
DiscountFactor: 0.9000
InfoToSave: [1x1 struct]

You can modify options using dot notation. For example, set the agent sample time to 0.1.

opt.SampleTime = 0.1;

Version History
Introduced in R2021b

Simulation and deployment: UseDeterministicExploitation will be removed


Warns starting in R2022a

The property UseDeterministicExploitation of the rlTRPOAgentOptions object will be


removed in a future release. Use the UseExplorationPolicy property of rlTRPOAgent instead.


Previously, you set UseDeterministicExploitation as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.AgentOptions.UseDeterministicExploitation = true;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.AgentOptions.UseDeterministicExploitation = false;

Starting in R2022a, set UseExplorationPolicy as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.UseExplorationPolicy = false;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.UseExplorationPolicy = true;

Similarly to UseDeterministicExploitation, UseExplorationPolicy affects only simulation


and deployment; it does not affect training.

See Also
Topics
“Trust Region Policy Optimization Agents”


rlValueFunction
Value function approximator object for reinforcement learning agents

Description
This object implements a value function approximator object that you can use as a critic for a
reinforcement learning agent. A value function maps an environment state to a scalar value. The
output represents the predicted discounted cumulative long-term reward when the agent starts from
the given state and takes the best possible action. After you create an rlValueFunction critic, use
it to create an agent such as an rlACAgent, rlPGAgent, or rlPPOAgent agent. For an example of
this workflow, see “Create Actor and Critic Representations” on page 3-412. For more information on
creating value functions, see “Create Policies and Value Functions”.

Creation
Syntax
critic = rlValueFunction(net,observationInfo)
critic = rlValueFunction(net,observationInfo,ObservationInputNames=netObsNames)

critic = rlValueFunction(tab,observationInfo)

critic = rlValueFunction({basisFcn,W0},observationInfo)

critic = rlValueFunction( ___ ,UseDevice=useDevice)

Description

critic = rlValueFunction(net,observationInfo) creates the value-function object critic


from the deep neural network net and sets the ObservationInfo property of critic to the
observationInfo input argument. The network input layers are automatically associated with the
environment observation channels according to the dimension specifications in observationInfo.

critic = rlValueFunction(net,observationInfo,ObservationInputNames=netObsNames) specifies
the network input layer names to be associated with the environment observation channels. The function
assigns, in sequential order, each environment observation channel specified in observationInfo
to the layer specified by the corresponding name in the string array netObsNames. Therefore, the
network input layers, ordered as the names in netObsNames, must have the same data type and
dimensions as the observation channels, as ordered in observationInfo.

critic = rlValueFunction(tab,observationInfo) creates the value function object critic


with a discrete observation space, from the table tab, which is an rlTable object containing a
column array with as many elements as the number of possible observations. The function sets the
ObservationInfo property of critic to the observationInfo input argument, which in this case
must be a scalar rlFiniteSetSpec object.

critic = rlValueFunction({basisFcn,W0},observationInfo) creates the value function


object critic using a custom basis function as underlying approximator. The first input argument is


a two-element cell array whose first element is the handle basisFcn to a custom basis function and
whose second element is the initial weight vector W0. The function sets the ObservationInfo
property of critic to the observationInfo input argument.

critic = rlValueFunction( ___ ,UseDevice=useDevice) specifies the device used to perform


computations for the critic object, and sets the UseDevice property of critic to the useDevice
input argument. You can use this syntax with any of the previous input-argument combinations.

Input Arguments

net — Deep neural network


array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object |
dlNetwork object (preferred)

Deep neural network used as the underlying approximator within the critic, specified as one of the
following:

• Array of Layer objects


• layerGraph object
• DAGNetwork object
• SeriesNetwork object
• dlnetwork object

Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.

The network must have the environment observation channels as inputs and a single scalar as output.

rlValueFunction objects support recurrent deep neural networks.

The learnable parameters of the critic are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.

netObsNames — Network input layers names corresponding to the environment observation


channels
string array | cell array of character vectors

Network input layers names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. When you use this argument after
'ObservationInputNames', the function assigns, in sequential order, each environment
observation channel specified in observationInfo to each network input layer specified by the
corresponding name in the string array netObsNames. Therefore, the network input layers, ordered
as the names in netObsNames, must have the same data type and dimensions as the observation
specifications, as ordered in observationInfo.


Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.

Example: {"NetInput1_airspeed","NetInput2_altitude"}

tab — Value table


rlTable object

Value table, specified as an rlTable object containing a column vector with length equal to the
number of possible observations from the environment. Each element is the predicted discounted
cumulative long-term reward when the agent starts from the given observation and takes the best
possible action. The elements of this vector are the learnable parameters of the representation.

basisFcn — Custom basis function


function handle

Custom basis function, specified as a function handle to a user-defined function. The user defined
function can either be an anonymous function or a function on the MATLAB path. The output of the
critic is the scalar c = W'*B, where W is a weight vector containing the learnable parameters and B
is the column vector returned by the custom basis function.

Your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo.

For an example on how to use a basis function to create a value function critic with a mixed
continuous and discrete observation space, see “Create Mixed Observation Space Value Function
Critic from Custom Basis Function” on page 3-407.
Example: @(obs1,obs2,obs3) [obs3(1)*obs1(1)^2; abs(obs2(5)+obs1(2))]

W0 — Initial value of basis function weights


column vector

Initial value of the basis function weights W, specified as a column vector having the same length as
the vector returned by the basis function.

Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


containing a mix of such objects. Each element in the array defines the properties of an environment
observation channel, such as its dimensions, data type, and name. Note that only the data type and
dimension of a channel are used by the software to create actors or critics, but not its (optional) name
and description.

rlValueFunction sets the ObservationInfo property of critic to the input argument
observationInfo.


You can extract ObservationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually.

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations on a CPU.

To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"

Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and
actions
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic

Examples

Create Value Function Critic from Deep Neural Network

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.

obsInfo = rlNumericSpec([4 1]);

Create a deep neural network to approximate the value function within the critic, as a column vector
of layer objects. The network input layer must accept a four-element vector (the observation vector
defined by obsInfo), and the output must be a scalar (the value, representing the expected
cumulative long-term reward when the agent starts from the given observation).

You can also obtain the number of observations from the obsInfo specification (regardless of
whether the observation space is a column vector, row vector, or matrix,
prod(obsInfo.Dimension) returns the total number of observation elements, in this case 4).

net = [ featureInputLayer(prod(obsInfo.Dimension));
fullyConnectedLayer(10);
reluLayer;
fullyConnectedLayer(1,Name="value")];

Convert the network to a dlnetwork object.

dlnet = dlnetwork(net);

You can plot the network using plot and display its main characteristics, like the number of weights,
using summary.

plot(dlnet)


summary(dlnet)

Initialized: true

Number of learnables: 61

Inputs:
1 'input' 4 features

Create the critic using the network and the observation specification object. When you use this syntax
the network input layer is automatically associated with the environment observation according to
the dimension specifications in obsInfo.
critic = rlValueFunction(dlnet,obsInfo)

critic =
rlValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


UseDevice: "cpu"

To check your critic, use getValue to return the value of a random observation, using the current
network weights.
v = getValue(critic,{rand(obsInfo.Dimension)})

v = single
0.5196

You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).

Create Actor and Critic

Create an actor and a critic that you can use to define a reinforcement learning agent such as an
Actor Critic (AC) agent. For this example, create actor and critic for an agent that can be trained
against the cart-pole environment described in “Train AC Agent to Balance Cart-Pole System”.

First, create the environment. Then, extract the observation and action specifications from the
environment. You need these specifications to define the agent and critic.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

A state-value-function critic, such as those used for AC or PG agents, has the current observation as
input and the state value, a scalar, as output. For this example, to approximate the value function
within the critic, create a deep neural network with one output (the value) and four inputs (the
environment observation signals x, xdot, theta, and thetadot).

Create the network as a column vector of layer objects. You can obtain the number of observations
from the obsInfo specification (regardless of whether the observation space is a column vector, row
vector, or matrix, prod(obsInfo.Dimension) returns the total number of observation elements).
Name the network input layer criticNetInput.


criticNetwork = [
featureInputLayer(prod(obsInfo.Dimension),...
Name="criticNetInput");
fullyConnectedLayer(10);
reluLayer;
fullyConnectedLayer(1,Name="CriticFC")];

Convert the network to a dlnetwork object.

criticNetwork = dlnetwork(criticNetwork);

To display the network main characteristics, use summary.

summary(criticNetwork)

Initialized: true

Number of learnables: 61

Inputs:
1 'criticNetInput' 4 features

Create the critic using the specified neural network and the observation information. Set the
observation input name to criticNetInput, which is the name of the criticNetwork input layer.

critic = rlValueFunction(criticNetwork,obsInfo,...
ObservationInputNames={'criticNetInput'})

critic =
rlValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


UseDevice: "cpu"

Check your critic using getValue to return the value of a random observation, given the current
network weights.

v = getValue(critic,{rand(obsInfo.Dimension)})

v = single
0.5196

Specify the critic optimization options using rlOptimizerOptions. These options control the
learning of the critic network parameters. For this example, set the learning rate to 0.01 and the
gradient threshold to 1.

criticOpts = rlOptimizerOptions( ...


LearnRate=1e-2,...
GradientThreshold=1);

An AC agent decides which action to take given observations using a policy which is represented by
an actor. For an actor, the inputs are the environment observations, and the output depends on
whether the action space is discrete or continuous. The actor in this example has two possible
discrete actions, –10 or 10. To create the actor, use a deep neural network that can output these two
values given the same observation input as the critic.


Create the network as a column vector of two layer objects. You can obtain the number of actions
from the actInfo specification. Name the network output layer actorNetOutput.

actorNetwork = [
featureInputLayer( ...
prod(obsInfo.Dimension),...
Name="actorNetInput")
fullyConnectedLayer( ...
numel(actInfo.Elements), ...
Name="actorNetOutput")];

Convert the network to a dlnetwork object.

actorNetwork = dlnetwork(actorNetwork);

To display the network main characteristics, use summary.

summary(actorNetwork)

Initialized: true

Number of learnables: 10

Inputs:
1 'actorNetInput' 4 features

Create the actor using rlDiscreteCategoricalActor together with the observation and action
specifications, and the name of the network input layer to be associated with the environment
observation channel.

actor = rlDiscreteCategoricalActor(actorNetwork,obsInfo,actInfo,...
ObservationInputNames={'actorNetInput'})

actor =
rlDiscreteCategoricalActor with properties:

Distribution: [1x1 rl.distribution.rlDiscreteGaussianDistribution]


ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
UseDevice: "cpu"

To check your actor, use getAction to return a random action from a given observation, using the
current network weights.

a = getAction(actor,{rand(obsInfo.Dimension)})

a = 1x1 cell array


{[-10]}

Specify the actor optimization options using rlOptimizerOptions. These options control the
learning of the actor network parameters. For this example, set the learning rate to 0.05 and the
gradient threshold to 1.

actorOpts = rlOptimizerOptions( ...


LearnRate=5e-2,...
GradientThreshold=1);


Create an AC agent using the actor and critic. Use the optimizer options objects previously created
for both actor and critic.

agentOpts = rlACAgentOptions(...
NumStepsToLookAhead=32,...
DiscountFactor=0.99,...
CriticOptimizerOptions=criticOpts,...
ActorOptimizerOptions=actorOpts);
agent = rlACAgent(actor,critic,agentOpts)

agent =
rlACAgent with properties:

AgentOptions: [1x1 rl.option.rlACAgentOptions]


UseExplorationPolicy: 1
ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
SampleTime: 1

To check your agent, use getAction to return a random action from a given observation, using the
current actor and critic network weights.

act = getAction(agent,{rand(obsInfo.Dimension)})

act = 1x1 cell array


{[-10]}

For additional examples showing how to create actors and critics for different agent types, see:

• “Train DDPG Agent to Control Double Integrator System”


• “Train DQN Agent to Balance Cart-Pole System”

Create Value Function Critic from Table

Create a finite-set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set consisting of four possible values: 1, 3, 5, and 7.

obsInfo = rlFiniteSetSpec([1 3 5 7]);

Create a table to approximate the value function within the critic.

vTable = rlTable(obsInfo);

The table is a column vector in which each entry stores the predicted cumulative long-term reward
for each possible observation as defined by obsInfo. You can access the table using the Table
property of the vTable object. The initial value of each element is zero.

vTable.Table

ans = 4×1

0
0
0
0

You can also initialize the table to any value, in this case, an array containing all the integers from 1
to 4.

vTable.Table = reshape(1:4,4,1)

vTable =
rlTable with properties:

Table: [4x1 double]

Create the critic using the table and the observation specification object.

critic = rlValueFunction(vTable,obsInfo)

critic =
rlValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlFiniteSetSpec]


UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a given observation, using the
current table entries.

v = getValue(critic,{7})

v = 4

You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).

Create Value Function Critic from Custom Basis Function

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.

obsInfo = rlNumericSpec([4 1]);

Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.

myBasisFcn = @(myobs) [myobs(2)^2; myobs(3)+exp(myobs(1)); abs(myobs(4))]

myBasisFcn = function_handle with value:


@(myobs)[myobs(2)^2;myobs(3)+exp(myobs(1));abs(myobs(4))]


The output of the critic is the scalar W'*myBasisFcn(myobs), where W is a weight column vector
which must have the same size as the custom basis function output. This output is the expected
cumulative long-term reward when the agent starts from the given observation and takes the best
possible action. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = [3;5;2];

Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second argument is the observation specification object.

critic = rlValueFunction({myBasisFcn,W0},obsInfo)

critic =
rlValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a given observation, using the
current parameter vector.

v = getValue(critic,{[2 4 6 8]'})

v = 130.9453

You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).

Create Value Function Critic from Recurrent Neural Network

Create an environment and obtain observation and action information.

env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);

To approximate the value function within the critic, create a recurrent deep neural network as a
column vector of layer objects. Use a sequenceInputLayer as the input layer
(obsInfo.Dimension(1) is the dimension of the observation space) and include at least one
lstmLayer.

myNet = [
sequenceInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(8, Name="fc")
reluLayer(Name="relu")
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(1,Name="output")];

Convert the network to a dlnetwork object.

dlCriticNet = dlnetwork(myNet);

Display a summary of network characteristics.


summary(dlCriticNet)

Initialized: true

Number of learnables: 593

Inputs:
1 'sequenceinput' Sequence input with 4 dimensions

Create a value function representation object for the critic.

critic = rlValueFunction(dlCriticNet,obsInfo)

critic =
rlValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]


UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a random observation, using
the current network weights.

v = getValue(critic,{rand(obsInfo.Dimension)})

v = single
0.0017

You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).

Create Mixed Observation Space Value Function Critic from Custom Basis Function

Create a mixed observation specification array, composed of a finite-set channel and a continuous
channel (or alternatively use getObservationInfo to extract the specification objects from an
environment). For this example, the first channel carries a single scalar that can be 7, 5, 3, or 1, and
the second channel is a vector over a continuous three-dimensional space.

obsInfo = [rlFiniteSetSpec([7 5 3 1]) rlNumericSpec([3 1])];

Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo (in this case, a scalar from the first channel and a three-element vector from the
second channel).

myBasisFcn = @(obsA,obsB) [ obsA(1) + norm(obsB);


obsA(1) - norm(obsB);
obsA(1)^2 + obsB(3);
obsA(1)^2 - obsB(3)];

The output of the critic is the scalar W'*myBasisFcn(obsA,obsB), where W is a weight column
vector which must have the same size as the custom basis function output. This output is the expected
cumulative long-term reward when the agent starts from the given observation and takes the best
possible action. The elements of W are the learnable parameters.

Define an initial parameter vector.


W0 = ones(4,1);

Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second argument is the observation specification object.

critic = rlValueFunction({myBasisFcn,W0},obsInfo)

critic =
rlValueFunction with properties:

ObservationInfo: [2x1 rl.util.RLDataSpec]


UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a given observation, using the
current parameter vector.

v = getValue(critic,{5,[0.1 0.1 0.1]'})

v = 60

Note that the critic does not enforce the set constraint for the discrete set element.

v = getValue(critic,{-3,[0.1 0.1 0.1]'})

v = 12

You can now use the critic (along with an actor) to create an agent relying on a value
function critic (such as rlACAgent or rlPGAgent).

Version History
Introduced in R2022a

See Also
Functions
rlQValueFunction | rlVectorQValueFunction | rlTable | getActionInfo |
getObservationInfo

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlValueRepresentation
(Not recommended) Value function critic representation for reinforcement learning agents

Note rlValueRepresentation is not recommended. Use rlValueFunction instead. For more


information, see “rlValueRepresentation is not recommended”.
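As a migration sketch (assuming net, obsInfo, and a network input layer named 'myobs', as in the first
example below), the recommended replacement is:

% Not recommended:
% critic = rlValueRepresentation(net,obsInfo,'Observation',{'myobs'});

% Recommended:
critic = rlValueFunction(net,obsInfo,ObservationInputNames={'myobs'});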

Description
This object implements a value function approximator to be used as a critic within a reinforcement
learning agent. A value function is a function that maps an observation to a scalar value. The output
represents the expected total long-term reward when the agent starts from the given observation and
takes the best possible action. Value function critics therefore only need observations (but not
actions) as inputs. After you create an rlValueRepresentation critic, use it to create an agent
relying on a value function critic, such as an rlACAgent, rlPGAgent, or rlPPOAgent. For an
example of this workflow, see “Create Actor and Critic Representations” on page 3-412. For more
information on creating representations, see “Create Policies and Value Functions”.

Creation

Syntax
critic = rlValueRepresentation(net,observationInfo,'Observation',obsName)
critic = rlValueRepresentation(tab,observationInfo)
critic = rlValueRepresentation({basisFcn,W0},observationInfo)
critic = rlValueRepresentation( ___ ,options)

Description

critic = rlValueRepresentation(net,observationInfo,'Observation',obsName)
creates the value function based critic from the deep neural network net. This syntax sets the
ObservationInfo property of critic to the input observationInfo. obsName must contain the
names of the input layers of net.

critic = rlValueRepresentation(tab,observationInfo) creates the value function based


critic with a discrete observation space, from the value table tab, which is an rlTable object
containing a column array with as many elements as the possible observations. This syntax sets the
ObservationInfo property of critic to the input observationInfo.

critic = rlValueRepresentation({basisFcn,W0},observationInfo) creates the value


function based critic using a custom basis function as underlying approximator. The first input
argument is a two-element cell array in which the first element contains the handle basisFcn to a custom
basis function, and the second element contains the initial weight vector W0. This syntax sets the
ObservationInfo property of critic to the input observationInfo.

critic = rlValueRepresentation( ___ ,options) creates the value function based critic
using the additional option set options, which is an rlRepresentationOptions object. This
syntax sets the Options property of critic to the options input argument. You can use this syntax
with any of the previous input-argument combinations.

Input Arguments

net — Deep neural network


array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object |
dlNetwork object

Deep neural network used as the underlying approximator within the critic, specified as one of the
following:

• Array of Layer objects


• layerGraph object
• DAGNetwork object
• SeriesNetwork object
• dlnetwork object

The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names listed in obsName.

rlValueRepresentation objects support recurrent deep neural networks.

For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.

obsName — Observation names


string | character vector | cell array of character vectors

Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the input layers in net. These network layers must be in the same order and
with the same data type and dimensions as the signals defined in ObservationInfo.
Example: {'my_obs'}

tab — Value table


rlTable object

Value table, specified as an rlTable object containing a column vector with length equal to the
number of possible observations. Element i is the expected cumulative long-term reward when the agent
starts from the i-th observation and takes the best possible action. The elements of this vector are
the learnable parameters of the representation.

basisFcn — Custom basis function


function handle

Custom basis function, specified as a function handle to a user-defined function. The user defined
function can either be an anonymous function or a function on the MATLAB path. The output of the
critic is c = W'*B, where W is a weight vector and B is the column vector returned by the custom
basis function. c is the expected cumulative long term reward when the agent starts from the given
observation and takes the best possible action. The learnable parameters of this representation are
the elements of W.


When creating a value function critic representation, your basis function must have the following
signature.
B = myBasisFunction(obs1,obs2,...,obsN)

Here obs1 to obsN are observations in the same order and with the same data type and dimensions
as the signals defined in ObservationInfo.
Example: @(obs1,obs2,obs3) [obs3(1)*obs1(1)^2; abs(obs2(5)+obs1(2))]

W0 — Initial value of the basis function weights


column vector

Initial value of the basis function weights, W, specified as a column vector having the same length as
the vector returned by the basis function.

Properties
Options — Representation options
rlRepresentationOptions object

Representation options, specified as an rlRepresentationOptions object. Available options


include the optimizer used for training and the learning rate.

ObservationInfo — Observation specifications


rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array


containing a mix of such objects. These objects define properties such as the dimensions, data types,
and names of the observation signals.

rlValueRepresentation sets the ObservationInfo property of critic to the input


observationInfo.

You can extract ObservationInfo from an existing environment or agent using


getObservationInfo. You can also construct the specifications manually.

Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and actions

Examples

Create Value Function Critic from Deep Neural Network

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing 4
doubles.
obsInfo = rlNumericSpec([4 1]);


Create a deep neural network to approximate the value function within the critic. The input of the
network (here called myobs) must accept a four-element vector (the observation vector defined by
obsInfo), and the output must be a scalar (the value, representing the expected cumulative long-
term reward when the agent starts from the given observation).

net = [featureInputLayer(4, 'Normalization','none','Name','myobs')
    fullyConnectedLayer(1,'Name','value')];

Create the critic using the network, observation specification object, and name of the network input
layer.

critic = rlValueRepresentation(net,obsInfo,'Observation',{'myobs'})

critic =
rlValueRepresentation with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a random observation, using
the current network weights.

v = getValue(critic,{rand(4,1)})

v = single
0.7904

You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).

Create Actor and Critic Representations

Create an actor representation and a critic representation that you can use to define a reinforcement
learning agent such as an Actor Critic (AC) agent.

For this example, create actor and critic representations for an agent that can be trained against the
cart-pole environment described in “Train AC Agent to Balance Cart-Pole System”. First, create the
environment. Then, extract the observation and action specifications from the environment. You need
these specifications to define the agent and critic representations.

env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

For a state-value-function critic such as those used for AC or PG agents, the inputs are the
observations and the output should be a scalar value, the state value. For this example, create the
critic representation using a deep neural network with one output, and with observation signals
corresponding to x, xdot, theta, and thetadot as described in “Train AC Agent to Balance Cart-
Pole System”. You can obtain the number of observations from the obsInfo specification. Name the
network layer input 'observation'.

numObservation = obsInfo.Dimension(1);
criticNetwork = [
    featureInputLayer(numObservation,'Normalization','none','Name','observation')
    fullyConnectedLayer(1,'Name','CriticFC')];

Specify options for the critic representation using rlRepresentationOptions. These options
control the learning of the critic network parameters. For this example, set the learning rate to 0.05
and the gradient threshold to 1.

repOpts = rlRepresentationOptions('LearnRate',5e-2,'GradientThreshold',1);

Create the critic representation using the specified neural network and options. Also, specify the
observation information for the critic. Set the observation name to 'observation',
which is the name of the criticNetwork input layer.

critic = rlValueRepresentation(criticNetwork,obsInfo,'Observation',{'observation'},repOpts)

critic =
rlValueRepresentation with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

Similarly, create a network for the actor. An AC agent decides which action to take given observations
using an actor representation. For an actor, the inputs are the observations, and the output depends
on whether the action space is discrete or continuous. For the actor of this example, there are two
possible discrete actions, –10 or 10. To create the actor, use a deep neural network with the same
observation input as the critic, that can output these two values. You can obtain the number of
actions from the actInfo specification. Name the output 'action'.

numAction = numel(actInfo.Elements);
actorNetwork = [
featureInputLayer(numObservation,'Normalization','none','Name','observation')
fullyConnectedLayer(numAction,'Name','action')];

Create the actor representation using the observation name and specification and the same
representation options.

actor = rlStochasticActorRepresentation(actorNetwork,obsInfo,actInfo,...
'Observation',{'observation'},repOpts)

actor =
rlStochasticActorRepresentation with properties:

ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

Create an AC agent using the actor and critic representations.

agentOpts = rlACAgentOptions(...
'NumStepsToLookAhead',32,...
'DiscountFactor',0.99);
agent = rlACAgent(actor,critic,agentOpts)

agent =
rlACAgent with properties:


AgentOptions: [1x1 rl.option.rlACAgentOptions]

For additional examples showing how to create actor and critic representations for different agent
types, see:

• “Train DDPG Agent to Control Double Integrator System”


• “Train DQN Agent to Balance Cart-Pole System”

Create Value Function Critic from Table

Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set consisting of 4 possible values.
obsInfo = rlFiniteSetSpec([1 3 5 7]);

Create a table to approximate the value function within the critic.


vTable = rlTable(obsInfo);

The table is a column vector in which each entry stores the expected cumulative long-term reward for
each possible observation as defined by obsInfo. You can access the table using the Table property
of the vTable object. The initial value of each element is zero.
vTable.Table

ans = 4×1

0
0
0
0

You can also initialize the table to any value, in this case, an array containing all the integers from 1
to 4.
vTable.Table = reshape(1:4,4,1)

vTable =
rlTable with properties:

Table: [4x1 double]

Create the critic using the table and the observation specification object.
critic = rlValueRepresentation(vTable,obsInfo)

critic =
rlValueRepresentation with properties:

ObservationInfo: [1x1 rl.util.rlFiniteSetSpec]
Options: [1x1 rl.option.rlRepresentationOptions]


To check your critic, use the getValue function to return the value of a given observation, using the
current table entries.

v = getValue(critic,{7})

v = 4

You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent agent).

Create Value Function Critic from Custom Basis Function

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing 4
doubles.

obsInfo = rlNumericSpec([4 1]);

Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.

myBasisFcn = @(myobs) [myobs(2)^2; myobs(3)+exp(myobs(1)); abs(myobs(4))]

myBasisFcn = function_handle with value:


@(myobs)[myobs(2)^2;myobs(3)+exp(myobs(1));abs(myobs(4))]

The output of the critic is the scalar W'*myBasisFcn(myobs), where W is a weight column vector
that must have the same size as the custom basis function output. This output is the expected
cumulative long term reward when the agent starts from the given observation and takes the best
possible action. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = [3;5;2];

Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second argument is the observation specification object.

critic = rlValueRepresentation({myBasisFcn,W0},obsInfo)

critic =
rlValueRepresentation with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]
Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a given observation, using the
current parameter vector.

v = getValue(critic,{[2 4 6 8]'})


v =
1x1 dlarray

130.9453

You can now use the critic (along with an actor) to create an agent relying on a value function
critic (such as rlACAgent or rlPGAgent).

Create Value Function Critic from Recurrent Neural Network

Create an environment and obtain observation and action information.

env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);

Create a recurrent deep neural network for the critic. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.

criticNetwork = [
sequenceInputLayer(numObs,'Normalization','none','Name','state')
fullyConnectedLayer(8, 'Name','fc')
reluLayer('Name','relu')
lstmLayer(8,'OutputMode','sequence','Name','lstm')
fullyConnectedLayer(1,'Name','output')];

Create a value function representation object for the critic.

criticOptions = rlRepresentationOptions('LearnRate',1e-2,'GradientThreshold',1);
critic = rlValueRepresentation(criticNetwork,obsInfo,...
'Observation','state',criticOptions);

Version History
Introduced in R2020a

rlValueRepresentation is not recommended


Not recommended starting in R2022a

rlValueRepresentation is not recommended. Use rlValueFunction instead.

The following table shows some typical uses of rlValueRepresentation, and how to update your
code with rlValueFunction instead. Each entry relates to a different approximator object: the first
uses a neural network, the second uses a table, and the third uses a basis function.


• Not recommended: myCritic = rlValueRepresentation(net,obsInfo,'Observation',obsNames), where the
  neural network net has observations as inputs and a single scalar output.
  Recommended: myCritic = rlValueFunction(net,obsInfo,'ObservationInputNames',obsNames). Use this
  syntax to create a state value function object for a critic that does not require action input.

• Not recommended: rep = rlValueRepresentation(tab,obsInfo), where the table tab contains a column
  vector with as many elements as the number of possible observations.
  Recommended: rep = rlValueFunction(tab,obsInfo). Use this syntax to create a value function object
  for a critic that does not require action input.

• Not recommended: rep = rlValueRepresentation({basisFcn,W0},obsInfo), where the basis function has
  only observations as inputs and W0 is a column vector.
  Recommended: rep = rlValueFunction({basisFcn,W0},obsInfo). Use this syntax to create a value
  function object for a critic that does not require action input.
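As a minimal migration sketch (reusing the net, obsInfo, and 'myobs' layer name from the first example
on this page), the neural-network case becomes:

% Not recommended (deprecated):
% critic = rlValueRepresentation(net,obsInfo,'Observation',{'myobs'});
% Recommended:
critic = rlValueFunction(net,obsInfo,ObservationInputNames="myobs");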

See Also
Functions
rlValueFunction | getActionInfo | getObservationInfo

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


rlVectorQValueFunction
Vector Q-value function approximator for reinforcement learning agents

Description
This object implements a vector Q-value function approximator that you can use as a critic with a
discrete action space for a reinforcement learning agent. A vector Q-value function is a function that
maps an environment state to a vector in which each element represents the predicted discounted
cumulative long-term reward when the agent starts from the given state and executes the action
corresponding to the element number. A Q-value function critic therefore needs only the environment
state as input. After you create an rlVectorQValueFunction critic, use it to create an agent such
as rlQAgent, rlDQNAgent, or rlSARSAAgent. For more information
on creating representations, see “Create Policies and Value Functions”.

Creation

Syntax
critic = rlVectorQValueFunction(net,observationInfo,actionInfo)
critic = rlVectorQValueFunction(net,observationInfo,actionInfo,ObservationInputNames=netObsNames)

critic = rlVectorQValueFunction({basisFcn,W0},observationInfo,actionInfo)

critic = rlVectorQValueFunction( ___ ,UseDevice=useDevice)

Description

critic = rlVectorQValueFunction(net,observationInfo,actionInfo) creates the multi-output
Q-value function critic with a discrete action space. Here, net is the deep neural network
used as an approximator, and must have only the observations as input and a single output layer
having as many elements as the number of possible discrete actions. The network input layers are
automatically associated with the environment observation channels according to the dimension
specifications in observationInfo. This function sets the ObservationInfo and ActionInfo
properties of critic to the observationInfo and actionInfo input arguments, respectively.

critic = rlVectorQValueFunction(net,observationInfo,actionInfo,ObservationInputNames=
netObsNames) specifies the names of the network input layers to be associated with the
environment observation channels. The function assigns, in sequential order, each environment
observation channel specified in observationInfo to the layer specified by the corresponding name
in the string array netObsNames. Therefore, the network input layers, ordered as the names in
netObsNames, must have the same data type and dimensions as the observation channels, as ordered
in observationInfo.

critic = rlVectorQValueFunction({basisFcn,W0},observationInfo,actionInfo)
creates the multi-output Q-value function critic with a discrete action space using a custom basis
function as underlying approximator. The first input argument is a two-element cell array whose first
element is the handle basisFcn to a custom basis function and whose second element is the initial
weight matrix W0. Here the basis function must have only the observations as inputs, and W0 must
have as many columns as the number of possible actions. The function sets the ObservationInfo and
ActionInfo properties of critic to the input arguments observationInfo and actionInfo,
respectively.

critic = rlVectorQValueFunction( ___ ,UseDevice=useDevice) specifies the device used
to perform computations for the critic object, and sets the UseDevice property of critic to the
useDevice input argument. You can use this syntax with any of the previous input-argument
combinations.

Input Arguments

net — Deep neural network
array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object | dlnetwork object (preferred)

Deep neural network used as the underlying approximator within the critic, specified as one of the
following:

• Array of Layer objects
• layerGraph object
• DAGNetwork object
• SeriesNetwork object
• dlnetwork object

Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.

The network must have only the observation channels as inputs and a single output layer having as
many elements as the number of possible discrete actions. Each element of the output vector
approximates the value of executing the corresponding action starting from the currently observed
state.

rlVectorQValueFunction objects support recurrent deep neural networks.

The learnable parameters of the critic are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.

netObsNames — Network input layer names corresponding to the environment observation channels
string array | cell array of character vectors

Network input layer names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. When you use this argument after
'ObservationInputNames', the function assigns, in sequential order, each environment
observation channel specified in observationInfo to each network input layer specified by the
corresponding name in the string array netObsNames. Therefore, the network input layers, ordered
as the names in netObsNames, must have the same data type and dimensions as the observation
specifications, as ordered in observationInfo.

Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.

Example: {"NetInput1_airspeed","NetInput2_altitude"}

basisFcn — Custom basis function


function handle

Custom basis function, specified as a function handle to a user-defined MATLAB function. The
user-defined function can either be an anonymous function or a function on the MATLAB path. The output
of the critic is the vector c = W'*B, where W is a matrix containing the learnable parameters, and B
is the column vector returned by the custom basis function. Each element of c approximates the
value of executing the corresponding action from the observed state.

Your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
channels defined in observationInfo.
Example: @(obs1,obs2) [obs1(1)^2; abs(obs2(5))]

W0 — Initial value of basis function weights


matrix

Initial value of the basis function weights W, specified as a matrix having as many rows as the length
of the basis function output vector and as many columns as the number of possible actions.

Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array
containing a mix of such objects. Each element in the array defines the properties of an environment
observation channel, such as its dimensions, data type, and name. Note that only the data type and
dimension of a channel are used by the software to create actors or critics, but not its (optional)
name.

rlVectorQValueFunction sets the ObservationInfo property of critic to the input argument
observationInfo.

You can extract ObservationInfo from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually.


ActionInfo — Action specifications


rlFiniteSetSpec object

Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the function
does not use the name of the action channel specified in actionInfo.

Note Only one action channel is allowed.

rlVectorQValueFunction sets the ActionInfo property of critic to the input actionInfo.

You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.

UseDevice — Computation device used for training and simulation


"cpu" (default) | "gpu"

Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.

Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations on a CPU.

To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
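As a sketch of the parallel alternative mentioned above (the option shown is the only assumption):

trainOpts = rlTrainingOptions(UseParallel=true);   % run parallel simulations on multiple cores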

Object Functions
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlQAgent Q-learning reinforcement learning agent
rlSARSAAgent SARSA reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and
actions
getMaxQValue Obtain maximum estimated value over all possible actions from a Q-value
function critic with discrete action space, given environment observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data


accelerate Option to accelerate computation of gradient for approximator object based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic

Examples

Create Multi-Output Q-Value Function Critic from Deep Neural Network

This example shows how to create a vector Q-value function critic for a discrete action space using a
deep neural network approximator.

This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.

obsInfo = rlNumericSpec([4 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).

actInfo = rlFiniteSetSpec([7 5 3]);

To approximate the Q-value function within the critic, use a deep neural network. The input of the
network must accept a four-element vector, as defined by obsInfo. The output must be a single
output layer having as many elements as the number of possible discrete actions (three in this case,
as defined by actInfo).

Create a neural network as a row vector of layer objects.

net = [featureInputLayer(4)
fullyConnectedLayer(3)];

Convert the network to a dlnetwork object.

net = dlnetwork(net);

Summarize the network properties.

summary(net)

Initialized: true


Number of learnables: 15

Inputs:
1 'input' 4 features

Create the critic using the network, as well as the observation and action specification objects. The
network input layers are automatically associated with the components of the observation signals
according to the dimension specifications in obsInfo.

critic = rlVectorQValueFunction(net,obsInfo,actInfo)

critic =
rlVectorQValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
UseDevice: "cpu"

To check your critic, use getValue to return the values of a random observation, using the current
network weights. There is one value for each of the three possible actions.

v = getValue(critic,{rand(obsInfo.Dimension)})

v = 3x1 single column vector

0.7232
0.8177
-0.2212

You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).

Create Multi-Output Q-Value Function Critic from Deep Neural Network Specifying Layer
Names

A vector Q-value function critic takes only the observation as input and produces as output a vector
with as many elements as the possible actions. Each element represents the expected cumulative
long term reward when the agent starts from the given observation and takes the action
corresponding to the position of the element in the output vector.

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.

obsInfo = rlNumericSpec([4 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).

actInfo = rlFiniteSetSpec([7 5 3]);


To approximate the Q-value function within the critic, use a deep neural network. The input of the
network must accept a four-element vector, as defined by obsInfo. The output must be a single
output layer having as many elements as the number of possible discrete actions (three in this case,
as defined by actInfo).

Create the network as an array of layer objects. Name the network input netObsIn (so you can later
explicitly associate it with the observation input channel).

net = [featureInputLayer(4,Name="netObsIn")
fullyConnectedLayer(3,Name="value")];

Convert the network to a dlnetwork object and display the number of its learnable parameters.

net = dlnetwork(net)

net =
dlnetwork with properties:

Layers: [2x1 nnet.cnn.layer.Layer]
Connections: [1x2 table]
Learnables: [2x3 table]
State: [0x3 table]
InputNames: {'netObsIn'}
OutputNames: {'value'}
Initialized: 1

View summary with summary.

summary(net)

Initialized: true

Number of learnables: 15

Inputs:
1 'netObsIn' 4 features

Create the critic using the network, the observations specification object, and the name of the
network input layer. The specified network input layer, netObsIn, is associated with the environment
observation, and therefore must have the same data type and dimension as the observation channel
specified in obsInfo.

critic = rlVectorQValueFunction(net, ...
    obsInfo,actInfo, ...
    ObservationInputNames="netObsIn")

critic =
rlVectorQValueFunction with properties:

ObservationInfo: [1x1 rl.util.rlNumericSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
UseDevice: "cpu"

To check your critic, use the getValue function to return the values of a random observation, using
the current network weights. The function returns one value for each of the three possible actions.


v = getValue(critic, ...
{rand(obsInfo.Dimension)})

v = 3x1 single column vector

0.7232
0.8177
-0.2212

You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).

Create Multi-Output Q-Value Function Critic from Custom Basis Function

This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.

Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels: the first is a two-by-two continuous matrix, and the second is a scalar
that can assume only two values, 0 and 1.

obsInfo = [rlNumericSpec([2 2])
    rlFiniteSetSpec([0 1])];

Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible vectors, [1 2], [3 4], and [5 6].

actInfo = rlFiniteSetSpec({[1 2],[3 4],[5 6]});

Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.

myBasisFcn = @(obsA,obsB) [obsA(1,1)+obsB(1)^2;
    obsA(2,1)-obsB(1)^2;
    obsA(1,2)^2+obsB(1);
    obsA(2,2)^2-obsB(1)];

The output of the critic is the vector c = W'*myBasisFcn(obsA,obsB), where W is a weight matrix
which must have as many rows as the length of the basis function output and as many columns as the
number of possible actions.

Each element of c is the expected cumulative long term reward when the agent starts from the given
observation and takes the action corresponding to the position of the considered element. The
elements of W are the learnable parameters.

Define an initial parameter matrix.

W0 = rand(4,3);


Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.

critic = rlVectorQValueFunction({myBasisFcn,W0},obsInfo,actInfo)

critic =
rlVectorQValueFunction with properties:

ObservationInfo: [2x1 rl.util.RLDataSpec]
ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
UseDevice: "cpu"

To check your critic, use the getValue function to return the values of a random observation, using
the current parameter matrix. The function returns one value for each of the three possible actions.

v = getValue(critic,{rand(2,2),0})

v = 3×1

1.3192
0.8420
1.5053

Note that the critic does not enforce the set constraint for the discrete set elements.

v = getValue(critic,{rand(2,2),-1})

v = 3×1

2.7890
1.8375
3.0855

You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).

Version History
Introduced in R2022a

See Also
Functions
rlValueFunction | rlQValueFunction | getActionInfo | getObservationInfo

Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”


scalingLayer
Scaling layer for actor or critic network

Description
A scaling layer linearly scales and biases an input array U, giving an output Y = Scale.*U + Bias.
You can incorporate this layer into the deep neural networks you define for actors or critics in
reinforcement learning agents. This layer is useful for scaling and shifting the outputs of nonlinear
layers, such as tanhLayer and sigmoidLayer.

For instance, a tanhLayer gives bounded output that falls between –1 and 1. If your actor network
output has different bounds (as defined in the action specification), you can include a ScalingLayer
as an output to scale and shift the actor network output appropriately.

The parameters of a ScalingLayer object are not learnable.
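For instance, the following sketch (the layer sizes and the [-2, 2] action range are assumptions for
illustration) maps the [-1, 1] output of a tanhLayer onto a [-2, 2] action range:

actorLayers = [
    featureInputLayer(3,'Name','obsIn')
    fullyConnectedLayer(16,'Name','fc')
    tanhLayer('Name','tanh')                        % output bounded in [-1, 1]
    scalingLayer('Name','actScale','Scale',2)];     % rescaled to [-2, 2]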

Creation

Syntax
sLayer = scalingLayer
sLayer = scalingLayer(Name,Value)

Description

sLayer = scalingLayer creates a scaling layer with default property values.

sLayer = scalingLayer(Name,Value) sets properties using name-value pairs. For
example, scalingLayer('Scale',0.5) creates a scaling layer that scales its input by 0.5. Enclose
each property name in quotes.

Properties
Name — Name of layer
'scaling' (default) | character vector

Name of layer, specified as a character vector. To include a layer in a layer graph, you must specify a
nonempty unique layer name. If you train a series network with this layer and Name is set to '', then
the software automatically assigns a name to the layer at training time.

Description — Description of layer


'Scaling layer' (default) | character vector

This property is read-only.

Description of layer, specified as a character vector. When you create the scaling layer, you can use
this property to give it a description that helps you identify its purpose.


Scale — Element-wise scale on input


1 (default) | scalar | array

Element-wise scale on the input to the scaling layer, specified as one of the following:

• Scalar — Specify the same scale factor for all elements of the input array.
• Array with the same dimensions as the input array — Specify different scale factors for each
element of the input array.

The scaling layer takes an input U and generates the output Y = Scale.*U + Bias.

Bias — Element-wise bias on input


0 (default) | scalar | array

Element-wise bias on the input to the scaling layer, specified as one of the following:

• Scalar — Specify the same bias for all elements of the input array.
• Array with the same dimensions as the input array — Specify a different bias for each element of
the input array.

The scaling layer takes an input U and generates the output Y = Scale.*U + Bias.

Examples

Create Scaling Layer

Create a scaling layer that converts an input array U to the output array Y = 0.1.*U - 0.4.

sLayer = scalingLayer('Scale',0.1,'Bias',-0.4)

sLayer =
ScalingLayer with properties:

Name: 'scaling'
Scale: 0.1000
Bias: -0.4000

Learnable Parameters
No properties.

State Parameters
No properties.

Show all properties

Confirm that the scaling layer scales and offsets an input array as expected.

predict(sLayer,[10,20,30])

ans = 1×3

0.6000 1.6000 2.6000


You can incorporate sLayer into an actor network or critic network for reinforcement learning.

Specify Different Scale and Bias for Each Input Element

Assume that the layer preceding the scalingLayer is a tanhLayer with three outputs aligned
along the first dimension, and that you want to apply a different scaling factor and bias to each
output using a scalingLayer.

scale = [2.5 0.4 10]';
bias = [5 0 -50]';

Create the scalingLayer object.

sLayer = scalingLayer('Scale',scale,'Bias',bias);

Confirm that the scaling layer applies the correct scale and bias values to an array with the expected
dimensions.

testData = [10 10 10]';
predict(sLayer,testData)

ans = 3×1

30
4
50

Version History
Introduced in R2019a

Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

GPU Code Generation


Generate CUDA® code for NVIDIA® GPUs using GPU Coder™.

See Also
quadraticLayer | softplusLayer

Topics
“Train DDPG Agent to Swing Up and Balance Pendulum”
“Create Policies and Value Functions”


SimulinkEnvWithAgent
Reinforcement learning environment with a dynamic model implemented in Simulink

Description
The SimulinkEnvWithAgent object represents a reinforcement learning environment that uses a
dynamic model implemented in Simulink. The environment object acts as an interface such that when
you call sim or train, these functions in turn call the Simulink model to generate experiences for
the agents.

Creation
To create a SimulinkEnvWithAgent object, use one of the following functions.

• rlSimulinkEnv — Create an environment using a Simulink model with at least one RL Agent
block.
• createIntegratedEnv — Use a reference model as a reinforcement learning environment.
• rlPredefinedEnv — Create a predefined reinforcement learning environment.

Properties
Model — Simulink model name
string | character vector

Simulink model name, specified as a string or character vector. The specified model must contain one
or more RL Agent blocks.

AgentBlock — Agent block paths


string | string array

Agent block paths, specified as a string or string array.

If Model contains a single RL Agent block for training, then AgentBlock is a string containing the
block path.

If Model contains multiple RL Agent blocks for training, then AgentBlock is a string array, where
each element contains the path of one agent block.

Model can contain RL Agent blocks whose path is not included in AgentBlock. Such agent blocks
behave as part of the environment and select actions based on their current policies. When you call
sim or train, the experiences of these agents are not returned and their policies are not updated.

The agent blocks can be inside of a model reference. For more information on configuring an agent
block for reinforcement learning, see RL Agent.

ResetFcn — Reset behavior for environment


function handle | anonymous function handle


Reset behavior for the environment, specified as a function handle or anonymous function handle.
The function must have a single Simulink.SimulationInput input argument and a single
Simulink.SimulationInput output argument.

The reset function sets the initial state of the Simulink environment. For example, you can create a
reset function that randomizes certain block states such that each training episode begins from
different initial conditions.

If you have an existing reset function myResetFunction on the MATLAB path, set ResetFcn using a
handle to the function.
env.ResetFcn = @(in)myResetFunction(in);
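A sketch of such a reset function (the variable name theta0 and the randomization range are
assumptions for illustration) might look like this:

function in = myResetFunction(in)
    % Randomize the initial angle so that each episode starts from a different state.
    theta0 = pi*(2*rand-1);
    in = setVariable(in,'theta0',theta0);
end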

If your reset behavior is simple, you can implement it using an anonymous function handle. For
example, the following code sets the variable x0 to a random value.
env.ResetFcn = @(in) setVariable(in,'x0',rand());

The sim function calls the reset function to reset the environment at the start of each simulation, and
the train function calls it at the start of each training episode.

UseFastRestart — Option to toggle fast restart


"on" (default) | "off"

Option to toggle fast restart, specified as either "on" or "off". Fast restart allows you to perform
iterative simulations without compiling a model or terminating the simulation each time.

For more information on fast restart, see “How Fast Restart Improves Iterative Simulations”
(Simulink).

Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified environment
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer

Examples

Create Simulink Environment Using Agent in Workspace

Create a Simulink environment using the trained agent and corresponding Simulink model from the
“Create Simulink Environment and Train Agent” example.

Load the agent in the MATLAB® workspace.


load rlWaterTankDDPGAgent

Create an environment for the rlwatertank model, which contains an RL Agent block. Since the
agent used by the block is already in the workspace, you do not need to pass the observation and
action specifications to create the environment.
env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent')


env =
SimulinkEnvWithAgent with properties:

Model : rlwatertank
AgentBlock : rlwatertank/RL Agent
ResetFcn : []
UseFastRestart : on

Validate the environment by performing a short simulation for two sample times.

validateEnvironment(env)

You can now train and simulate the agent within the environment by using train and sim,
respectively.

Create Reinforcement Learning Environment for Simulink Model

For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.

Open the model.

mdl = 'rlSimplePendulumModel';
open_system(mdl)

Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.

The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.

obsInfo = rlNumericSpec([3 1])

obsInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: [0×0 string]
Description: [0×0 string]
Dimension: [3 1]
DataType: "double"

The action is a scalar expressing the torque and can be one of three possible values, -2 Nm, 0 Nm and
2 Nm.

actInfo = rlFiniteSetSpec([-2 0 2])

actInfo =
rlFiniteSetSpec with properties:

Elements: [3×1 double]
Name: [0×0 string]
Description: [0×0 string]
Dimension: [1 1]
DataType: "double"

You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';

Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)

env =
SimulinkEnvWithAgent with properties:

Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on

You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
env.ResetFcn = @(in) setVariable(in,'theta0',randn,'Workspace',mdl)

env =
SimulinkEnvWithAgent with properties:

Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on

Create Simulink Environment for Multiple Agents

Create an environment for the Simulink model from the example “Train Multiple Agents to Perform
Collaborative Task”.

Load the agents in the MATLAB workspace.


load rlCollaborativeTaskAgents

Create an environment for the rlCollaborativeTask model, which has two agent blocks. Since the
agents used by the two blocks (agentA and agentB) are already in the workspace, you do not need
to pass their observation and action specifications to create the environment.
env = rlSimulinkEnv( ...
'rlCollaborativeTask', ...
["rlCollaborativeTask/Agent A","rlCollaborativeTask/Agent B"])


env =
SimulinkEnvWithAgent with properties:

Model : rlCollaborativeTask
AgentBlock : [
rlCollaborativeTask/Agent A
rlCollaborativeTask/Agent B
]
ResetFcn : []
UseFastRestart : on

You can now simulate or train the agents within the environment using sim or train, respectively.

Create Continuous Simple Pendulum Model Environment

Use the predefined 'SimplePendulumModel-Continuous' keyword to create a continuous simple
pendulum model reinforcement learning environment.
env = rlPredefinedEnv('SimplePendulumModel-Continuous')

env =
SimulinkEnvWithAgent with properties:

Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on

Create Environment from Simulink Model

This example shows how to use createIntegratedEnv to create an environment object starting
from a Simulink model that implements the system with which the agent interacts. Such a system is
often referred to as the plant, open-loop system, or reference system, while the whole (integrated) system
including the agent is often referred to as the closed-loop system.

For this example, use the flying robot model described in “Train DDPG Agent to Control Flying
Robot” as the reference (open-loop) system.

Open the flying robot model.


open_system('rlFlyingRobotEnv')

Initialize the state variables and sample time.


% initial model state variables
theta0 = 0;
x0 = -15;
y0 = 0;

% sample time
Ts = 0.4;


Create the Simulink model myIntegratedEnv containing the flying robot model connected in a
closed loop to the agent block. The function also returns the reinforcement learning environment
object env to be used for training.

env = createIntegratedEnv('rlFlyingRobotEnv','myIntegratedEnv')

env =
SimulinkEnvWithAgent with properties:

Model : myIntegratedEnv
AgentBlock : myIntegratedEnv/RL Agent
ResetFcn : []
UseFastRestart : on

The function can also return the block path to the RL Agent block in the new integrated model, as
well as the observation and action specifications for the reference model.

[~,agentBlk,observationInfo,actionInfo] = createIntegratedEnv( ...
    'rlFlyingRobotEnv','myIntegratedEnv')

agentBlk =
'myIntegratedEnv/RL Agent'

observationInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "observation"
Description: [0x0 string]
Dimension: [7 1]
DataType: "double"

actionInfo =
rlNumericSpec with properties:

LowerLimit: -Inf
UpperLimit: Inf
Name: "action"
Description: [0x0 string]
Dimension: [2 1]
DataType: "double"

Returning the block path and specifications is useful in cases in which you need to modify
descriptions, limits, or names in observationInfo and actionInfo. After modifying the
specifications, you can then create an environment from the integrated model myIntegratedEnv using
the rlSimulinkEnv function.

Version History
Introduced in R2019a


See Also
Functions
rlSimulinkEnv | rlPredefinedEnv | train | sim | rlNumericSpec | rlFiniteSetSpec

Blocks
RL Agent

Topics
“Create Simulink Reinforcement Learning Environments”


softplusLayer
Softplus layer for actor or critic network

Description
A softplus layer applies the softplus activation function Y = log(1 + e^X), which ensures that the output
is always positive. This activation function is a smooth continuous version of reluLayer. You can
incorporate this layer into the deep neural networks you define for actors in reinforcement learning
agents. This layer is useful for creating continuous Gaussian policy deep neural networks, for which
the standard deviation output must be positive.
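For instance, a sketch of the standard-deviation path of a Gaussian actor network (the layer sizes
are arbitrary assumptions) can end with a softplus layer to keep that output positive:

stdPath = [
    featureInputLayer(4,'Name','obsIn')
    fullyConnectedLayer(2,'Name','fcStd')
    softplusLayer('Name','stdOut')];   % standard deviations are always positive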

Creation

Syntax
sLayer = softplusLayer
sLayer = softplusLayer(Name,Value)

Description

sLayer = softplusLayer creates a softplus layer with default property values.

sLayer = softplusLayer(Name,Value) sets properties using name-value pairs.


For example, softplusLayer('Name','softlayer') creates a softplus layer and assigns the
name 'softlayer'.

Properties
Name — Name of layer
'softplus' (default) | character vector

Name of layer, specified as a character vector. To include a layer in a layer graph, you must specify a
nonempty unique layer name. If you train a series network with this layer and Name is set to '', then
the software automatically assigns a name to the layer at training time.

Description — Description of layer


'Softplus layer' (default) | character vector

This property is read-only.

Description of layer, specified as a character vector. When you create the softplus layer, you can use
this property to give it a description that helps you identify its purpose.

Examples


Create Softplus Layer

Create a softplus layer.

sLayer = softplusLayer;

You can specify the name of the softplus layer. For example, if the softplus layer represents the
standard deviation of a Gaussian policy deep neural network, you can specify an appropriate name.

sLayer = softplusLayer('Name','stddev')

sLayer =
SoftplusLayer with properties:

Name: 'stddev'

Learnable Parameters
No properties.

State Parameters
No properties.

Show all properties

You can incorporate sLayer into an actor network for reinforcement learning.

Version History
Introduced in R2020a

Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

GPU Code Generation


Generate CUDA® code for NVIDIA® GPUs using GPU Coder™.

See Also
quadraticLayer | scalingLayer

Topics
“Create Policies and Value Functions”

4

Blocks

RL Agent
Reinforcement learning agent
Library: Reinforcement Learning Toolbox

Description
Use the RL Agent block to simulate and train a reinforcement learning agent in Simulink. You
associate the block with an agent stored in the MATLAB workspace or a data dictionary, such as an
rlACAgent or rlDDPGAgent object. You connect the block so that it receives an observation and a
computed reward. For instance, consider the block diagram of the rlSimplePendulumModel model.

The observation input port of the RL Agent block receives a signal that is derived from the
instantaneous angle and angular velocity of the pendulum. The reward port receives a reward
calculated from the same two values and the applied action. You configure the observations and
reward computations that are appropriate to your system.

The block uses the agent to generate an action based on the observation and reward you provide.
Connect the action output port to the appropriate input for your system. For instance, in the
rlSimplePendulumModel, the action output port is a torque applied to the pendulum system. For
more information about this model, see “Train DQN Agent to Swing Up and Balance Pendulum”.

To train a reinforcement learning agent in Simulink, you generate an environment from the Simulink
model. You then create and configure the agent for training against that environment. For more
information, see “Create Simulink Reinforcement Learning Environments”. When you call train
using the environment, train simulates the model and updates the agent associated with the block.
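As a minimal sketch of this workflow (the model name, block path, specifications, and agent are
hypothetical placeholders):

env = rlSimulinkEnv("myModel","myModel/RL Agent",obsInfo,actInfo);
trainOpts = rlTrainingOptions(MaxEpisodes=500);
trainStats = train(agent,env,trainOpts);   % simulates the model and updates the agent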

Ports
Input

observation — Environment observations


scalar | vector | nonvirtual bus

This port receives observation signals from the environment. Observation signals represent
measurements or other instantaneous system data. If you have multiple observations, you can use a
Mux block to combine them into a vector signal. To use a nonvirtual bus signal, use bus2RLSpec.

reward — Reward from environment


scalar

This port receives the reward signal, which you compute based on the observation data. The reward
signal is used during agent training to maximize the expectation of the long-term reward.

isdone — Flag to terminate episode simulation


logical

Use this signal to specify conditions under which to terminate a training episode. You must configure
logic appropriate to your system to determine the conditions for episode termination. One application
is to terminate an episode that is clearly going well or going poorly. For instance, you can terminate
an episode if the agent reaches its goal or goes irrecoverably far from its goal.

external action — External action signal


scalar | vector

Use this signal to provide an external action to the block. This signal can be a control action from a
human expert, which can be used for safe or imitation learning applications. When the value of the
use external action signal is 1, the block passes the external action signal to the environment through
the action block output. The block also uses the external action to update the agent policy based on
the resulting observations and rewards.
Dependencies

To enable this port, select the External action inputs parameter.

last action — Last action applied to environment signal


scalar | vector

For some applications, the action applied to the environment can differ from the action output by the
RL Agent block. For example, the Simulink model can contain a saturation block on the action output
signal.

In such cases, to improve learning results, you can enable this input port and connect the actual
action signal that is applied to the environment.


Note The last action port should be used only with off-policy agents; otherwise, training can produce
unexpected results.

Dependencies

To enable this port, select the Last action input parameter.

use external action — Use external action signal


0|1

Use this signal to pass the external action signal to the environment.

When the value of the use external action signal is 1, the block passes the external action signal to
the environment. The block also uses the external action to update the agent policy.

When the value of the use external action signal is 0, the block does not pass the external action
signal to the environment and does not update the policy using the external action. Instead, the
block outputs the action generated by the agent policy.
Dependencies

To enable this port, select the External action inputs parameter.

Output

action — Agent action


scalar | vector | nonvirtual bus

Action computed by the agent based on the observation and reward inputs. Connect this port to the
inputs of your system. To use a nonvirtual bus signal, use bus2RLSpec.

Note Continuous action-space agents such as rlACAgent, rlPGAgent, or rlPPOAgent (the ones
using an rlContinuousGaussianActor object), do not enforce constraints set by the action
specification. In these cases, you must enforce action space constraints within the environment.

cumulative reward — Total reward


scalar | vector

Cumulative sum of the reward signal during simulation. Observe or log this signal to track how the
cumulative reward evolves over time.
Dependencies

To enable this port, select the Cumulative reward output parameter.

Parameters
Agent object — Agent to train
agentObj (default) | agent object

Enter the name of an agent object stored in the MATLAB workspace or a data dictionary, such as an
rlACAgent or rlDDPGAgent object. For information about agent objects, see “Reinforcement
Learning Agents”.


If the RL Agent block is within a conditionally executed subsystem, such as a Triggered Subsystem or
a Function-Call Subsystem, you must specify the sample time of the agent object as -1 so that the
block can inherit the sample time of its parent subsystem.
Programmatic Use
Block Parameter: Agent
Type: string, character vector
Default: "agentObj"

Generate greedy policy block — Generate greedy policy block controller


button

Generate a Policy block that implements a greedy policy for the agent specified in Agent object by
calling the generatePolicyBlock block function. To generate a greedy policy, the block sets the
UseExplorationPolicy property of the agent to false before generating the policy block.

The generated block is added to a new Simulink model and the policy data is saved in a MAT-file in
the current working folder.

External action inputs — Add input ports for external action


off (default) | on

Enable the external action and use external action block input ports by selecting this parameter.
Programmatic Use
Block Parameter: ExternalActionAsInput
Type: string, character vector
Values: "off" | "on"
Default: "off"

Last action input — Add input ports for last action applied to environment
off (default) | on

Enable the last action block input port by selecting this parameter.
Programmatic Use
Block Parameter: ProvideLastAction
Type: string, character vector
Values: "off" | "on"
Default: "off"

Cumulative reward output — Add cumulative reward output port


off (default) | on

Enable the cumulative reward block output by selecting this parameter.


Programmatic Use
Block Parameter: ProvideCumRwd
Type: string, character vector
Values: "off" | "on"
Default: "off"

Use strict observation data types — Enforce strict data types for observations
off (default) | on

Select this parameter to enforce the observation data types. In this case, if the data type of the signal
connected to the observation input port does not match the data type in the ObservationInfo
property of the agent, the block attempts to cast the signal to the correct data type. If casting the
data type is not possible, the block generates an error.

Enforcing strict data types:

• Lets you validate that the block is getting the correct data types.
• Allows other blocks to inherit their data type from the observation port.

Programmatic Use
Block Parameter: UseStrictObservationDataTypes
Type: string, character vector
Values: "off" | "on"
Default: "off"
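For example, assuming a block at the hypothetical path 'myModel/RL Agent', you can set this
parameter programmatically:

set_param('myModel/RL Agent','UseStrictObservationDataTypes','on');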

Version History
Introduced in R2019a

See Also
Functions
bus2RLSpec | createIntegratedEnv

Blocks
Policy

Topics
“Create Simulink Reinforcement Learning Environments”
“Create Simulink Environment and Train Agent”


Policy
Reinforcement learning policy
Library: Reinforcement Learning Toolbox

Description
Use the Policy block to simulate a reinforcement learning policy in Simulink and to generate code
(using Simulink Coder) for deployment purposes. This block takes an observation as input and
outputs an action. You associate the block with a MAT-file that contains the information needed to
fully characterize the policy, and which can be generated by generatePolicyFunction or
generatePolicyBlock.
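As a minimal sketch (agent here is a hypothetical trained agent object already in the MATLAB
workspace), you can generate the block and its data file with:

generatePolicyBlock(agent);   % adds a Policy block to a new model and saves its data (blockAgentData.mat by default)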

Ports
Input

observation — Environment observations


scalar | vector | nonvirtual bus

This port receives observation signals from the environment. Observation signals represent
measurements or other instantaneous system data. If you have multiple observations, you can use a
Mux block to combine them into a vector signal. To use a nonvirtual bus signal, use bus2RLSpec.

Output

action — Policy action


scalar | vector | nonvirtual bus

Action computed by the policy based on the observation input. Connect this port to the inputs of your
system. To use a nonvirtual bus signal, use bus2RLSpec.

Note Policy blocks generated from a continuous action-space rlStochasticActorPolicy object or


a continuous action-space rlACAgent, rlPGAgent, or rlPPOAgent object, do not enforce the
constraints set by the action specification. In these cases, you must enforce action space constraints
within the environment.

Parameters
Policy block data MAT file — Policy block data MAT file
blockAgentData.mat (default) | file name

Enter the name of the MAT-file containing the information needed to fully characterize the policy. This
file is generated by generatePolicyFunction or generatePolicyBlock. When you generate the
block using generatePolicyBlock and specify a non-default dataFileName argument, then the
generated block has this parameter set to the specified file name, so that the block is associated with
the generated data file.

To use a Policy block within a conditionally executed subsystem, such as a Triggered Subsystem or a
Function-Call Subsystem, you must generate its data file from an agent or policy object which has its
SampleTime property set to -1. Doing so allows the block to inherit the sample time of its parent
subsystem.
Programmatic Use
Block Parameter: MATFile
Type: string, character vector
Default: "blockAgentData.mat"

Version History
Introduced in R2022b

Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using Simulink® Coder™.

See Also
Functions
bus2RLSpec | createIntegratedEnv | generatePolicyFunction | generatePolicyBlock

Blocks
RL Agent

Topics
“Create Policies and Value Functions”
“Create Simulink Reinforcement Learning Environments”
“Create Simulink Environment and Train Agent”
