Reinforcement Learning Toolbox Reference
R2022b
Contents

1 Apps
2 Functions
3 Objects
4 Blocks
1 Apps

Reinforcement Learning Designer
Design, train, and simulate reinforcement learning agents
Description
The Reinforcement Learning Designer app lets you design, train, and simulate agents for existing
environments.
Limitations
The following features are not supported in the Reinforcement Learning Designer app.
• Multi-agent systems
• Q, SARSA, PG, AC, and SAC agents
• Custom agents
• Agents relying on table or custom basis function representations
If your application requires any of these features, then design, train, and simulate your agent at the
command line.
Examples
• “Create MATLAB Environments for Reinforcement Learning Designer”
• “Create Simulink Environments for Reinforcement Learning Designer”
• “Create Agents Using Reinforcement Learning Designer”
• “Design and Train Agent Using Reinforcement Learning Designer”
Programmatic Use
reinforcementLearningDesigner opens the Reinforcement Learning Designer app. You can
then import an environment and start the design process, or open a saved design session.
Version History
Introduced in R2021a
See Also
Apps
Deep Network Designer | Simulation Data Inspector
Functions
rlDQNAgent | rlDDPGAgent | rlTD3Agent | rlPPOAgent | analyzeNetwork
Topics
“Create MATLAB Environments for Reinforcement Learning Designer”
“Create Simulink Environments for Reinforcement Learning Designer”
“Create Agents Using Reinforcement Learning Designer”
“Design and Train Agent Using Reinforcement Learning Designer”
“What Is Reinforcement Learning?”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
2 Functions
accelerate
Package: rl.function
Option to accelerate computation of gradient for approximator object based on neural network
Syntax
newAppx = accelerate(oldAppx,useAcceleration)
Description
newAppx = accelerate(oldAppx,useAcceleration) returns the new neural-network-based
function approximator object newAppx, which has the same configuration as the original object,
oldAppx, and the option to accelerate the gradient computation set to the logical value
useAcceleration.
Examples
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space with two channels. The first channel carries an observation from a continuous
four-dimensional space. The second carries a discrete scalar observation that can be either zero or
one. Finally, the action space is a three-dimensional vector in a continuous action space.
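A sketch of specification objects consistent with this description (variable names are assumptions):
obsInfo = [rlNumericSpec([4 1]) rlFiniteSetSpec([0 1])];
actInfo = rlNumericSpec([3 1]);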
To approximate the Q-value function within the critic, create a recurrent deep neural network. The
output layer must be a scalar expressing the value of executing the action given the observation.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel. Since the network
is recurrent, use sequenceInputLayer as the input layer and include an lstmLayer as one of the
other network layers.
% Define paths
inPath1 = [ sequenceInputLayer( ...
prod(obsInfo(1).Dimension), ...
Name="netObsIn1")
fullyConnectedLayer(5,Name="infc1") ];
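A sketch of the remaining input paths and the common output path, with layer names matching the connectLayers calls below (layer sizes and the LSTM width are assumptions):
inPath2 = [ sequenceInputLayer( ...
                prod(obsInfo(2).Dimension), ...
                Name="netObsIn2")
            fullyConnectedLayer(5,Name="infc2") ];
inPath3 = [ sequenceInputLayer( ...
                prod(actInfo(1).Dimension), ...
                Name="netActIn")
            fullyConnectedLayer(5,Name="infc3") ];
% Common path: concatenate the three inputs, add recurrence,
% and output a single Q-value
outPath = [ concatenationLayer(1,3,Name="cct")
            lstmLayer(8)
            fullyConnectedLayer(1) ];
% Assemble the layer graph
net = layerGraph(inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,inPath3);
net = addLayers(net,outPath);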
% Connect layers
net = connectLayers(net,"infc1","cct/in1");
net = connectLayers(net,"infc2","cct/in2");
net = connectLayers(net,"infc3","cct/in3");
% Plot network
plot(net)
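Presumably, the network is then converted to a dlnetwork object and summarized, which produces the display below:
net = dlnetwork(net);
summary(net)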
Initialized: true
Inputs:
1 'netObsIn1' Sequence input with 4 dimensions
2 'netObsIn2' Sequence input with 1 dimensions
3 'netActIn' Sequence input with 3 dimensions
Create the critic with rlQValueFunction, using the network, and the observation and action
specification objects.
critic = rlQValueFunction(net, ...
obsInfo, ...
actInfo, ...
ObservationInputNames=["netObsIn1","netObsIn2"], ...
ActionInputNames="netActIn");
To return the value of the actions as a function of the current observation, use getValue or
evaluate.
val = evaluate(critic, ...
{ rand(obsInfo(1).Dimension), ...
rand(obsInfo(2).Dimension), ...
rand(actInfo(1).Dimension) })
When you use evaluate, the result is a single-element cell array containing the value of the action in
the input, given the observation.
val{1}
ans = single
0.1360
Calculate the gradients of the sum of the three outputs with respect to the inputs, given a random
observation.
gro = gradient(critic,"output-input", ...
{ rand(obsInfo(1).Dimension) , ...
rand(obsInfo(2).Dimension) , ...
rand(actInfo(1).Dimension) } )
The result is a cell array with as many elements as the number of input channels. Each element
contains the derivatives of the sum of the outputs with respect to each component of the input
channel. Display the gradient with respect to the element of the second channel.
gro{2}
ans = single
0.0243
Obtain the gradient with respect to the inputs for a batch of five independent sequences, each one
made of nine sequential observations.
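A sketch of the corresponding call, batching the random inputs along the batch (third) and sequence (fourth) dimensions:
gro_batch = gradient(critic,"output-input", ...
    { rand([obsInfo(1).Dimension 5 9]) , ...
      rand([obsInfo(2).Dimension 5 9]) , ...
      rand([actInfo(1).Dimension 5 9]) } )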
Display the derivative of the sum of the outputs with respect to the third observation element of the
first input channel, after the seventh sequential observation in the fourth independent batch.
gro_batch{1}(3,4,7)
ans = single
0.0108
Use accelerate to enable acceleration of the gradient computation for the critic.

critic = accelerate(critic,true);
Calculate the gradients of the sum of the outputs with respect to the parameters, given a random
observation.
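A sketch of the corresponding call:
grp = gradient(critic,"output-parameters", ...
    { rand(obsInfo(1).Dimension) , ...
      rand(obsInfo(2).Dimension) , ...
      rand(actInfo(1).Dimension) } )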
Each array within a cell contains the gradient of the sum of the outputs with respect to a group of
parameters.
{ 5x1 single }
{32x15 single }
{32x8 single }
{32x1 single }
{[2.6325 10.1821 -14.0886 0.4162 2.0677 5.3991 0.3904 -8.9048]}
{[ 45]}
If you use a batch of inputs, gradient uses the whole input sequence (in this case nine steps), and
all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space with two channels. The first channel carries an observation from a continuous
four-dimensional space. The second carries a discrete scalar observation that can be either zero or
one. Finally, the action space consists of a scalar that can be -1, 0, or 1.
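A sketch of specification objects consistent with this description (variable names are assumptions):
obsInfo = [rlNumericSpec([4 1]) rlFiniteSetSpec([0 1])];
actInfo = rlFiniteSetSpec([-1 0 1]);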
Create a deep neural network to be used as approximation model within the actor. The output layer
must have three elements, each one expressing the value of executing the corresponding action,
given the observation. To create a recurrent neural network, use sequenceInputLayer as the input
layer and include an lstmLayer as one of the other network layers.
% Define paths
inPath1 = [ sequenceInputLayer(prod(obsInfo(1).Dimension))
fullyConnectedLayer(prod(actInfo.Dimension),Name="fc1") ];
inPath2 = [ sequenceInputLayer(prod(obsInfo(2).Dimension))
fullyConnectedLayer(prod(actInfo.Dimension),Name="fc2") ];
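A sketch of the common output path and layer graph assembly, with names matching the connectLayers calls below (the LSTM width is an assumption):
% Common path: concatenate the two inputs, add recurrence, and output
% one value per possible action
outPath = [ concatenationLayer(1,2,Name="cct")
            lstmLayer(8)
            fullyConnectedLayer(numel(actInfo.Elements)) ];
% Assemble the layer graph
net = layerGraph(inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,outPath);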
% Connect layers
net = connectLayers(net,"fc1","cct/in1");
net = connectLayers(net,"fc2","cct/in2");
% Plot network
plot(net)
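Presumably, the network is then converted to a dlnetwork object and summarized, which produces the display below:
net = dlnetwork(net);
summary(net)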
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
2 'sequenceinput_1' Sequence input with 1 dimensions
Since each element of the output layer must represent the probability of executing one of the possible
actions the software automatically adds a softmaxLayer as a final output layer if you do not specify
it explicitly.
Create the actor with rlDiscreteCategoricalActor, using the network and the observations and
action specification objects. When the network has multiple input layers, they are automatically
associated with the environment observation channels according to the dimension specifications in
obsInfo.
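A sketch of the actor construction and of an evaluation call returning the action probabilities (random observation inputs assumed):
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

% Return the probability of each of the three possible actions
prob = evaluate(actor, ...
    { rand(obsInfo(1).Dimension), ...
      rand(obsInfo(2).Dimension) });
prob{1}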
0.3403
0.3114
0.3483
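A sampled action can then be obtained with getAction (sketch, random inputs assumed):
act = getAction(actor, ...
    { rand(obsInfo(1).Dimension), ...
      rand(obsInfo(2).Dimension) });
act{1}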
ans = 1
Calculate the gradients of the sum of the outputs with respect to the parameters, using a batch of five
independent sequences, each made of nine sequential observations. As before, each array within a cell
contains the gradient of the sum of the outputs with respect to a group of parameters.
grp_batch = gradient(actor,"output-parameters", ...
{ rand([obsInfo(1).Dimension 5 9]) , ...
rand([obsInfo(2).Dimension 5 9])} )
If you use a batch of inputs, the gradient uses the whole input sequence (in this case nine steps),
and all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.
Input Arguments
oldAppx — Function approximator object
function approximator object

Function approximator object, specified as one of the following:
• rlValueFunction,
• rlQValueFunction,
• rlVectorQValueFunction,
• rlDiscreteCategoricalActor,
• rlContinuousDeterministicActor,
• rlContinuousGaussianActor,
• rlContinuousDeterministicTransitionFunction,
• rlContinuousGaussianTransitionFunction,
• rlContinuousDeterministicRewardFunction,
• rlContinuousGaussianRewardFunction,
• rlIsDoneFunction.
useAcceleration — Option to use acceleration
logical

Option to use acceleration for gradient computations, specified as a logical value. When
useAcceleration is true, the gradient computations are accelerated by optimizing and caching
some inputs needed by the automatic-differentiation computation graph. For more information, see
“Deep Learning Function Acceleration for Custom Training Loops”.
Output Arguments
newAppx — Actor or critic
approximator object
New actor or critic, returned as an approximator object with the same type as oldAppx but with the
gradient acceleration option set to useAcceleration.
Version History
Introduced in R2022a
See Also
evaluate | gradient | getLearnableParameters | rlValueFunction | rlQValueFunction |
rlVectorQValueFunction | rlContinuousDeterministicActor |
rlDiscreteCategoricalActor | rlContinuousGaussianActor |
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction
Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
allExperiences
Package: rl.replay
Syntax
experiences = allExperiences(buffer)
experience = allExperiences(buffer,ConcatenateMode=mode)
Description
experiences = allExperiences(buffer) returns all experiences stored in experience buffer
buffer as individual experiences, each with a batch size of 1 and a sequence length of 1.
Examples
Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.
obsContinuous = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[1;5]);
obsDiscrete = rlFiniteSetSpec([1 2 3]);
obsInfo = [obsContinuous obsDiscrete];
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.
actInfo = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[5;10]);
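A sketch of the buffer creation (the capacity value is an assumption):
buffer = rlReplayMemory(obsInfo,actInfo,10000);

Then append 50 random experiences to the buffer using a structure array.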
for i = 1:50
    experience(i).Observation = ...
        {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
    experience(i).Action = {actInfo.UpperLimit.*rand(2,1)};
    experience(i).NextObservation = ...
        {obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
    experience(i).Reward = 10*rand(1);
    experience(i).IsDone = 0;
end
append(buffer,experience);
After appending experiences to the buffer, you extract all of the experiences from the buffer. Extract
all of the experiences as individual experiences, each with a batch size of 1 and sequence size of 1.
experience = allExperiences(buffer)
Alternatively, you can extract all of the experiences as a single experience batch.
expBatch = allExperiences(buffer,ConcatenateMode="batch")
Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object
mode — Experience concatenation option
"none" (default) | "batch" | "sequence"

Option to concatenate the returned experiences, specified as one of the following values.
• "none" — Return experience as N individual experiences, each with a batch size of 1 and a
sequence length of 1.
• "batch" — Return experience as a single batch with a sequence length of 1.
• "sequence" — Return experience as a single sequence with a batch size of 1.
Output Arguments
experience — All buffered experiences
structure array | structure
All N buffered experiences, returned as a structure array or structure. When mode is:
• "none", experience is returned as a structure array of length N, where each element contains
one buffered experience (batchSize = 1 and SequenceLength = 1).
• "batch", experience is returned as a structure. Each field of experience contains all buffered
experiences concatenated along the batch dimension (batchSize = N and SequenceLength =
1).
• "sequence", experience is returned as a structure. Each field of experience contains all
buffered experiences concatenated along the batch dimension (batchSize = 1 and
SequenceLength = N).
Observation — Observation
cell array
Observation, returned as a cell array with length equal to the number of observation specifications
specified when creating the buffer. Each element of Observation contains a DO-by-batchSize-by-
SequenceLength array, where DO is the dimension of the corresponding observation specification.
Agent action, returned as a cell array with length equal to the number of action specifications
specified when creating the buffer. Each element of Action contains a DA-by-batchSize-by-
SequenceLength array, where DA is the dimension of the corresponding action specification.
Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-
SequenceLength array.
Next observation reached by taking the specified action from the observation, returned as a cell array
with the same format as Observation.
Version History
Introduced in R2022b
See Also
rlReplayMemory | rlPrioritizedReplayMemory
append
Package: rl.replay
Syntax
append(buffer,experience)
append(buffer,experience,dataSourceID)
Description
append(buffer,experience) appends the experiences in experience to the replay memory
buffer.
Examples
Define observation specifications for the environment. For this example, assume that the environment
has a single observation channel with three continuous signals in specified ranges.
obsInfo = rlNumericSpec([3 1],...
LowerLimit=0,...
UpperLimit=[1;5;10]);
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with two continuous signals in specified ranges.
actInfo = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[5;10]);
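A sketch of the buffer creation, which the append calls below assume (the capacity value is an assumption):
buffer = rlReplayMemory(obsInfo,actInfo,20000);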
Append a single experience to the buffer using a structure. Each experience contains the following
elements: current observation, action, next observation, reward, and is-done.
For this example, create an experience with random observation, action, and reward values. Indicate
that this experience is not a terminal condition by setting the IsDone value to 0.
exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;
append(buffer,exp);
You can also append a batch of experiences to the experience buffer using a structure array. For this
example, append a sequence of 100 random experiences, with the final experience representing a
terminal condition.
for i = 1:100
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;
append(buffer,expBatch);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.
miniBatch = sample(buffer,50);
You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive
experiences with a discount factor of 0.95.
horizonSample = sample(buffer,1,...
NStepHorizon=10,...
DiscountFactor=0.95);
• Observation and Action are the observation and action from the first experience in the
horizon.
• NextObservation and IsDone are the next observation and termination signal from the final
experience in the horizon.
• Reward is the cumulative reward across the horizon using the specified discount factor.
You can also sample a sequence of consecutive experiences. In this case, the structure fields contain
arrays with values for all sampled experiences.
sequenceSample = sample(buffer,1,...
SequenceLength=20);
Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.
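A sketch of the corresponding specification objects, mirroring the earlier example:
obsInfo = [rlNumericSpec([2 1],LowerLimit=0,UpperLimit=[1;5]) ...
           rlFiniteSetSpec([1 2 3])];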
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.
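A sketch of the corresponding action specification:
actInfo = rlNumericSpec([2 1],LowerLimit=0,UpperLimit=[5;10]);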
buffer = rlReplayMemory(obsInfo,actInfo,5000);
for i = 1:50
exp(i).Observation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
exp(i).NextObservation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Reward = 10*rand(1);
exp(i).IsDone = 0;
end
append(buffer,exp);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.
miniBatch = sample(buffer,10);
Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object

experience — Experience to append to buffer
structure | structure array

Experience to append to the buffer, specified as a structure or structure array with the following
fields.
Observation — Observation
cell array
Observation, specified as a cell array with length equal to the number of observation specifications
specified when creating the buffer. The dimensions of each element in Observation must match the
dimensions in the corresponding observation specification.
Action taken by the agent, specified as a cell array with length equal to the number of action
specifications specified when creating the buffer. The dimensions of each element in Action must
match the dimensions in the corresponding action specification.
Reward value obtained by taking the specified action from the starting observation, specified as a
scalar.
Next observation reached by taking the specified action from the starting observation, specified as a
cell array with the same format as Observation.
If experience is a structure array, specify dataSourceID as an array with length equal to the
length of experience. You can specify different data source indices for each element of
experience. If all elements in experience come from the same data source, you can specify
dataSourceID as a scalar integer.
Version History
Introduced in R2022a
See Also
rlReplayMemory | sample
barrierPenalty
Logarithmic barrier penalty value for a point with respect to a bounded region
Syntax
p = barrierPenalty(x,xmin,xmax)
p = barrierPenalty( ___ ,maxValue,curvature)
Description
p = barrierPenalty(x,xmin,xmax) calculates the nonnegative (logarithmic barrier) penalty
vector p for the point x with respect to the region bounded by xmin and xmax. p has the same
dimension as x. This syntax uses the default values of 1 and 0.1 for the maxValue and curvature
parameters of the barrier function, respectively.
Examples
This example shows how to use the logarithmic barrierPenalty function to calculate the barrier
penalty for a given point, with respect to a bounded region.
Calculate the penalty value for the point 0.1 within the interval [-2,2] using default values for the
maximum value and curvature parameters.
barrierPenalty(0.1,-2,2)
ans = 2.5031e-04
Calculate the penalty value for the point 4 outside the interval [-2,2].
barrierPenalty(4,-2,2)
ans = 1
Calculate the penalty value for the point 4 outside the interval [-2,2], using a maximum value
parameter of 5.
barrierPenalty(4,-2,2,5)
ans = 5
Calculate the penalty value for the point 0.1 inside the interval [-2,2], using a curvature parameter of
0.5.
barrierPenalty(0.1,-2,2,5,0.5)
ans = 0.0013
Calculate the penalty value for the point [-2,0,4] with respect to the box defined by [0,1], [-1,1], and
[-2,2] along the x, y, and z dimensions, respectively, using the default value for maximum value and a
curvature parameter of 0.
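A sketch of the corresponding call (column vectors assumed):
barrierPenalty([-2;0;4],[0;-1;-2],[1;1;2],1,0)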
ans = 3×1
1
0
1
x = -5:0.01:5;
Calculate penalties for all the points in the vector, using the default value for the maximum value
parameter and a value of 0.01 for the curvature parameter.
p = barrierPenalty(x,-2,2,1,0.01);
plot(x,p)
grid
xlabel("point position");
ylabel("penalty value");
title("Penalty values over an interval");
Input Arguments
x — Point for which the penalty is calculated
scalar | vector | matrix
Point for which the penalty is calculated, specified as a numeric scalar, vector or matrix.
Example: [0.5; 1.6]
Lower bounds for x, specified as a numeric scalar, vector or matrix. To use the same minimum value
for all elements in x, specify xmin as a scalar.
Example: -1
Upper bounds for x, specified as a numeric scalar, vector or matrix. To use the same maximum value
for all elements in x, specify xmax as a scalar.
Example: 2
Output Arguments
p — Penalty value
nonnegative vector
Penalty value, returned as a vector of nonnegative elements. Each element p(i) depends on the position
of x(i) with respect to the interval specified by xmin(i) and xmax(i). The barrier penalty function returns
the value
p(x) = min( pmax, C*( log( 0.25*(xmax − xmin)^2 ) − log( (x − xmin)*(xmax − x) ) ) )
when xmin < x < xmax, and maxValue otherwise. Here, C is the argument curvature, and pmax is the
argument maxValue. Note that for positive values of C the returned penalty value is always positive.
If C is zero, then the returned penalty is zero inside the interval defined by the bounds, and pmax
outside this interval. If x is multidimensional, then the calculation is applied independently on each
dimension. Penalty functions are typically used to generate negative rewards when constraints are
violated, such as in generateRewardFunction.
Version History
Introduced in R2021b
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
See Also
Functions
generateRewardFunction | exteriorPenalty | hyperbolicPenalty
Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Generate Reward Function from a Model Verification Block for a Water Tank System”
“Define Reward Signals”
bus2RLSpec
Create reinforcement learning data specifications for elements of a Simulink bus
Syntax
specs = bus2RLSpec(busName)
specs = bus2RLSpec(busName,Name,Value)
Description
specs = bus2RLSpec(busName) creates a set of reinforcement learning data specifications from
the Simulink® bus object specified by busName. One specification element is created for each leaf
element in the corresponding Simulink bus. Use these specifications to define actions and
observations for a Simulink reinforcement learning environment.
Examples
This example shows how to use the function bus2RLSpec to create an observation specification
object from a Simulink® bus object.
obsBus = Simulink.Bus();
obsBus.Elements(1) = Simulink.BusElement;
obsBus.Elements(1).Name = 'sin_theta';
obsBus.Elements(2) = Simulink.BusElement;
obsBus.Elements(2).Name = 'cos_theta';
obsBus.Elements(3) = Simulink.BusElement;
obsBus.Elements(3).Name = 'dtheta';
Create the observation specification objects using the Simulink bus object.
obsInfo = bus2RLSpec('obsBus');
You can then use obsInfo, together with the corresponding Simulink model, to create a
reinforcement learning environment. For an example, see “Train DDPG Agent to Swing Up and
Balance Pendulum with Bus Signal”.
This example shows how to call the function bus2RLSpec using name and value pairs to create an
action specification object from a Simulink® bus object.
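A minimal sketch of the action bus, assuming a single leaf element named 'actuator' (mirroring the observation bus above):
actBus = Simulink.Bus();
actBus.Elements(1) = Simulink.BusElement;
actBus.Elements(1).Name = 'actuator';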
Create the action specification objects using the Simulink bus object.
actInfo = bus2RLSpec('actBus','DiscreteElements',{'actuator',[-1 1]});
This specifies that the 'actuator' bus element can carry two possible values, -1 and 1.
You can then use actInfo, together with the corresponding Simulink model, to create a
reinforcement learning environment. Specifically the function that creates the environment uses
actInfo to determine the right bus output of the agent block.
For an example, see “Train DDPG Agent to Swing Up and Balance Pendulum with Bus Signal”.
Input Arguments
busName — Name of Simulink bus object
string | character vector
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'DiscreteElements',{'force',[-5 0 5]} sets the 'force' bus element to be a
discrete data specification with three possible values, -5, 0, and 5
Name of the Simulink model, specified as the comma-separated pair consisting of 'Model' and a
string or character vector. Specify the model name when the bus object is defined in the model global
workspace (for example, in a data dictionary) instead of the MATLAB workspace.
Names of bus leaf elements for which to create specifications, specified as the comma-separated pair
consisting of 'BusElementNames' and a string array. To create observation specifications for a subset
of the elements in a Simulink bus object, specify BusElementNames. If you do not specify
BusElementNames, a data specification is created for each leaf element in the bus.
Note Do not specify BusElementNames when creating specifications for action signals. The RL
Agent block must output the full bus signal.
Finite values for discrete bus elements, specified as the comma-separated pair consisting of
'DiscreteElements' and a cell array of name-value pairs. Each name-value pair consists of a bus
leaf element name and an array of discrete values. The specified discrete values must be castable to
the data type of the specified action signal.
If you do not specify discrete values for an element specification, the element is continuous.
Example: 'DiscreteElements',{'force',[-10 0 10],'torque',[-5 0 5]} specifies
discrete values for the 'force' and 'torque' leaf elements of a bus action signal.
Output Arguments
specs — Data specifications
rlNumericSpec object | rlFiniteSetSpec object | array of data specification objects
Data specifications for reinforcement learning actions or observations, returned as one of the
following:
By default, all data specifications for bus elements are rlNumericSpec objects. To create a discrete
specification for one or more bus elements, specify the element names using the DiscreteElements
name-value pair.
Version History
Introduced in R2019a
See Also
Blocks
RL Agent
Functions
rlSimulinkEnv | createIntegratedEnv | rlNumericSpec | rlFiniteSetSpec
Topics
“Create Simulink Reinforcement Learning Environments”
cleanup
Package: rl.env
Syntax
cleanup(env)
cleanup(lgr)
Description
When you define a custom training loop for reinforcement learning, you can simulate an agent or
policy against an environment using the runEpisode function. Use the cleanup function to clean up
the environment after running simulations using multiple calls to runEpisode. To clean up the
environment after each simulation, you can configure runEpisode to automatically call the cleanup
function at the end of each episode.
Also use cleanup to perform clean up tasks for a FileLogger or MonitorLogger object after
logging data within a custom training loop.
Environment Objects
cleanup(env) cleans up the specified reinforcement learning environment after running multiple
simulations using runEpisode.
Data Logger Objects
cleanup(lgr) cleans up the specified data logger object after logging data within a custom training
loop. This task might involve, for example, transferring any remaining data from the lgr internal
memory to a logging target (either a MAT file or a trainingProgressMonitor object). For an
example, see “Log Data to Disk in a Custom Training Loop”.
Examples
Create a reinforcement learning environment and extract its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the Q-value function within the critic, use a neural network. Create a network as an
array of layer objects.
net = [...
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(2)
softmaxLayer];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'input' 4 features
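A minimal sketch of the policy and experience buffer used below, assuming a vector Q-value critic with an epsilon-greedy policy and an rlReplayMemory buffer (the policy type and buffer capacity are assumptions):
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
policy = rlEpsilonGreedyPolicy(critic);

% Experience buffer used to store episode data
buffer = rlReplayMemory(obsInfo,actInfo,10000);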
Set up the environment for running multiple simulations. For this example, configure the training to
log any errors rather than send them to the command window.
setup(env,StopOnError="off")
Simulate multiple episodes using the environment and policy. After each episode, append the
experiences to the buffer. For this example, run 100 episodes.
for i = 1:100
output = runEpisode(env,policy,MaxSteps=300);
append(buffer,output.AgentData.Experiences)
end
Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.
batch = sample(buffer,10);
You can then learn from the sampled experiences and update the policy and actor.
This example shows how to log data to disk when training an agent using a custom training loop.
Create a FileLogger object by using the rlDataLogger function.

flgr = rlDataLogger();
Set up the logger object. This operation initializes the object performing setup tasks such as, for
example, creating the directory to save the data files.
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10
    % Store random data in the logger memory, grouped in the variables
    % Context1 and Context2 (the store and write call signatures used
    % here are assumptions)
    store(flgr,"Context1",rand(3,1),iter);
    store(flgr,"Context2",rand(2,1),iter);

    % Transfer the data stored for this iteration to a MAT file
    write(flgr);
end
Clean up the logger object. This operation performs cleanup tasks such as, for example, writing to file
any data still in memory.
cleanup(flgr);
Input Arguments
env — Reinforcement learning environment
environment object | ...
If env is a SimulinkEnvWithAgent object and the associated Simulink model is configured to use
fast restart, then cleanup terminates the model compilation.
Version History
Introduced in R2022a
See Also
Functions
runEpisode | setup | reset | store | write
Objects
rlFunctionEnv | rlMDPEnv | SimulinkEnvWithAgent | rlNeuralNetworkEnvironment |
FileLogger | MonitorLogger
Topics
“Custom Training Loop with Simulink Action Noise”
createGridWorld
Create a two-dimensional grid world for reinforcement learning
Syntax
GW = createGridWorld(m,n)
GW = createGridWorld(m,n,moves)
Description
GW = createGridWorld(m,n) creates a grid world GW of size m-by-n with default actions of
['N';'S';'E';'W'].
Examples
For this example, consider a 5-by-5 grid world with the following rules:
1 A 5-by-5 grid world bounded by borders, with 4 possible actions (North = 1, South = 2, East = 3,
West = 4).
2 The agent begins from cell [2,1] (second row, first column).
3 The agent receives reward +10 if it reaches the terminal state at cell [5,5] (blue).
4 The environment contains a special jump from cell [2,4] to cell [4,4] with +5 reward.
5 The agent is blocked by obstacles in cells [3,3], [3,4], [3,5] and [4,3] (black cells).
6 All other actions result in -1 reward.
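A sketch of the call that creates the grid world and produces the display below:
GW = createGridWorld(5,5)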
GW =
GridWorld with properties:
GridSize: [5 5]
CurrentState: "[1,1]"
States: [25x1 string]
Actions: [4x1 string]
T: [25x25x4 double]
R: [25x25x4 double]
ObstacleStates: [0x1 string]
TerminalStates: [0x1 string]
ProbabilityTolerance: 8.8818e-16
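A sketch of the property assignments implied by rules 2, 3, and 5 above:
GW.CurrentState = "[2,1]";
GW.TerminalStates = "[5,5]";
GW.ObstacleStates = ["[3,3]";"[3,4]";"[3,5]";"[4,3]"];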
Update the state transition matrix for the obstacle states and set the jump rule over the obstacle
states.
updateStateTranstionForObstacles(GW)
GW.T(state2idx(GW,"[2,4]"),:,:) = 0;
GW.T(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 1;
nS = numel(GW.States);
nA = numel(GW.Actions);
GW.R = -1*ones(nS,nS,nA);
GW.R(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 5;
GW.R(:,state2idx(GW,GW.TerminalStates),:) = 10;
Now, use rlMDPEnv to create a grid world environment using the GridWorld object GW.
env = rlMDPEnv(GW)
env =
rlMDPEnv with properties:
You can visualize the grid world environment using the plot function.
plot(env)
Input Arguments
m — Number of rows of the grid world
scalar
Output Arguments
GW — Two-dimensional grid world
GridWorld object
Two-dimensional grid world, returned as a GridWorld object with properties listed below. For more
information, see “Create Custom Grid World Environments”.
Action names, specified as a string vector. The length of the Actions vector is determined by the
moves argument.
State transition matrix, specified as a 3-D array, which determines the possible movements of the
agent in an environment. State transition matrix T is a probability matrix that indicates how likely the
agent will move from the current state s to any possible next state s' by performing action a. T is
given by,
T(s,s′,a) = probability(s′ | s,a).
T is a K-by-K-by-A array, where K = m*n is the number of states and A is the number of actions.
Reward transition matrix, specified as a 3-D array, determines how much reward the agent receives
after performing an action in the environment. R has the same shape and size as state transition
matrix T. Reward transition matrix R is given by,
r = R(s,s′,a).
R is a K-by-K-by-A array with the same size as T.
State names that cannot be reached in the grid world, specified as a string vector.
Version History
Introduced in R2019a
See Also
rlMDPEnv | rlPredefinedEnv
Topics
“Create Custom Grid World Environments”
“Train Reinforcement Learning Agent in Basic Grid World”
createIntegratedEnv
Create Simulink model for reinforcement learning, using reference model as environment
Syntax
env = createIntegratedEnv(refModel,newModel)
[env,agentBlock,obsInfo,actInfo] = createIntegratedEnv( ___ )
Description
env = createIntegratedEnv(refModel,newModel) creates a Simulink model with the name
specified by newModel and returns a reinforcement learning environment object, env, for this model.
The new model contains an RL Agent block and uses the reference model refModel as a
reinforcement learning environment for training the agent specified by this block.
Examples
This example shows how to use createIntegratedEnv to create an environment object starting
from a Simulink model that implements the system with which the agent must interact. Such a system is
often referred to as plant, open-loop system, or reference system, while the whole (integrated) system
including the agent is often referred to as the closed-loop system.
For this example, use the flying robot model described in “Train DDPG Agent to Control Flying
Robot” as the reference (open-loop) system.
% sample time
Ts = 0.4;
Create the Simulink model myIntegratedEnv containing the flying robot model connected in a
closed loop to the agent block. The function also returns the reinforcement learning environment
object env to be used for training.
env = createIntegratedEnv('rlFlyingRobotEnv','myIntegratedEnv')
env =
SimulinkEnvWithAgent with properties:
Model : myIntegratedEnv
AgentBlock : myIntegratedEnv/RL Agent
ResetFcn : []
UseFastRestart : on
The function can also return the block path to the RL Agent block in the new integrated model, as
well as the observation and action specifications for the reference model.
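A sketch of the call form that returns these additional outputs:
[env,agentBlk,observationInfo,actionInfo] = ...
    createIntegratedEnv('rlFlyingRobotEnv','myIntegratedEnv')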
agentBlk =
'myIntegratedEnv/RL Agent'
observationInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "observation"
Description: [0x0 string]
Dimension: [7 1]
DataType: "double"
actionInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "action"
Description: [0x0 string]
Dimension: [2 1]
DataType: "double"
Returning the block path and specifications is useful in cases in which you need to modify
descriptions, limits, or names in observationInfo and actionInfo. After modifying the
specifications, you can then create an environment from the integrated model myIntegratedEnv using
the rlSimulinkEnv function.
This example shows how to call createIntegratedEnv using name-value pairs to specify port
names.
The first argument of createIntegratedEnv is the name of the reference Simulink model that
contains the system with which the agent must interact. Such a system is often referred to as plant,
or open-loop system. For this example, the reference system is the model of a water tank.
open_system('rlWatertankOpenloop')
Set the sample time of the discrete integrator block used to generate the observation, so the
simulation can run.
Ts = 1;
The input port is called u (instead of action), and the first and third output ports are called y and
stop (instead of observation and isdone). Specify the port names using name-value pairs.
env = createIntegratedEnv('rlWatertankOpenloop','IntegratedWatertank',...
'ActionPortName','u','ObservationPortName','y','IsDonePortName','stop')
env =
SimulinkEnvWithAgent with properties:
Model : IntegratedWatertank
AgentBlock : IntegratedWatertank/RL Agent
ResetFcn : []
UseFastRestart : on
The new model IntegratedWatertank contains the reference model connected in a closed-loop
with the agent block. The function also returns the reinforcement learning environment object to be
used for training.
Input Arguments
refModel — Reference model name
string | character vector
Reference model name, specified as a string or character vector. This is the Simulink model
implementing the system that the agent needs to interact with. Such a system is often referred to as
plant, open loop system or reference system, while the whole (integrated) system including the agent
is often referred to as the closed loop system. The new Simulink model uses this reference model as
the dynamic model of the environment for reinforcement learning.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'IsDonePortName',"stopSim" sets the stopSim port of the reference model as the
source of the isdone signal.
Reference model observation output port name, specified as the comma-separated pair consisting of
'ObservationPortName' and a string or character vector. Specify ObservationPortName when
the name of the observation output port of the reference model is not "observation".
Reference model action input port name, specified as the comma-separated pair consisting of
'ActionPortName' and a string or character vector. Specify ActionPortName when the name of
the action input port of the reference model is not "action".
Reference model reward output port name, specified as the comma-separated pair consisting of
'RewardPortName' and a string or character vector. Specify RewardPortName when the name of
the reward output port of the reference model is not "reward".
Reference model done flag output port name, specified as the comma-separated pair consisting of
'IsDonePortName' and a string or character vector. Specify IsDonePortName when the name of
the done flag output port of the reference model is not "isdone".
Names of observation bus leaf elements for which to create specifications, specified as a string array.
To create observation specifications for a subset of the elements in a Simulink bus object, specify
BusElementNames. If you do not specify BusElementNames, a data specification is created for each
leaf element in the bus.
ObservationBusElementNames is applicable only when the observation output port is a bus signal.
Example: 'ObservationBusElementNames',["sin" "cos"] creates specifications for the
observation bus elements with the names "sin" and "cos".
Finite values for discrete observation specification elements, specified as the comma-separated pair
consisting of 'ObservationDiscreteElements' and a cell array of name-value pairs. Each name-
value pair consists of an element name and an array of discrete values.
For each name-value pair, if the observation output port is:
• A bus signal, specify the name of one of the leaf elements of the bus specified by
ObservationBusElementNames.
• A nonbus signal, specify the name of the observation port, as specified by ObservationPortName.
The specified discrete values must be castable to the data type of the specified observation signal.
If you do not specify discrete values for an observation specification element, the element is
continuous.
Example: 'ObservationDiscreteElements',{'observation',[-1 0 1]} specifies discrete
values for a nonbus observation signal with default port name observation.
Example: 'ObservationDiscreteElements',{'gear',[-1 0 1 2],'direction',[1 2 3 4]}
specifies discrete values for the 'gear' and 'direction' leaf elements of a bus observation signal.
Finite values for discrete action specification elements, specified as the comma-separated pair
consisting of 'ActionDiscreteElements' and a cell array of name-value pairs. Each name-value
pair consists of an element name and an array of discrete values.
The specified discrete values must be castable to the data type of the specified action signal.
If you do not specify discrete values for an action specification element, the element is continuous.
Example: 'ActionDiscreteElements',{'action',[-1 0 1]} specifies discrete values for a
nonbus action signal with default port name 'action'.
Example: 'ActionDiscreteElements',{'force',[-10 0 10],'torque',[-5 0 5]} specifies
discrete values for the 'force' and 'torque' leaf elements of a bus action signal.
Output Arguments
env — Reinforcement learning environment
SimulinkEnvWithAgent object
Reinforcement learning environment interface, returned as a SimulinkEnvWithAgent object.

agentBlock — Block path to the agent block
character vector

Block path to the agent block in the new model, returned as a character vector. To train an agent in
the new Simulink model, you must create an agent and specify the agent name in the RL Agent block
indicated by agentBlock.
Version History
Introduced in R2019a
See Also
Blocks
RL Agent
Functions
rlSimulinkEnv | bus2RLSpec | rlNumericSpec | rlFiniteSetSpec
Topics
“Create Simulink Reinforcement Learning Environments”
createMDP
Create Markov decision process model
Syntax
MDP = createMDP(states,actions)
Description
MDP = createMDP(states,actions) creates a Markov decision process model with the specified
states and actions.
Examples
Create an MDP model with eight states and two possible actions.
MDP = createMDP(8,["up";"down"]);
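To make the model useful, you then populate the transition and reward arrays, for example (illustrative values; the specific indices are assumptions):
% From state 1, action 1 ("up") moves to state 2 with reward 3
MDP.T(1,2,1) = 1;
MDP.R(1,2,1) = 3;

% From state 1, action 2 ("down") moves to state 3 with reward 1
MDP.T(1,3,2) = 1;
MDP.R(1,3,2) = 1;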
Input Arguments
states — Model states
positive integer | string vector
• Positive integer — Specify the number of model states. In this case, each state has a default name,
such as "s1" for the first state.
• String vector — Specify the state names. In this case, the total number of states is equal to the
length of the vector.
actions — Model actions
positive integer | string vector

• Positive integer — Specify the number of model actions. In this case, each action has a default
name, such as "a1" for the first action.
• String vector — Specify the action names. In this case, the total number of actions is equal to the
length of the vector.
Output Arguments
MDP — MDP model
GenericMDP object
State names, specified as a string vector with length equal to the number of states.
Action names, specified as a string vector with length equal to the number of actions.
State transition matrix, specified as a 3-D array, which determines the possible movements of the
agent in an environment. State transition matrix T is a probability matrix that indicates how likely the
agent will move from the current state s to any possible next state s' by performing action a. T is an
S-by-S-by-A array, where S is the number of states and A is the number of actions. It is given by:
T(s,s′,a) = probability(s′ | s,a).
The transition probabilities out of a nonterminal state s for a given action must sum to one.
Therefore, all stochastic transitions out of a given state must be specified at the same time.
For example, to indicate that in state 1 following action 4 there is an equal probability of moving to
states 2 or 3, use the following:
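A sketch of the corresponding assignment (assuming the model variable is named MDP):
MDP.T(1,[2 3],4) = [0.5 0.5];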
You can also specify that, following an action, there is some probability of remaining in the same
state. For example:
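For instance, to remain in state 1 with probability 0.25 and otherwise move to state 2 or 3 (illustrative values):
MDP.T(1,[1 2 3],4) = [0.25 0.25 0.5];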
Reward transition matrix, specified as a 3-D array, which determines how much reward the agent
receives after performing an action in the environment. R has the same shape and size as state
transition matrix T. The reward for moving from state s to state s' by performing action a is given
by:
r = R(s,s′,a).
Terminal state names in the grid world, specified as a string vector of state names.
Version History
Introduced in R2019a
See Also
rlMDPEnv | createGridWorld
Topics
“Train Reinforcement Learning Agent in MDP Environment”
evaluate
Package: rl.function
Evaluate function approximator object given observation (or observation-action) input data
Syntax
outData = evaluate(fcnAppx,inData)
[outData,state] = evaluate(fcnAppx,inData)
Description
outData = evaluate(fcnAppx,inData) evaluates the function approximator object (that is, the
actor or critic) fcnAppx given the input value inData. It returns the output value outData.
Examples
This example shows you how to evaluate a function approximator object (that is, an actor or a critic).
For this example, the function approximator object is a discrete categorical actor and you evaluate it
given some observation data, obtaining in return the action probability distribution and the updated
network state.
Load the same environment used in “Train PG Agent to Balance Cart-Pole System”, and obtain the
observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "CartPole States"
Description: "x, dx, theta, dtheta"
Dimension: [4 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
To approximate the policy within the actor, use a recurrent deep neural network. Define the network
as an array of layer objects. Get the dimensions of the observation space and the number of possible
actions directly from the environment specification objects.
net = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(8)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(numel(actInfo.Elements)) ];
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
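A sketch of the actor construction assumed by the calls that follow:
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);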
Use evaluate to return the probability of each of the two possible actions. Note that the type of the
returned numbers is single, not double.
[prob,state] = evaluate(actor,{rand(obsInfo.Dimension)});
prob{1}
0.4847
0.5153
Since a recurrent neural network is used for the actor, the second output argument, representing the
updated state of the neural network, is not empty. In this case, it contains the updated (cell and
hidden) states for the eight units of the lstm layer used in the network.
state{:}
-0.0833
0.0619
-0.0066
-0.0651
0.0714
-0.0957
0.0614
-0.0326
-0.1367
0.1142
-0.0158
-0.1820
0.1305
-0.1779
0.0947
-0.0833
You can use getState and setState to extract and set the current state of the actor.
getState(actor)
You can obtain action probabilities and updated states for a batch of observations. For example, use a
batch of five independent observations.
obsBatch = reshape(1:20,4,1,5,1);
[prob,state] = evaluate(actor,{obsBatch})
The output arguments contain action probabilities and updated states for each observation in the
batch.
Note that the actor treats observation data along the batch length dimension independently, not
sequentially.
prob{1}
To evaluate the actor using sequential observations, use the sequence length (time) dimension. For
example, obtain action probabilities for five independent sequences, each one made of nine
sequential observations.
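A sketch of the corresponding call, batching random observations along the batch (third) and sequence (fourth) dimensions:
[prob,state] = evaluate(actor, ...
    {rand([obsInfo.Dimension 5 9])});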
The first output argument contains a vector of two probabilities (first dimension) for each element of
the observation batch (second dimension) and for each time element of the sequence length (third
dimension).
The second output argument contains two vectors of final states for each observation batch (that is,
the network maintains a separate state history for each observation batch).
Display the probability of the second action, after the seventh sequential observation in the fourth
independent batch.
prob{1}(2,4,7)
ans = single
0.5675
For more information on input and output format for recurrent neural networks, see the Algorithms
section of lstmLayer.
Input Arguments
fcnAppx — Function approximator object
function approximator object

Function approximator object, specified as one of the following:
• rlValueFunction,
• rlQValueFunction,
• rlVectorQValueFunction,
• rlDiscreteCategoricalActor,
• rlContinuousDeterministicActor,
• rlContinuousGaussianActor,
• rlContinuousDeterministicTransitionFunction,
• rlContinuousGaussianTransitionFunction,
• rlContinuousDeterministicRewardFunction,
• rlContinuousGaussianRewardFunction,
• rlIsDoneFunction object.
inData — Input data for function approximator
cell array

Input data for the function approximator, specified as a cell array with as many elements as the
number of input channels of fcnAppx. In the following section, the number of observation channels is
indicated by NO.
• If fcnAppx is an rlQValueFunction, an
rlContinuousDeterministicTransitionFunction or an
rlContinuousGaussianTransitionFunction object, then each of the first NO elements of
inData must be a matrix representing the current observation from the corresponding
observation channel. They must be followed by a final matrix representing the action.
• If fcnAppx is a function approximator object representing an actor or critic (but not an
rlQValueFunction object), inData must contain NO elements, each one a matrix representing
the current observation from the corresponding observation channel.
• If fcnAppx is an rlContinuousDeterministicRewardFunction, an
rlContinuousGaussianRewardFunction, or an rlIsDoneFunction object, then each of the
first NO elements of inData must be a matrix representing the current observation from the
corresponding observation channel. They must be followed by a matrix representing the action,
and finally by NO elements, each one being a matrix representing the next observation from the
corresponding observation channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Example: {rand(8,3,64,1),rand(4,1,64,1),rand(2,1,64,1)}
Output Arguments
outData — Output data from evaluation of function approximator object
cell array
Output data from the evaluation of the function approximator object, returned as a cell array. The size
and contents of outData depend on the type of object you use for fcnAppx. Here, NO is the number of
observation channels, and each element of outData is an array of size D-by-LB-by-LS, where:
• D is the vector of dimensions of the corresponding output channel of fcnAppx. Depending on the
type of approximator function, this channel can carry a predicted observation (or its mean value or
standard deviation), an action (or its mean value or standard deviation), the value (or values) of an
observation (or observation-action couple), a predicted reward, or a predicted termination status.
• LB is the batch size (length of a batch of independent inputs).
• LS is the sequence length (length of the sequence of inputs along the time dimension) for a
recurrent neural network. If fcnAppx does not use a recurrent neural network (which is the case
for environment function approximators, as they do not support recurrent neural networks), then
LS = 1.
Note If fcnAppx is a critic, then evaluate behaves identically to getValue except that it returns
results inside a single-cell array. If fcnAppx is an rlContinuousDeterministicActor actor, then
evaluate behaves identically to getAction. If fcnAppx is a stochastic actor such as an
rlDiscreteCategoricalActor or rlContinuousGaussianActor, then evaluate returns the
action probability distribution, while getAction returns a sample action. Specifically, for an
rlDiscreteCategoricalActor actor object, evaluate returns the probability of each possible
action. For an rlContinuousGaussianActor actor object, evaluate returns the mean and
standard deviation of the Gaussian distribution. For these kinds of actors, see also the note in
getAction regarding the enforcement of constraints set by the action specification.
state — Updated state of function approximator object
cell array

Next state of the function approximator object, returned as a cell array. If fcnAppx does not use a
recurrent neural network (which is the case for environment function approximators), then state is
an empty cell array.
You can set the state of the representation to state using the setState function. For example:
critic = setState(critic,state);
Version History
Introduced in R2022a
See Also
getValue | getAction | getMaxQValue | rlValueFunction | rlQValueFunction |
rlVectorQValueFunction | rlContinuousDeterministicActor |
rlDiscreteCategoricalActor | rlContinuousGaussianActor |
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | accelerate | gradient | predict
Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
exteriorPenalty
Exterior penalty value for a point with respect to a bounded region
Syntax
p = exteriorPenalty(x,xmin,xmax,method)
Description
p = exteriorPenalty(x,xmin,xmax,method) uses the specified method to calculate the
nonnegative (exterior) penalty vector p for the point x with respect to the region bounded by xmin
and xmax. p has the same dimension as x.
Examples
This example shows how to use the exteriorPenalty function to calculate the exterior penalty for
a given point, with respect to a bounded region.
Calculate the penalty value for the point 0.1 within the interval [-2,2], using the step method.
exteriorPenalty(0.1,-2,2,'step')
ans = 0
Calculate the penalty value for the point 4 outside the interval [-2,2], using the step method.
exteriorPenalty(4,-2,2,'step')
ans = 1
Calculate the penalty value for the point 4 outside the interval [-2,2], using the quadratic method.
exteriorPenalty(4,-2,2,'quadratic')
ans = 4
Calculate the penalty value for the point [-2,0,4] with respect to the box defined by the intervals
[0,1], [-1,1], and [-2,2] along the x, y, and z dimensions, respectively, using the quadratic method.
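A sketch of the corresponding call (column vectors assumed):
exteriorPenalty([-2;0;4],[0;-1;-2],[1;1;2],'quadratic')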
ans = 3×1
4
0
4
x = -5:0.01:5;
Calculate penalties for all the points in the vector, using the quadratic method.
p = exteriorPenalty(x,-2,2,'quadratic');
plot(x,p)
grid
xlabel("point position");
ylabel("penalty value");
title("Penalty values over an interval");
Input Arguments
x — Point for which penalty is calculated
scalar | vector | matrix
Point for which the exterior penalty is calculated, specified as a numeric scalar, vector, or matrix.
Example: [-0.1, 1.3]
Lower bounds for x, specified as a numeric scalar, vector, or matrix. To use the same minimum value
for all elements in x, specify xmin as a scalar.
Example: -2
Upper bounds for x, specified as a numeric scalar, vector, or matrix. To use the same maximum value
for all elements in x, specify xmax as a scalar.
Example: [5 10]
Function used to calculate the penalty, specified either as 'step' or 'quadratic'. You can also use
strings instead of character vectors.
Example: "quadratic"
Output Arguments
p — Penalty value
nonnegative vector
Penalty value, returned as a vector of nonnegative elements. With either of the two methods, each
element p(i) is zero if the corresponding x(i) is within the region specified by xmin(i) and xmax(i), and it is
positive otherwise. Penalty functions are typically used to generate negative rewards when
constraints are violated, such as in generateRewardFunction.
Version History
Introduced in R2021b
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
See Also
Functions
generateRewardFunction | hyperbolicPenalty | barrierPenalty
Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Define Reward Signals”
generatePolicyBlock
Generate Simulink block that evaluates policy of an agent or policy object
Syntax
generatePolicyBlock(agent)
generatePolicyBlock(policy)
Description
This function generates a Simulink Policy evaluation block from an agent or policy object. It also
creates a data file which stores policy information. The generated policy block loads this data file to
properly initialize itself prior to simulation. You can use the block to simulate the policy and generate
code for deployment purposes.
For more information on policies and value functions, see “Create Policies and Value Functions”.
generatePolicyBlock(agent) creates a block that evaluates the policy of the specified agent
using the default block name, policy name, and data file name.
generatePolicyBlock(policy) creates a block that evaluates the learned policy of the specified
policy object using the default block name, policy name, and data file name.
Examples
First, create and train a reinforcement learning agent. For this example, load the PG agent trained in
“Train PG Agent to Balance Cart-Pole System”.
load("MATLABCartpolePG.mat","agent")
Then, create a policy evaluation block from this agent using default names.
generatePolicyBlock(agent);
This command creates an untitled Simulink® model containing the policy block, and the
blockAgentData.mat file, which contains the information needed to create and initialize the policy
block (such as the trained deep neural network used by the actor within the agent). The block loads
this data file to properly initialize itself prior to simulation.
You can now drag and drop the block in a Simulink® model and connect it so that it takes the
observation from the environment as input and so that the calculated action is returned to the
environment. This allows you to simulate the policy in a closed loop. You can then generate code for
deployment purposes. For more information, see “Deploy Trained Reinforcement Learning Policies”.
bdclose("untitled")
Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.
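The code that creates these specification objects is not included in this excerpt; a minimal sketch consistent with the dimensions stated above is:
% Continuous four-dimensional observation space and two-dimensional action space.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);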
Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.
To approximate the policy function within the actor, use a recurrent deep neural network model.
Define the network as an array of layer objects, and get the dimension of the observation and action
spaces from the environment specification objects. To create a recurrent network, use a
sequenceInputLayer as the input layer (with size equal to the number of dimensions of the
observation channel) and include at least one lstmLayer.
layers = [
sequenceInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(10)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(20)
fullyConnectedLayer(actInfo.Dimension(1))
tanhLayer
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions (CTB)
Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
Check the actor with a random observation input.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
   -0.0742
    0.0158
Create a deterministic actor policy object from the actor.
policy = rlDeterministicActorPolicy(actor)
policy =
rlDeterministicActorPolicy with properties:
You can access the policy options using dot notation. Check the policy with a random observation
input.
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
-0.0060
-0.0161
Then, create a policy evaluation block from this policy object using the default name for the
generated MAT-file.
generatePolicyBlock(policy);
This command creates an untitled Simulink® model containing the policy block, and the
blockAgentData.mat file, which contains the information needed to create and initialize the policy
block (such as the trained deep neural network used by the actor within the agent). The block loads
this data file to properly initialize itself prior to simulation.
You can now drag and drop the block in a Simulink® model and connect it so that it takes the
observation from the environment as input and so that the calculated action is returned to the
environment. This allows you to simulate the policy in a closed loop. You can then generate code for
deployment purposes. For more information, see “Deploy Trained Reinforcement Learning Policies”.
bdclose("untitled")
Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object
Trained reinforcement learning agent, specified as one of the following agent objects. To train your
agent, use the train function.
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlPGAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent
For agents with a stochastic actor (PG, PPO, SAC, TRPO, AC), the action returned by the generated
policy function depends on the value of the UseExplorationPolicy property of the agent. By
default, UseExplorationPolicy is false and the generated action is deterministic. If
UseExplorationPolicy is true, the generated action is stochastic.
policy — Policy object
rlMaxQPolicy | rlDeterministicActorPolicy | rlStochasticActorPolicy
Policy object, specified as one of the following:
• rlMaxQPolicy
• rlDeterministicActorPolicy
• rlStochasticActorPolicy
MATFileName — Name of generated data file
string | character vector
Name of generated data file, specified as a string or character vector. If a file with the specified name
already exists in the current MATLAB folder, then an appropriate digit is added to the name so that
no existing file is overwritten.
The generated data file contains four structures that store data needed to fully characterize the
policy. Prior to simulation, the block (which is generated with the data file name as mask parameter)
loads this data file to properly initialize itself.
Version History
Introduced in R2019a
See Also
Policy | generatePolicyFunction | train | dlnetwork
Topics
“Generate Policy Block for Deployment”
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
“Deploy Trained Reinforcement Learning Policies”
generatePolicyFunction
Package: rl.policy
Generate function that evaluates policy of an agent or policy object
Syntax
generatePolicyFunction(agent)
generatePolicyFunction(policy)
Description
This function generates a policy evaluation function which you can use to:
• Generate code for deployment purposes using MATLAB Coder™ or GPU Coder™. For more
information, see “Deploy Trained Reinforcement Learning Policies”.
• Simulate the trained agent in Simulink using a MATLAB Function block.
This function also creates a data file which stores policy information. The evaluation function loads
this data file to properly initialize itself the first time it is called.
For more information on policies and value functions, see “Create Policies and Value Functions”.
generatePolicyFunction(agent) creates a function that evaluates the learned policy of the
specified agent using the default function name, policy name, and data file name.
generatePolicyFunction(policy) creates a function that evaluates the learned policy of the
specified policy object using the default function name, policy name, and data file name.
generatePolicyFunction( ___ ,Name=Value) specifies the function name, policy name, and data
file name using one or more name-value pair arguments.
Examples
This example shows how to create a policy evaluation function for a PG Agent.
First, create and train a reinforcement learning agent. For this example, load the PG agent trained in
“Train PG Agent to Balance Cart-Pole System”.
load("MATLABCartpolePG.mat","agent")
Then, create a policy evaluation function for this agent using default names.
generatePolicyFunction(agent);
This command creates the evaluatePolicy.m file, which contains the policy function, and the
agentData.mat file, which contains the trained deep neural network actor.
type evaluatePolicy.m
function action1 = evaluatePolicy(observation1)
persistent policy;
if isempty(policy)
policy = coder.loadRLPolicy("agentData.mat");
end
% evaluate the policy
action1 = getAction(policy,observation1);
end
evaluatePolicy(rand(agent.ObservationInfo.Dimension))
ans = 10
You can now generate code for this policy function using MATLAB® Coder™. For more information,
see “Deploy Trained Reinforcement Learning Policies”.
You can create and train a policy object in a custom training loop or extract a trained object from a
trained agent. For this example, load the PG agent trained in “Train PG Agent to Balance Cart-Pole
System”, and extract its greedy policy using getGreedyPolicy. Alternatively, you can extract an
explorative policy using getExplorationPolicy.
load("MATLABCartpolePG.mat","agent")
policy = getGreedyPolicy(agent)
policy =
rlStochasticActorPolicy with properties:
Then, create a policy evaluation function for this policy using default names.
generatePolicyFunction(policy);
This command creates the evaluatePolicy.m file, which contains the policy function, and the
agentData.mat file, which contains the trained deep neural network actor.
type evaluatePolicy.m
function action1 = evaluatePolicy(observation1)
persistent policy;
if isempty(policy)
policy = coder.loadRLPolicy("agentData.mat");
end
% evaluate the policy
action1 = getAction(policy,observation1);
end
evaluatePolicy(rand(policy.ObservationInfo.Dimension))
ans = 10
You can now generate code for this policy function using MATLAB® Coder™. For more information,
see “Deploy Trained Reinforcement Learning Policies”.
This example shows how to create a policy evaluation function for a Q-Learning Agent.
For this example, load the Q-learning agent trained in “Train Reinforcement Learning Agent in Basic
Grid World”
load("basicGWQAgent.mat","qAgent")
Create a policy evaluation function for this agent and specify the name of the agent data file.
generatePolicyFunction(qAgent,"MATFileName","policyFile.mat")
This command creates the evaluatePolicy.m file, which contains the policy function, and the
policyFile.mat file, which contains the trained Q table value function.
type evaluatePolicy.m
function action1 = evaluatePolicy(observation1)
persistent policy;
if isempty(policy)
policy = coder.loadRLPolicy("policyFile.mat");
end
% evaluate the policy
action1 = getAction(policy,observation1);
end
evaluatePolicy(randi(25))
ans = 3
You can now generate code for this policy function using MATLAB® Coder™. For more information,
see “Deploy Trained Reinforcement Learning Policies”.
Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object
Trained reinforcement learning agent, specified as one of the following agent objects. To train your
agent, use the train function.
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlPGAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent
For agents with a stochastic actor (PG, PPO, SAC, TRPO, AC), the action returned by the generated
policy function depends on the value of the UseExplorationPolicy property of the agent. By
default, UseExplorationPolicy is false and the generated action is deterministic. If
UseExplorationPolicy is true, the generated action is stochastic.
policy — Policy object
rlMaxQPolicy | rlDeterministicActorPolicy | rlStochasticActorPolicy
Policy object, specified as one of the following:
• rlMaxQPolicy
• rlDeterministicActorPolicy
• rlStochasticActorPolicy
Name-Value Arguments
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: FunctionName="computeAction"
PolicyName — Name of the policy object within the generated function
string | character vector
Name of the policy object within the generated function, specified as a string or character vector.
MATFileName — Name of generated data file
string | character vector
Name of generated data file, specified as a string or character vector. If a file with the specified name
already exists in the current MATLAB folder, then an appropriate digit is added to the name so that
no existing file is overwritten.
The generated data file contains four structures that store data needed to fully characterize the
policy. The evaluation function loads this data file to properly initialize itself the first time it is called.
Version History
Introduced in R2019a
The code generated by generatePolicyFunction now loads a deployable policy object from a
reinforcement learning agent. The results from running the generated policy function remain the
same.
See Also
generatePolicyBlock | Policy | train | dlnetwork
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
“Deploy Trained Reinforcement Learning Policies”
generateRewardFunction
Generate a reward function from control specifications to train a reinforcement learning agent
Syntax
generateRewardFunction(mpcobj)
generateRewardFunction(blks)
generateRewardFunction( ___ ,'FunctionName',myFcnName)
Description
generateRewardFunction(mpcobj) generates a MATLAB reward function based on the cost and
constraints defined in the linear or nonlinear MPC object mpcobj. The generated reward function is
displayed in a new editor window and you can use it as a starting point for reward design. You can
tune the weights, use a different penalty function, and then use the resulting reward function within
an environment to train an agent.
Examples
This example shows how to generate a reinforcement learning reward function from an MPC object.
Create a random plant using the rss function and set the feedthrough matrix to zero.
plant = rss(4,3,2);
plant.d = 0;
Specify which of the plant signals are manipulated variables, measured disturbances, measured
outputs and unmeasured outputs.
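The signal-assignment code is not shown in this excerpt. One possible assignment, using setmpcsignals (the channel indices here are an assumption for illustration only), is:
% Assume input 1 is a manipulated variable, input 2 a measured disturbance,
% outputs 1 and 2 are measured, and output 3 is unmeasured.
plant = setmpcsignals(plant,'MV',1,'MD',2,'MO',[1 2],'UO',3);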
Create an MPC controller with a sample time of 0.1 and prediction and control horizons of 10 and 3
steps, respectively.
mpcobj = mpc(plant,0.1,10,3);
Generate the reward function code from specifications in the mpc object using
generateRewardFunction. The code is displayed in the MATLAB Editor.
generateRewardFunction(mpcobj)
For this example, the generated code is saved in the MATLAB function file myMpcRewardFcn.m. Display
the generated reward function.
type myMpcRewardFcn.m
%#codegen
mvmin = -2;
mvmax = 2;
mvratemin = -Inf;
mvratemax = Inf;
%% Compute cost
dy = (refy(:)-y(:)) ./ Sy';
dmv = (refmv(:)-mv(:)) ./ Smv';
dmvrate = (mv(:)-lastmv(:)) ./ Smv';
Jy = dy' * diag(Qy.^2) * dy;
Jmv = dmv' * diag(Qmv.^2) * dmv;
Jmvrate = dmvrate' * diag(Qmvrate.^2) * dmvrate;
Cost = Jy + Jmv + Jmvrate;
%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
%
% Set Pmv value to 0 if the RL agent action specification has
% appropriate 'LowerLimit' and 'UpperLimit' values.
Py = Wy * exteriorPenalty(y,ymin,ymax,'step');
Pmv = Wmv * exteriorPenalty(mv,mvmin,mvmax,'step');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'step');
Penalty = Py + Pmv + Pmvrate;
%% Compute reward
reward = -(Cost + Penalty);
end
The calculated reward depends only on the current values of the plant input and output signals and
their reference values, and it is composed of two parts.
The first is a negative cost that depends on the squared difference between desired and current plant
inputs and outputs. This part uses the cost function weights specified in the MPC object. The second
part is a penalty that acts as a negative reward whenever the current plant signals violate the
constraints.
The generated reward function is a starting point for reward design. You can tune the weights or use
a different penalty function to define a more appropriate reward for your reinforcement learning
agent.
This example shows how to generate a reinforcement learning reward function from a Simulink
Design Optimization model verification block.
For this example, open the Simulink model LevelCheckBlock.slx, which contains a Check Step
Response Characteristics block named Level Check.
open_system('LevelCheckBlock')
Generate the reward function code from specifications in the Level Check block, using
generateRewardFunction. The code is displayed in the MATLAB Editor.
generateRewardFunction('LevelCheckBlock/Level Check')
For this example, the code is saved in the MATLAB function file myBlockRewardFcn.m.
type myBlockRewardFcn.m
%#codegen
Block1_InitialValue = 1;
Block1_FinalValue = 2;
Block1_StepTime = 0;
Block1_StepRange = Block1_FinalValue - Block1_InitialValue;
Block1_MinRise = Block1_InitialValue + Block1_StepRange * 80/100;
Block1_MaxSettling = Block1_InitialValue + Block1_StepRange * (1+2/100);
Block1_MinSettling = Block1_InitialValue + Block1_StepRange * (1-2/100);
Block1_MaxOvershoot = Block1_InitialValue + Block1_StepRange * (1+10/100);
Block1_MinUndershoot = Block1_InitialValue - Block1_StepRange * 5/100;
if t >= Block1_StepTime
if Block1_InitialValue <= Block1_FinalValue
Block1_UpperBoundTimes = [0,5; 5,max(5+1,t+1)];
Block1_UpperBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettlin
Block1_LowerBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
Block1_LowerBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,
else
Block1_UpperBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
Block1_UpperBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; Block1_MinRise,
Block1_LowerBoundTimes = [0,5; 5,max(5+1,t+1)];
Block1_LowerBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; Block1_MaxSettlin
end
Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1));
for idx = 1:numel(Block1_xmax)
tseg = Block1_UpperBoundTimes(idx,:);
xseg = Block1_UpperBoundAmplitudes(idx,:);
Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN);
end
if all(isnan(Block1_xmax))
Block1_xmax = Inf;
else
Block1_xmax = max(Block1_xmax,[],'omitnan');
end
Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1));
for idx = 1:numel(Block1_xmin)
tseg = Block1_LowerBoundTimes(idx,:);
xseg = Block1_LowerBoundAmplitudes(idx,:);
Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN);
end
if all(isnan(Block1_xmin))
Block1_xmin = -Inf;
else
Block1_xmin = max(Block1_xmin,[],'omitnan');
end
else
Block1_xmin = -Inf;
Block1_xmax = Inf;
end
%% Compute penalty
%% Compute reward
reward = -Weight * Penalty;
end
The generated reward function takes as input arguments the current value of the verification block
input signals and the simulation time. A negative reward is calculated using a weighted penalty that
acts whenever the current block input signals violate the linear bound constraints defined in the
verification block.
The generated reward function is a starting point for reward design. You can tune the weights or use
a different penalty function to define a more appropriate reward for your reinforcement learning
agent.
close_system('LevelCheckBlock')
Input Arguments
mpcobj — Linear or nonlinear MPC object
mpc object | nlmpc object
Linear or nonlinear MPC object, specified as an mpc object or an nlmpc object, respectively.
Note that:
• The generated function calculates rewards using signal values at the current time only. Predicted
future values, signal previewing, and control horizon settings are not used in the reward
calculation.
• Using time-varying cost weights and constraints, or updating them online, is not supported.
• Only the standard quadratic cost function, as described in “Optimization Problem” (Model
Predictive Control Toolbox), is supported. Therefore, for mpc objects, using mixed constraint
specifications is not supported. Similarly, for nlmpc objects, custom cost and constraint
specifications are not supported.
blks — Path to model verification blocks
character array | cell array | string array
Path to one or more Simulink Design Optimization model verification blocks, such as the Check Step
Response Characteristics block used in the previous example, specified as a character array, cell
array, or string array.
Example: "mySimulinkModel02/Check Against Reference"
Tips
By default, the exterior bound penalty function exteriorPenalty is used to calculate the penalty.
Alternatively, to calculate hyperbolic and barrier penalties, you can use the hyperbolicPenalty or
barrierPenalty functions.
Version History
Introduced in R2021b
See Also
Functions
exteriorPenalty | hyperbolicPenalty | barrierPenalty
Objects
mpc | nlmpc
Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Generate Reward Function from a Model Verification Block for a Water Tank System”
“Define Reward Signals”
“Create MATLAB Reinforcement Learning Environments”
“Create Simulink Reinforcement Learning Environments”
getActionInfo
Obtain action data specifications from reinforcement learning environment, agent, or experience
buffer
Syntax
actInfo = getActionInfo(env)
actInfo = getActionInfo(agent)
actInfo = getActionInfo(buffer)
Description
actInfo = getActionInfo(env) extracts action information from reinforcement learning
environment env.
actInfo = getActionInfo(agent) extracts action information from reinforcement learning
agent agent.
actInfo = getActionInfo(buffer) extracts action information from experience buffer buffer.
Examples
Extract action and observation information that you can use to create other environments or agents.
The reinforcement learning environment for this example is the simple longitudinal dynamics for ego
car and lead car. The training goal is to make the ego car travel at a set velocity while maintaining a
safe distance from lead car by controlling longitudinal acceleration (and braking). This example uses
the same vehicle model as the “Adaptive Cruise Control System Using Model Predictive Control”
(Model Predictive Control Toolbox) example.
mdl = 'rlACCMdl';
open_system(mdl);
agentblk = [mdl '/RL Agent'];
% create the observation info
obsInfo = rlNumericSpec([3 1],'LowerLimit',-inf*ones(3,1),'UpperLimit',inf*ones(3,1));
obsInfo.Name = 'observations';
obsInfo.Description = 'information on velocity error and ego velocity';
% action Info
actInfo = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
actInfo.Name = 'acceleration';
% define environment
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo)
env =
SimulinkEnvWithAgent with properties:
Model : rlACCMdl
AgentBlock : rlACCMdl/RL Agent
ResetFcn : []
UseFastRestart : on
The reinforcement learning environment env is a SimulinkEnvWithAgent object with the above
properties.
Extract the action and observation information from the reinforcement learning environment env.
actInfoExt = getActionInfo(env)
actInfoExt =
rlNumericSpec with properties:
LowerLimit: -3
UpperLimit: 2
Name: "acceleration"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
obsInfoExt = getObservationInfo(env)
obsInfoExt =
rlNumericSpec with properties:
The action information contains acceleration values while the observation information contains the
velocity and velocity error values of the ego vehicle.
Input Arguments
env — Reinforcement learning environment
rlFunctionEnv object | SimulinkEnvWithAgent object | rlNeuralNetworkEnvironment object
| predefined MATLAB environment object
Reinforcement learning environment from which to extract the action information, specified as one of
the following:
• rlFunctionEnv
• SimulinkEnvWithAgent object created using rlSimulinkEnv or createIntegratedEnv
• rlNeuralNetworkEnvironment
• Predefined MATLAB environment created using rlPredefinedEnv
For more information on reinforcement learning environments, see “Create MATLAB Reinforcement
Learning Environments” and “Create Simulink Reinforcement Learning Environments”.
Reinforcement learning agent from which to extract the action information, specified as one of the
following objects.
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlPGAgent
• rlACAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent
• rlMBPOAgent
For more information on reinforcement learning agents, see “Reinforcement Learning Agents”.
Experience buffer from which to extract the action information, specified as an rlReplayMemory or
rlPrioritizedReplayMemory object.
Output Arguments
actInfo — Action data specifications
array of rlNumericSpec objects | array of rlFiniteSetSpec objects
Action data specifications extracted from the reinforcement learning environment, returned as an
array of one of the following:
• rlNumericSpec objects
• rlFiniteSetSpec objects
• A mix of rlNumericSpec and rlFiniteSetSpec objects
Version History
Introduced in R2019a
See Also
rlNumericSpec | rlFiniteSetSpec | getObservationInfo | rlQAgent | rlSARSAAgent |
rlDQNAgent | rlPGAgent | rlACAgent | rlDDPGAgent
Topics
“Create Simulink Reinforcement Learning Environments”
“Reinforcement Learning Agents”
getAction
Package: rl.policy
Obtain action from agent, actor, or policy object given environment observations
Syntax
action = getAction(agent,obs)
[action,agent] = getAction(agent,obs)
action = getAction(actor,obs)
[action,nextState] = getAction(actor,obs)
action = getAction(policy,obs)
[action,updatedPolicy] = getAction(policy,obs)
Description
Agent
action = getAction(agent,obs) returns the action generated from the policy of the
reinforcement learning agent agent, given environment observations obs.
[action,agent] = getAction(agent,obs) also returns agent itself as an output argument,
updated when, for example, the agent uses a recurrent neural network whose state changes after
computing the action.
Actor
action = getAction(actor,obs) returns the action generated from the policy represented by
the actor actor, given environment observations obs.
[action,nextState] = getAction(actor,obs) also returns the updated state of the actor when
it uses a recurrent neural network.
Policy
action = getAction(policy,obs) returns the action generated from the policy object policy,
given environment observations obs.
[action,updatedPolicy] = getAction(policy,obs) also returns the updated policy object,
with its internal states (if any) updated.
Examples
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a TRPO agent from the environment observation and action specifications.
agent = rlTRPOAgent(obsInfo,actInfo);
getAction(agent, ...
{rand(obsInfo(1).Dimension), ...
rand(obsInfo(2).Dimension)})
You can also obtain actions for a batch of observations. For example, obtain actions for a batch of 10
observations.
ans = 1×3
1 1 10
actBatch{1}(1,1,7)
ans = -2
Create observation and action information. You can also obtain these specifications from an
environment.
fullyConnectedLayer(20,'Name','CriticStateFC2')
fullyConnectedLayer(actinfo.Dimension(1),'Name','fc2')
tanhLayer('Name','tanh1')];
net = dlnetwork(net);
act = getAction(actor,{rand(4,1,10)})
act is a single cell array that contains the two computed actions for all 10 observations in the batch.
act{1}(:,1,7)
0.2643
-0.2934
Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.
Alternatively, you can use getObservationInfo and getActionInfo to extract the specification
objects from an environment.
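The specification objects themselves are not shown in this excerpt; a minimal sketch matching these dimensions is:
% Continuous four-dimensional observation space and two-dimensional action space.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);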
Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.
To approximate the policy function within the actor, use a recurrent deep neural network model.
Define the network as an array of layer objects, and get the dimension of the observation and action
spaces from the environment specification objects. To create a recurrent network, use a
sequenceInputLayer as the input layer (with size equal to the number of dimensions of the
observation channel) and include at least one lstmLayer.
layers = [
sequenceInputLayer(obsInfo.Dimension(1))
lstmLayer(2)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Number of learnables: 62
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
Check the actor with a random observation input.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
    0.0568
    0.0691
Create an additive noise policy object from the actor.
policy = rlAdditiveNoisePolicy(actor)
policy =
rlAdditiveNoisePolicy with properties:
Use getAction to generate an action from the policy, given a random observation input.
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
0.5922
-0.3745
Display the state of the recurrent neural network in the policy object.
xNN = getRNNState(policy);
xNN{1}
0
0
Use getAction to return the action and the updated policy, given a random observation input.
[act,updatedPolicy] = getAction(policy,{rand(obsInfo.Dimension)});
Display the state of the recurrent neural network in the updated policy object.
xpNN = getRNNState(updatedPolicy);
xpNN{1}
0.3327
-0.2479
Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.
Note agent is a handle object, so it is updated whether it is returned as an output argument or not.
For more information about handle objects, see “Handle Object Behavior”.
actor — Actor
rlContinuousDeterministicActor object | rlContinuousGaussianActor object |
rlDiscreteCategoricalActor object
policy — Reinforcement learning policy
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy | rlAdditiveNoisePolicy | rlStochasticActorPolicy
Reinforcement learning policy, specified as one of the following objects:
• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy
obs — Environment observations
cell array
Environment observations, specified as a cell array with as many elements as there are observation
input channels. Each element of obs contains an array of observations for a single observation input
channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Output Arguments
action — Action
single-element cell array
Action, returned as a single-element cell array containing an array with dimensions MA-by-LB-by-LS,
where:
• MA corresponds to the dimensions of the associated action specification.
• LB is the batch size.
• LS is the sequence length (used with recurrent neural networks).
Note The following continuous action-space actor, policy and agent objects do not enforce the
constraints set by the action specification:
• rlContinuousDeterministicActor
• rlStochasticActorPolicy
• rlACAgent
• rlPGAgent
• rlPPOAgent
In these cases, you must enforce action space constraints within the environment.
nextState — Next state of the actor
cell array
Next state of the actor, returned as a cell array. If actor does not use a recurrent neural network,
then nextState is an empty cell array.
You can set the state of the actor to nextState using the setState function. For example:
actor = setState(actor,nextState);
agent — Updated agent
reinforcement learning agent object
Updated agent, returned as the same agent object as the agent in the input argument. Note that
agent is a handle object. Therefore, its internal states (if any) are updated whether agent is
returned as an output argument or not. For more information about handle objects, see “Handle
Object Behavior”.
updatedPolicy — Updated policy
policy object
Updated policy object. It is identical to the policy object supplied as the first input argument, except
that its internal states (if any) are updated.
Tips
The function evaluate behaves, for actor objects, similarly to getAction, except for the following
differences:
• For an rlDiscreteCategoricalActor object, evaluate returns the probability of each possible
action, while getAction returns a sampled action.
• For an rlContinuousGaussianActor object, evaluate returns the mean and standard deviation
of the output distribution, while getAction returns a sampled action.
Version History
Introduced in R2020a
See Also
evaluate | getValue | getMaxQValue
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
getActor
Package: rl.agent
Obtain actor from reinforcement learning agent
Syntax
actor = getActor(agent)
Description
actor = getActor(agent) returns the actor object from the specified reinforcement learning
agent.
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
actor = getActor(agent);
params = getLearnableParameters(actor)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
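The modification code is not included in this excerpt. Because params is a cell array of parameter values, one possible sketch of this step is:
% Multiply every learnable parameter array by 2.
modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);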
Set the parameter values of the actor to the new modified values.
actor = setLearnableParameters(actor,modifiedParams);
setActor(agent,actor);
getLearnableParameters(getActor(agent))
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
plot(layerGraph(modifiedActorNet))
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the agent.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
agent — Reinforcement learning agent
rlDDPGAgent object | rlTD3Agent object | rlPGAgent object | rlACAgent object | rlPPOAgent
object | rlSACAgent object
Reinforcement learning agent that contains an actor, specified as one of the following:
• rlPGAgent object
• rlDDPGAgent object
• rlTD3Agent object
• rlACAgent object
• rlSACAgent object
• rlPPOAgent object
• rlTRPOAgent object
Output Arguments
actor — Actor
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
Version History
Introduced in R2019a
See Also
getCritic | setActor | setCritic | getModel | setModel | getLearnableParameters |
setLearnableParameters
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
getCritic
Package: rl.agent
Obtain critic from reinforcement learning agent
Syntax
critic = getCritic(agent)
Description
critic = getCritic(agent) returns the critic object from the specified reinforcement learning
agent.
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
critic = getCritic(agent);
params = getLearnableParameters(critic)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
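The modification code is not included in this excerpt; since params is a cell array of parameter values, a minimal sketch of this step is:
% Multiply every learnable parameter array by 2.
modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);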
Set the parameter values of the critic to the new modified values.
critic = setLearnableParameters(critic,modifiedParams);
setCritic(agent,critic);
getLearnableParameters(getCritic(agent))
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
plot(layerGraph(modifiedActorNet))
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the agent.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
agent — Reinforcement learning agent
rlQAgent | rlSARSAAgent | rlDQNAgent | rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent
| rlSACAgent | rlPPOAgent | rlTRPOAgent
Reinforcement learning agent that contains a critic, specified as one of the following objects:
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent (when using a critic to estimate a baseline value function)
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
Output Arguments
critic — Critic
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object | two-
element row vector of rlQValueFunction objects
Version History
Introduced in R2019a
See Also
getActor | setActor | setCritic | getModel | setModel | getLearnableParameters |
setLearnableParameters
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
getLearnableParameters
Package: rl.policy
Obtain learnable parameter values from agent, function approximator, or policy object
Syntax
pars = getLearnableParameters(agent)
pars = getLearnableParameters(fcnAppx)
pars = getLearnableParameters(policy)
Description
Agent
pars = getLearnableParameters(agent) returns the learnable parameter values from the
reinforcement learning agent agent.
Actor or Critic
pars = getLearnableParameters(fcnAppx) returns the learnable parameter values from the
actor or critic function approximator object fcnAppx.
Policy
pars = getLearnableParameters(policy) returns the learnable parameter values from the
policy object policy.
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
critic = getCritic(agent);
params = getLearnableParameters(critic)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
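A minimal sketch of this step (params is a cell array of parameter values):
% Multiply every learnable parameter array by 2.
modifiedParams = cellfun(@(x) x*2,params,'UniformOutput',false);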
Set the parameter values of the critic to the new modified values.
critic = setLearnableParameters(critic,modifiedParams);
setCritic(agent,critic);
getLearnableParameters(getCritic(agent))
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
actor = getActor(agent);
params = getLearnableParameters(actor)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
Set the parameter values of the actor to the new modified values.
actor = setLearnableParameters(actor,modifiedParams);
setActor(agent,actor);
getLearnableParameters(getActor(agent))
Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.
fcnAppx — Actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
To create an actor or critic function object, use one of the following methods.
policy — Reinforcement learning policy
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy | rlAdditiveNoisePolicy | rlStochasticActorPolicy
Reinforcement learning policy, specified as one of the following objects:
• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy
Output Arguments
pars — Learnable parameters
cell array
Learnable parameter values for the function object, returned as a cell array. You can modify these
parameter values and set them in the original agent or a different agent using the
setLearnableParameters function.
Version History
Introduced in R2019a
Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, getLearnableParameters now uses function approximator
objects instead.
See Also
setLearnableParameters | getActor | getCritic | setActor | setCritic
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
getMaxQValue
Obtain maximum estimated value over all possible actions from a Q-value function critic with discrete
action space, given environment observations
Syntax
[maxQ,maxActionIndex] = getMaxQValue(qValueFcnObj,obs)
[maxQ,maxActionIndex,state] = getMaxQValue( ___ )
Description
[maxQ,maxActionIndex] = getMaxQValue(qValueFcnObj,obs) evaluates the discrete-action-
space Q-value function critic qValueFcnObj and returns the maximum estimated value over all
possible actions maxQ, with the corresponding action index maxActionIndex, given environment
observations obs.
Examples
Create observation and action specification objects (or alternatively use getObservationInfo
and getActionInfo to extract the specification objects from an environment). For this example,
define the observation space as a continuous three-dimensional space, and the action space as a finite
set consisting of three possible values (named -1, 0, and 1).
obsInfo = rlNumericSpec([3 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);
Create a custom basis function to approximate the Q-value function within the critic, and define an
initial parameter vector.
myBasisFcn = @(myobs,myact) [ ...
ones(4,1);
myobs(:); myact;
myobs(:).^2; myact.^2;
sin(myobs(:)); sin(myact);
cos(myobs(:)); cos(myact) ];
W0 = rand(20,1);
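The critic creation is not shown in this excerpt. One way to create it from the basis function and the initial parameter vector, assuming the custom basis function syntax of rlQValueFunction, is:
% Q-value critic based on the custom basis function and initial weights W0.
critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo);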
Use getMaxQValue to return the maximum value, among the possible actions, given a random
observation. Also return the index corresponding to the action that maximizes the value.
[v,i] = getMaxQValue(critic,{rand(3,1)})
v = 9.0719
i = 3
Create a batch set of 64 random independent observations. The third dimension is the batch size,
while the fourth is the sequence length for any recurrent neural network used by the critic (in this
case not used).
batchobs = rand(3,1,64,1);
bv = getMaxQValue(critic,{batchobs});
size(bv)
ans = 1×2
1 64
bv(44)
ans = 10.4138
Input Arguments
qValueFcnObj — Q-value function critic
rlQValueFunction object | rlVectorQValueFunction object
obs — Environment observations
cell array
Environment observations, specified as a cell array with as many elements as there are observation
input channels. Each element of obs contains an array of observations for a single observation input
channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Output Arguments
maxQ — Maximum Q-value estimate
array
Maximum Q-value estimate across all possible discrete actions, returned as a 1-by-LB-by-LS array,
where:
• LB is the batch size.
• LS is the sequence length (used with recurrent neural networks).
maxActionIndex — Action index
array
Action index corresponding to the maximum Q-value, returned as a 1-by-LB-by-LS array, where LB
and LS are defined as above.
state — Updated state of critic
cell array
Updated state of qValueFcnObj, returned as a cell array. If qValueFcnObj does not use a recurrent
neural network, then state is an empty cell array.
You can set the state of the critic to state using the setState function. For example:
qValueFcnObj = setState(qValueFcnObj,state);
Version History
Introduced in R2020a
See Also
getValue | evaluate | getAction
Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
getModel
Package: rl.function
Obtain function approximation model from actor or critic function approximator object
Syntax
model = getModel(fcnAppx)
Description
model = getModel(fcnAppx) returns the function approximation model used by the actor or critic
function object fcnAppx.
Examples
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
plot(layerGraph(modifiedActorNet))
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the agent.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
fcnAppx — Actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
To create an actor or critic function object, use one of the following methods.
Note For agents with more than one critic, such as TD3 and SAC agents, you must call getModel for
each critic representation individually, rather than calling getModel for the array returned by
getCritic.
critics = getCritic(myTD3Agent);
criticNet1 = getModel(critics(1));
criticNet2 = getModel(critics(2));
Output Arguments
model — Function approximation model
dlnetwork object | rlTable object | 1-by-2 cell array
Version History
Introduced in R2020b
Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, getModel now uses function approximator objects instead.
Starting from R2021b, built-in agents use dlnetwork objects as actor and critic representations, so
getModel returns a dlnetwork object.
• Due to numerical differences in the network calculations, previously trained agents might behave
differently. If this happens, you can retrain your agents.
• To use Deep Learning Toolbox™ functions that do not support dlnetwork, you must convert the
network to layerGraph. For example, to use deepNetworkDesigner, replace
deepNetworkDesigner(network) with deepNetworkDesigner(layerGraph(network)).
See Also
getActor | setActor | getCritic | setCritic | setModel | dlnetwork
Topics
“Create Policies and Value Functions”
getObservationInfo
Obtain observation data specifications from reinforcement learning environment, agent, or
experience buffer
Syntax
obsInfo = getObservationInfo(env)
obsInfo = getObservationInfo(agent)
obsInfo = getObservationInfo(buffer)
Description
obsInfo = getObservationInfo(env) extracts observation information from reinforcement
learning environment env.
obsInfo = getObservationInfo(agent) extracts observation information from reinforcement
learning agent agent.
obsInfo = getObservationInfo(buffer) extracts observation information from experience
buffer buffer.
Examples
Extract action and observation information that you can use to create other environments or agents.
The reinforcement learning environment for this example is the simple longitudinal dynamics for ego
car and lead car. The training goal is to make the ego car travel at a set velocity while maintaining a
safe distance from lead car by controlling longitudinal acceleration (and braking). This example uses
the same vehicle model as the “Adaptive Cruise Control System Using Model Predictive Control”
(Model Predictive Control Toolbox) example.
mdl = 'rlACCMdl';
open_system(mdl);
agentblk = [mdl '/RL Agent'];
% create the observation info
obsInfo = rlNumericSpec([3 1],'LowerLimit',-inf*ones(3,1),'UpperLimit',inf*ones(3,1));
obsInfo.Name = 'observations';
obsInfo.Description = 'information on velocity error and ego velocity';
% action Info
actInfo = rlNumericSpec([1 1],'LowerLimit',-3,'UpperLimit',2);
actInfo.Name = 'acceleration';
% define environment
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo)
env =
SimulinkEnvWithAgent with properties:
Model : rlACCMdl
AgentBlock : rlACCMdl/RL Agent
ResetFcn : []
UseFastRestart : on
The reinforcement learning environment env is a SimulinkEnvWithAgent object with the above
properties.
Extract the action and observation information from the reinforcement learning environment env.
actInfoExt = getActionInfo(env)
actInfoExt =
rlNumericSpec with properties:
LowerLimit: -3
UpperLimit: 2
Name: "acceleration"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
obsInfoExt = getObservationInfo(env)
obsInfoExt =
rlNumericSpec with properties:
The action information contains acceleration values while the observation information contains the
velocity and velocity error values of the ego vehicle.
Input Arguments
env — Reinforcement learning environment
rlFunctionEnv object | SimulinkEnvWithAgent object | rlNeuralNetworkEnvironment object
| predefined MATLAB environment object
Reinforcement learning environment from which to extract the observation information, specified as
one of the following objects.
• rlFunctionEnv
• SimulinkEnvWithAgent object created using rlSimulinkEnv or createIntegratedEnv
• rlNeuralNetworkEnvironment
• Predefined MATLAB environment created using rlPredefinedEnv
For more information on reinforcement learning environments, see “Create MATLAB Reinforcement
Learning Environments” and “Create Simulink Reinforcement Learning Environments”.
Reinforcement learning agent from which to extract the observation information, specified as one of
the following objects.
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlDDPGAgent
• rlTD3Agent
• rlPGAgent
• rlACAgent
• rlPPOAgent
• rlTRPOAgent
• rlSACAgent
• rlMBPOAgent
For more information on reinforcement learning agents, see “Reinforcement Learning Agents”.
Experience buffer from which to extract the observation information, specified as an rlReplayMemory
or rlPrioritizedReplayMemory object.
Output Arguments
obsInfo — Observation data specifications
array of rlNumericSpec objects | array of rlFiniteSetSpec objects
Observation data specifications extracted from the reinforcement learning environment, returned as
an array of one of the following:
• rlNumericSpec objects
• rlFiniteSetSpec objects
• A mix of rlNumericSpec and rlFiniteSetSpec objects
Version History
Introduced in R2019a
See Also
rlNumericSpec | rlFiniteSetSpec | getActionInfo | rlQAgent | rlSARSAAgent |
rlDQNAgent | rlPGAgent | rlACAgent | rlDDPGAgent
Topics
“Create Simulink Reinforcement Learning Environments”
“Reinforcement Learning Agents”
getValue
Obtain estimated value from a critic given environment observations and actions
Syntax
value = getValue(valueFcnAppx,obs)
value = getValue(vqValueFcnAppx,obs)
value = getValue(qValueFcnAppx,obs,act)
Description
Value Function Critic
value = getValue(valueFcnAppx,obs) evaluates the value function critic valueFcnAppx and
returns the value corresponding to the observation obs.
Q-Value Function Critics
value = getValue(vqValueFcnAppx,obs) evaluates the discrete-action-space Q-value function
critic vqValueFcnAppx and returns the vector value, in which each element represents the
estimated value given the observation obs and the action corresponding to that element's position in
the discrete action space.
value = getValue(qValueFcnAppx,obs,act) evaluates the Q-value function critic
qValueFcnAppx and returns the value corresponding to the observation obs and the action act.
[value,state] = getValue( ___ ) also returns the updated state of the critic object when it
contains a recurrent neural network.
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
To approximate the value function within the critic, create a neural network. Define a single path
from the network input (the observation) to its output (the value), as an array of layer objects.
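Neither the specification object nor the layer array appears in this excerpt. A minimal sketch consistent with the network summary shown below (a 4-feature input and 5 learnable parameters) is:
% Continuous four-dimensional observation space.
obsInfo = rlNumericSpec([4 1]);
% Single path from the observation input to a scalar value output.
net = [
    featureInputLayer(4)
    fullyConnectedLayer(1)
    ];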
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net);
Initialized: true
Number of learnables: 5
Inputs:
1 'input' 4 features
Create a critic using the network and the observation specification object. When you use this syntax
the network input layer is automatically associated with the environment observation according to
the dimension specifications in obsInfo.
critic = rlValueFunction(net,obsInfo);
Obtain a value function estimate for a random single observation. Use an observation array with the
same dimensions as the observation specification.
val = getValue(critic,{rand(4,1)})
val = single
0.7904
You can also obtain value function estimates for a batch of observations. For example obtain value
functions for a batch of 20 observations.
batchVal = getValue(critic,{rand(4,1,20)});
size(batchVal)
ans = 1×2
1 20
batchVal contains one value function estimate for each observation in the batch.
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
the observation space as a continuous four-dimensional space, so that a single observation is a
column vector containing four doubles, and the action space as a finite set consisting of three
possible values (named 7, 5, and 3 in this case).
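The specification objects are not shown in this excerpt; a minimal sketch consistent with the description above is:
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);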
To approximate the Q-value function within the critic, create a neural network. Define a single path
from the network input to its output as array of layer objects. The input of the network must accept a
four-element vector, as defined by obsInfo. The output must be a single output layer having as many
elements as the number of possible discrete actions (three in this case, as defined by actInfo).
net = [featureInputLayer(4)
fullyConnectedLayer(3)];
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 15
Inputs:
1 'input' 4 features
Create the critic using the network, as well as the names of the observation and action specification
objects. The network input layers are automatically associated with the components of the
observation signals according to the dimension specifications in obsInfo.
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
Use getValue to return the values of a random observation, using the current network weights.
v = getValue(critic,{rand(obsInfo.Dimension)})
0.7232
0.8177
-0.2212
v contains three value function estimates, one for each possible discrete action.
You can also obtain value function estimates for a batch of observations. For example, obtain value
function estimates for a batch of 10 observations.
batchV = getValue(critic,{rand([obsInfo.Dimension 10])});
size(batchV)
ans = 1×2
     3    10
batchV contains three value function estimates for each observation in the batch.
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
the observation space as having two continuous channels, the first one carrying an 8-by-3
matrix, and the second one a continuous four-dimensional vector. The action specification is a
continuous column vector containing 2 doubles.
The output of the critic is the scalar W'*myBasisFcn(obs,act), representing the Q-value function
to be approximated.
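The specification objects, the basis function, and the critic construction are not shown here. A minimal sketch, in which myBasisFcn and the weight vector W are illustrative placeholders (the displayed results below were produced with the original, unshown definitions), might be:
% Observation and action specifications matching the description above
obsInfo = [rlNumericSpec([8 3]) rlNumericSpec([4 1])];
actInfo = rlNumericSpec([2 1]);

% Hypothetical basis function: stack both observation channels, the action, and a bias term
myBasisFcn = @(obsA,obsB,act) [obsA(:); obsB(:); act(:); 1];

% One weight per basis element
W = rand(8*3 + 4 + 2 + 1, 1);

% Q-value critic based on the custom basis function
critic = rlQValueFunction({myBasisFcn,W},obsInfo,actInfo);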
Use getValue to return the value of a random observation-action pair, using the current parameter
matrix.
v = getValue(critic,{rand(8,3),(1:4)'},{rand(2,1)})
v = 68.8628
Create a random observation set of batch size 64 for each channel. The third dimension is the batch
size, while the fourth is the sequence length for any recurrent neural network used by the critic (in
this case not used).
batchobs_ch1 = rand(8,3,64,1);
batchobs_ch2 = rand(4,1,64,1);
batchact = rand(2,1,64,1);
Obtain the state-action value function estimate for the batch of observations and actions.
bv = getValue(critic,{batchobs_ch1,batchobs_ch2},{batchact});
size(bv)
ans = 1×2
1 64
bv(23)
ans = 46.6310
Input Arguments
valueFcnAppx — Value function critic
rlValueFunction object
Value function critic, specified as an rlValueFunction object.
obs — Observations
cell array
Observations, specified as a cell array with as many elements as there are observation input
channels. Each element of obs contains an array of observations for a single observation input
channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
act — Action
single-element cell array
Action, specified as a single-element cell array that contains an array of action values.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Output Arguments
value — Estimated value function
array
Estimated value function, returned as an array.
state — Updated state of critic
cell array
Updated state of the critic, returned as a cell array. If the critic does not use a recurrent neural
network, then state is an empty cell array.
You can set the state of the critic to state using the setState function. For example:
valueFcnAppx = setState(valueFcnAppx,state);
Tips
The more general function evaluate behaves, for critic objects, similarly to getValue, except that
evaluate returns its results inside a single-element cell array.
Version History
Introduced in R2020a
See Also
evaluate | getAction | getMaxQValue
Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
gradient
Package: rl.function
Evaluate gradient of function approximator object given observation and action input data
Syntax
grad = gradient(fcnAppx,'output-input',inData)
grad = gradient(fcnAppx,'output-parameters',inData)
grad = gradient(fcnAppx,lossFcn,inData,fcnData)
Description
grad = gradient(fcnAppx,'output-input',inData) evaluates the gradient of the sum of the
outputs of the function approximator object fcnAppx with respect to its inputs. It returns the value of
the gradient grad when the input of fcnAppx is inData.
grad = gradient(fcnAppx,'output-parameters',inData) evaluates the gradient of the sum of
the outputs of fcnAppx with respect to its learnable parameters.
grad = gradient(fcnAppx,lossFcn,inData,fcnData) evaluates the gradient of the loss function
associated with the function handle lossFcn, with respect to the learnable parameters of fcnAppx. The
last optional argument fcnData can contain additional inputs for the loss function.
Examples
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space with three channels. The first channel carries an observation from a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles. The second channel carries a discrete observation made of a two-dimensional row vector
that can take one of five different values. The third channel carries a discrete scalar observation that
can be either zero or one. Finally, the action space is a continuous four-dimensional space, so that a
single action is a column vector containing four doubles, each between -10 and 10.
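The specification objects are not shown; a sketch consistent with the description above (the five allowed values of the second channel are illustrative) is:
obsInfo = [rlNumericSpec([3 1]) ...                               % continuous 3-D channel
           rlFiniteSetSpec({[1 2],[3 4],[5 6],[7 8],[9 10]}) ...  % discrete 1-by-2 channel (assumed values)
           rlFiniteSetSpec([0 1])];                               % discrete scalar channel
actInfo = rlNumericSpec([4 1],LowerLimit=-10,UpperLimit=10);      % continuous 4-D action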
To approximate the policy within the actor, use a recurrent deep neural network. For a continuous
Gaussian actor, the network must have two output layers (one for the mean values, the other for the
standard deviation values), each having as many elements as the dimension of the action space.
Create the network, defining each path as an array of layer objects. Use sequenceInputLayer as
the input layer and include an lstmLayer as one of the other network layers. Also use a softplus
layer to enforce nonnegativity of the standard deviations and a tanh layer to scale the mean values
to the desired output range. Get the dimensions of the observation and action spaces from the
environment specification objects, and specify a name for the input layers, so you can later explicitly
associate them with the appropriate environment channel.
% Connect layers
net = connectLayers(net,"infc1","cat/in1");
net = connectLayers(net,"infc2","cat/in2");
net = connectLayers(net,"infc3","cat/in3");
net = connectLayers(net,"jntfc","tanhMean/in");
net = connectLayers(net,"jntfc","tanhStdv/in");
% Plot network
plot(net)
% Convert to dlnetwork
net = dlnetwork(net);
Initialized: true
Inputs:
1 'netObsIn1' Sequence input with 3 dimensions
2 'netObsIn2' Sequence input with 2 dimensions
3 'netObsIn3' Sequence input with 1 dimensions
Create the actor with rlContinuousGaussianActor, using the network, the observation and
action specification objects, and the names of the network input and output layers.
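The actor construction is not shown. A sketch, assuming the mean and standard-deviation output layers are named "tanhMean" and "tanhStdv" (the layer names referenced by the connectLayers calls above; if the original network appends further scaling or softplus layers, use those final layer names instead), might be:
actor = rlContinuousGaussianActor(net,obsInfo,actInfo, ...
    ObservationInputNames=["netObsIn1","netObsIn2","netObsIn3"], ...
    ActionMeanOutputNames="tanhMean", ...
    ActionStandardDeviationOutputNames="tanhStdv");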
To return mean value and standard deviations of the Gaussian distribution as a function of the
current observation, use evaluate.
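The evaluate call is not shown; a sketch with a random observation for each channel is:
prob = evaluate(actor, ...
    {rand(obsInfo(1).Dimension), ...
     rand(obsInfo(2).Dimension), ...
     rand(obsInfo(3).Dimension)});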
The result is a cell array with two elements, the first one containing a vector of mean values, and the
second containing a vector of standard deviations.
prob{1}
-1.5454
0.4908
-0.1697
0.8081
prob{2}
0.6913
0.6141
0.7291
0.6475
-3.2003
-0.0534
-1.0700
-0.4032
Calculate the gradients of the sum of the outputs (all the mean values plus all the standard
deviations) with respect to the inputs, given a random observation.
gro = gradient(actor,"output-input", ...
    {rand(obsInfo(1).Dimension) , ...
     rand(obsInfo(2).Dimension) , ...
     rand(obsInfo(3).Dimension)} );
The result is a cell array with as many elements as the number of input channels. Each element
contains the derivatives of the sum of the outputs with respect to each component of the input
channel. Display the gradient with respect to the element of the second channel.
gro{2}
-1.3404
0.6642
Obtain the gradients for a batch of five independent sequences, each one made of nine sequential
observations.
gro_batch = gradient(actor,"output-input", ...
    {rand([obsInfo(1).Dimension 5 9]) , ...
     rand([obsInfo(2).Dimension 5 9]) , ...
     rand([obsInfo(3).Dimension 5 9])} );
Display the derivative of the sum of the outputs with respect to the third observation element of the
first input channel, after the seventh sequential observation in the fourth independent batch.
gro_batch{1}(3,4,7)
ans = single
0.2020
actor = accelerate(actor,true);
Calculate the gradients of the sum of the outputs with respect to the parameters, given a random
observation.
grp = gradient(actor,"output-parameters", ...
    {rand(obsInfo(1).Dimension) , ...
     rand(obsInfo(2).Dimension) , ...
     rand(obsInfo(3).Dimension)} )
{ 4x1 single}
{ 4x1 single}
{ 4x1 single}
{32x12 single}
{32x8 single}
{32x1 single}
{ 4x8 single}
{ 4x1 single}
{ 4x4 single}
{ 4x1 single}
{ 4x4 single}
{ 4x1 single}
Each array within a cell contains the gradient of the sum of the outputs with respect to a group of
parameters.
grp_batch = gradient(actor,"output-parameters", ...
{rand([obsInfo(1).Dimension 5 9]) , ...
rand([obsInfo(2).Dimension 5 9]) , ...
rand([obsInfo(3).Dimension 5 9])} )
If you use a batch of inputs, gradient uses the whole input sequence (in this case nine steps), and
all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, define
an observation space made of two channels. The first channel carries an observation from a
continuous four-dimensional space. The second carries a discrete scalar observation that can be
either zero or one. Finally, the action space consist of a scalar that can be -1, 0, or 1.
obsInfo = [rlNumericSpec([4 1])
rlFiniteSetSpec([0 1])];
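The corresponding action specification is not shown; a one-line sketch consistent with the description above is:
actInfo = rlFiniteSetSpec([-1 0 1]);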
To approximate the vector Q-value function within the critic, use a recurrent deep neural network.
The output layer must have three elements, each one expressing the value of executing the
corresponding action, given the observation.
Create the neural network, defining each network path as an array of layer objects. Get the
dimensions of the observation and action spaces from the environment specification objects, use
sequenceInputLayer as the input layer, and include an lstmLayer as one of the other network
layers.
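The layer arrays themselves are not shown. A sketch consistent with the layer names referenced by the connectLayers calls below (the layer sizes are illustrative) might be:
inPath1 = [sequenceInputLayer(prod(obsInfo(1).Dimension),Name="netObsIn1")
           fullyConnectedLayer(8,Name="infc1")];
inPath2 = [sequenceInputLayer(prod(obsInfo(2).Dimension),Name="netObsIn2")
           fullyConnectedLayer(8,Name="infc2")];
jointPath = [concatenationLayer(1,2,Name="cct")
             lstmLayer(8)
             fullyConnectedLayer(numel(actInfo.Elements))];   % one output per possible action
net = layerGraph;
net = addLayers(net,inPath1);
net = addLayers(net,inPath2);
net = addLayers(net,jointPath);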
% Connect layers
net = connectLayers(net,"infc1","cct/in1");
net = connectLayers(net,"infc2","cct/in2");
% Plot network
plot(net)
% Convert to dlnetwork
net = dlnetwork(net);
Initialized: true
Inputs:
1 'netObsIn1' Sequence input with 4 dimensions
2 'netObsIn2' Sequence input with 1 dimensions
Create the critic with rlVectorQValueFunction, using the network and the observation and action
specification objects.
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
To return the value of the actions as a function of the current observation, use getValue or
evaluate.
val = evaluate(critic, ...
    {rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)});
When you use evaluate, the result is a single-element cell array, containing a vector with the values
of all the possible actions, given the observation.
val{1}
-0.0054
-0.0943
0.0177
Calculate the gradients of the sum of the outputs with respect to the inputs, given a random
observation.
gro = gradient(critic,"output-input", ...
    {rand(obsInfo(1).Dimension) , ...
     rand(obsInfo(2).Dimension)} );
The result is a cell array with as many elements as the number of input channels. Each element
contains the derivative of the sum of the outputs with respect to each component of the input
channel. Display the gradient with respect to the element of the second channel.
gro{2}
ans = single
-0.0396
Obtain the gradients for a batch of five independent sequences, each one made of nine sequential
observations.
gro_batch = gradient(critic,"output-input", ...
    {rand([obsInfo(1).Dimension 5 9]) , ...
     rand([obsInfo(2).Dimension 5 9])} );
Display the derivative of the sum of the outputs with respect to the third observation element of the
first input channel, after the seventh sequential observation in the fourth independent batch.
gro_batch{1}(3,4,7)
ans = single
0.0443
critic = accelerate(critic,true);
Calculate the gradients of the sum of the outputs with respect to the parameters, given a random
observation.
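The gradient call itself is not shown; a sketch using a random observation for each channel (the variable name grp is illustrative) is:
grp = gradient(critic,"output-parameters", ...
    {rand(obsInfo(1).Dimension), ...
     rand(obsInfo(2).Dimension)} );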
Each array within a cell contains the gradient of the sum of the outputs with respect to a group of
parameters.
If you use a batch of inputs, gradient uses the whole input sequence (in this case nine steps), and
all the gradients with respect to the independent batch dimensions (in this case five) are added
together. Therefore, the returned gradient always has the same size as the output from
getLearnableParameters.
Input Arguments
fcnAppx — Function approximator object
function approximator object
• rlValueFunction,
• rlQValueFunction,
• rlVectorQValueFunction,
• rlDiscreteCategoricalActor,
• rlContinuousDeterministicActor,
• rlContinuousGaussianActor,
• rlContinuousDeterministicTransitionFunction,
• rlContinuousGaussianTransitionFunction,
• rlContinuousDeterministicRewardFunction,
• rlContinuousGaussianRewardFunction,
• rlIsDoneFunction.
inData — Input data for function approximator
cell array
Input data for the function approximator, specified as a cell array with as many elements as the
number of input channels of fcnAppx. In the following section, the number of observation channels is
indicated by NO.
• If fcnAppx is an rlQValueFunction, an
rlContinuousDeterministicTransitionFunction or an
rlContinuousGaussianTransitionFunction object, then each of the first NO elements of
inData must be a matrix representing the current observation from the corresponding
observation channel. They must be followed by a final matrix representing the action.
• If fcnAppx is a function approximator object representing an actor or critic (but not an
rlQValueFunction object), inData must contain NO elements, each one being a matrix
representing the current observation from the corresponding observation channel.
• If fcnAppx is an rlContinuousDeterministicRewardFunction, an
rlContinuousGaussianRewardFunction, or an rlIsDoneFunction object, then each of the
first NO elements of inData must be a matrix representing the current observation from the
corresponding observation channel. They must be followed by a matrix representing the action,
and finally by NO elements, each one being a matrix representing the next observation from the
corresponding observation channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Example: {rand(8,3,64,1),rand(4,1,64,1),rand(2,1,64,1)}
lossFcn — Loss function
function handle
Loss function, specified as a function handle to a user-defined function. The user-defined function can
either be an anonymous function or a function on the MATLAB path. The function's first input
parameter must be a cell array like the one returned from the evaluation of fcnAppx. For more
information, see the description of outData in evaluate. The second, optional, input argument of
lossFcn contains additional data that might be needed for the gradient calculation, as described
below in fcnData. For an example of the signature that this function must have, see “Train
Reinforcement Learning Policy Using Custom Training Loop”.
fcnData — Additional data for loss function
any MATLAB data type
Additional data for the loss function, specified as any MATLAB data type, typically a structure or cell
array. For an example, see “Train Reinforcement Learning Policy Using Custom Training Loop”.
Output Arguments
grad — Value of the gradient
cell array
When the type of gradient is from the sum of the outputs with respect to the inputs of fcnAppx, then
grad is a cell array in which each element contains the gradient of the sum of all the outputs with
respect to the corresponding input channel.
When the type of gradient is from the output with respect to the parameters of fcnAppx, then grad
is a cell array in which each element contains the gradient of the sum of outputs belonging to an
output channel with respect to the corresponding group of parameters. The gradient is calculated
using the whole history of LS inputs, and all the LB gradients with respect to the independent input
sequences are added together in grad. Therefore, grad always has the same size as the result from
getLearnableParameters.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
Version History
Introduced in R2022a
See Also
evaluate | accelerate | getLearnableParameters | rlValueFunction | rlQValueFunction
| rlVectorQValueFunction | rlContinuousDeterministicActor |
rlDiscreteCategoricalActor | rlContinuousGaussianActor |
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction
Topics
“Create Custom Reinforcement Learning Agents”
“Train Reinforcement Learning Policy Using Custom Training Loop”
hyperbolicPenalty
Hyperbolic penalty value for a point with respect to a bounded region
Syntax
p = hyperbolicPenalty(x,xmin,xmax)
p = hyperbolicPenalty( ___ ,lambda,tau)
Description
p = hyperbolicPenalty(x,xmin,xmax) calculates the nonnegative (hyperbolic) penalty vector p
for the point x with respect to the region bounded by xmin and xmax. p has the same dimension as x.
This syntax uses the default values of 1 and 0.1 for the lambda and tau parameters of the
hyperbolic function, respectively.
p = hyperbolicPenalty( ___ ,lambda,tau) specifies both the lambda and tau parameters of
the hyperbolic function. If lambda is an empty matrix, its default value is used. Likewise, if tau is an
empty matrix or is omitted, its default value is used.
Examples
This example shows how to use the hyperbolicPenalty function to calculate the hyperbolic
penalty for a given point with respect to a bounded region.
Calculate the penalty value for the point 0.1 within the interval [-2,2], using default values for the
lambda and tau parameters.
hyperbolicPenalty(0.1,-2,2)
ans = 0.0050
Calculate the penalty value for the point 4 outside the interval [-2,2].
hyperbolicPenalty(4,-2,2)
ans = 4.0033
Calculate the penalty value for the point 0.1 within the interval [-2,2], using a lambda parameter of
5.
hyperbolicPenalty(0.1,-2,2,5)
ans = 0.0010
Calculate the penalty value for the point 4 outside the interval [-2,2], using a lambda parameter of 5.
hyperbolicPenalty(4,-2,2,5)
ans = 20.0007
Calculate the penalty value for the point 4 outside the interval [-2,2], using a lambda parameter of 5 and a tau parameter of 0.5.
hyperbolicPenalty(4,-2,2,5,0.5)
ans = 20.0167
Calculate the penalty value for the point [-2,0,4] with respect to the box defined by the intervals
[0,1], [-1,1], and [-2,2] along the x, y, and z dimensions, respectively, using the default value for
lambda and a tau parameter of 0.
hyperbolicPenalty([-2 0 4]',[0 -1 -2]',[1 1 2]',[],0)
ans = 3×1
4
0
4
Create a vector of input points between -5 and 5.
x = -5:0.01:5;
Calculate penalties for all the points in the vector, using default values for the lambda and tau
parameters.
p = hyperbolicPenalty(x,-2,2);
plot(x,p)
grid
xlabel("point position");
ylabel("penalty value");
title("Penalty values over an interval");
Input Arguments
x — Point for which the penalty is calculated
scalar | vector | matrix
Point for which the penalty is calculated, specified as a numeric scalar, vector or matrix.
Example: [0.5; 1.6]
xmin — Lower bounds for x
scalar | vector | matrix
Lower bounds for x, specified as a numeric scalar, vector, or matrix. To use the same minimum value
for all elements in x, specify xmin as a scalar.
Example: -1
xmax — Upper bounds for x
scalar | vector | matrix
Upper bounds for x, specified as a numeric scalar, vector, or matrix. To use the same maximum value
for all elements in x, specify xmax as a scalar.
Example: 2
Output Arguments
p — Penalty value
nonnegative vector
Penalty value, returned as a vector of nonnegative elements. Each element p_i depends on the position
of x_i with respect to the interval specified by xmin_i and xmax_i. The hyperbolic penalty function
returns the value:
$$p(x) = -\lambda\,(x - x_{\min}) + \sqrt{\lambda^{2}(x - x_{\min})^{2} + \tau^{2}} \;-\; \lambda\,(x_{\max} - x) + \sqrt{\lambda^{2}(x_{\max} - x)^{2} + \tau^{2}}$$
Here, λ is the argument lambda, and τ is the argument tau. Note that for positive values of τ the
returned penalty value is always positive, because on the right side of the equation the magnitude of
the second term is always greater than that of the first, and the magnitude of the fourth term is
always greater than that of the third. If τ is zero, then the returned penalty is zero inside the interval
defined by the bounds, and it grows linearly with x outside this interval. If x is multidimensional, then
the calculation is applied independently on each dimension. Penalty functions are typically used to
generate negative rewards when constraints are violated, such as in generateRewardFunction.
Version History
Introduced in R2021b
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
See Also
Functions
generateRewardFunction | exteriorPenalty | barrierPenalty
Topics
“Generate Reward Function from a Model Predictive Controller for a Servomotor”
“Define Reward Signals”
inspectTrainingResult
Plot training information from a previous training session
Syntax
inspectTrainingResult(trainResults)
inspectTrainingResult(agentResults)
Description
By default, the train function shows the training progress and results in the Episode Manager
during training. If you configure training to not show the Episode Manager or you close the Episode
Manager after training, you can view the training results using the inspectTrainingResult
function, which opens the Episode Manager. You can also use inspectTrainingResult to view the
training results for agents saved during training.
Examples
For this example, assume that you have trained the agent in the “Train Reinforcement Learning
Agent in MDP Environment” example and subsequently closed the Episode Manager.
inspectTrainingResult(trainingStats)
For this example, load the environment and agent for the “Train Reinforcement Learning Agent in
MDP Environment” example.
load mdpAgentAndEnvironment
Specify options for training the agent. Configure the SaveAgentCriteria and SaveAgentValue
options to save all agents after episode 30.
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 50;
trainOpts.MaxEpisodes = 50;
trainOpts.Plots = "none";
trainOpts.SaveAgentCriteria = "EpisodeCount";
trainOpts.SaveAgentValue = 30;
Train the agent. During training, once the episode count reaches 30, a copy of the agent is saved in
the savedAgents folder at the end of each episode.
trainingStats = train(agent,env,trainOpts);
Load the training results for one of the saved agents. This command loads both the agent and a
structure that contains the corresponding training results.
load savedAgents/Agent50
View the training results from the saved agent result structure.
inspectTrainingResult(savedAgentResult)
The Episode Manager shows the training progress up to the episode in which the agent was saved.
Input Arguments
trainResults — Training episode data
structure | structure array
Training episode data, specified as a structure or structure array returned by the train function.
agentResults — Saved agent results
structure
Saved agent results, specified as a structure previously saved by the train function. The train
function saves agents when you specify the SaveAgentCriteria and SaveAgentValue options in
the rlTrainingOptions object used during training.
When you load a saved agent, the agent and its training results are added to the MATLAB workspace
as saved_agent and savedAgentResultStruct, respectively. To plot the training data for this
agent, use the following command.
inspectTrainingResult(savedAgentResultStruct)
For multi-agent training, savedAgentResultStruct contains structure fields with training results
for all the trained agents.
Version History
Introduced in R2021a
See Also
Functions
train
Topics
“Train Reinforcement Learning Agents”
predict
Package: rl.function
Predict next observation, next reward, or episode termination given observation and action input data
Syntax
predNextObs = predict(tsnFcnAppx,obs,act)
predReward = predict(rwdFcnAppx,obs,act,nextObs)
predIsDone = predict(idnFcnAppx,obs,act,nextObs)
Description
predNextObs = predict(tsnFcnAppx,obs,act) evaluates the environment transition function
approximator object tsnFcnAppx and returns the predicted next observation predNextObs, given the
current observation obs and the action act.
predReward = predict(rwdFcnAppx,obs,act,nextObs) evaluates the environment reward function
approximator object rwdFcnAppx and returns the predicted reward predReward, given the current
observation obs, the action act, and the next observation nextObs.
predIsDone = predict(idnFcnAppx,obs,act,nextObs) evaluates the environment is-done function
approximator object idnFcnAppx and returns the predicted termination status predIsDone, given the
current observation obs, the action act, and the next observation nextObs.
Examples
Create observation and action specification objects (or alternatively use getObservationInfo and
getActionInfo to extract the specification objects from an environment). For this example, two
observation channels carry vectors in a four- and two-dimensional space, respectively. The action is a
continuous three-dimensional vector.
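The specification objects are not shown; a sketch consistent with the description above is:
obsInfo = [rlNumericSpec([4 1]) rlNumericSpec([2 1])];
actInfo = rlNumericSpec([3 1]);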
Create a deep neural network to use as an approximation model for the transition function approximator.
For a continuous Gaussian transition function approximator, the network must have two output layers
for each observation channel (one for the mean values, the other for the standard deviation values).
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel.
inPath1 = [ featureInputLayer( ...
    prod(obsInfo(1).Dimension), ...
    Name="netObsIn1")
    fullyConnectedLayer(5,Name="infc1") ];
net = addLayers(net,sdevPath2);
% Connect layers
net = connectLayers(net,"infc1","concat/in1");
net = connectLayers(net,"infc2","concat/in2");
net = connectLayers(net,"infc3","concat/in3");
net = connectLayers(net,"jntfc","tanhMean1/in");
net = connectLayers(net,"jntfc","tanhStdv1/in");
net = connectLayers(net,"jntfc","tanhMean2/in");
net = connectLayers(net,"jntfc","tanhStdv2/in");
% Plot network
plot(net)
% Convert to dlnetwork
net=dlnetwork(net);
Initialized: true
Inputs:
1 'netObsIn1' 4 features
2 'netObsIn2' 2 features
3 'netActIn' 3 features
Create a continuous Gaussian transition function approximator object, specifying the names of all the
input and output layers.
tsnFcnAppx = rlContinuousGaussianTransitionFunction(...
net,obsInfo,actInfo,...
ObservationInputNames=["netObsIn1","netObsIn2"], ...
ActionInputNames="netActIn", ...
NextObservationMeanOutputNames=["scale1","scale2"], ...
NextObservationStandardDeviationOutputNames=["splus1","splus2"] );
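The predict call itself is not shown; a sketch using random inputs (the variable name predObs is illustrative) is:
predObs = predict(tsnFcnAppx, ...
    {rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)}, ...
    {rand(actInfo(1).Dimension)});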
Each element of the resulting cell array represents the prediction for the corresponding observation
channel.
To display the mean values and standard deviations of the Gaussian probability distribution for the
predicted observations, use evaluate.
predDst = evaluate(tsnFcnAppx, ...
{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension), ...
rand(actInfo(1).Dimension)})
The result is a cell array in which the first and second element represent the mean values for the
predicted observations in the first and second channel, respectively. The third and fourth element
represent the standard deviations for the predicted observations in the first and second channel,
respectively.
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the reward function, create a deep neural network. For this example, the network has
two input channels, one for the current action and one for the next observations. The single output
channel contains a scalar, which represents the value of the predicted reward.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specifications, and specify a name for the input layers, so you can
later explicitly associate them with the appropriate environment channel.
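The next-observation and action input paths are not shown. A sketch consistent with the layer names used in the connectLayers calls below is:
nextStatePath = featureInputLayer(prod(obsInfo.Dimension),Name="nextState");
actionPath = featureInputLayer(prod(actInfo.Dimension),Name="action");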
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC2")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu3")
fullyConnectedLayer(1,Name="reward")];
net = layerGraph(nextStatePath);
net = addLayers(net,actionPath);
net = addLayers(net,commonPath);
net = connectLayers(net,"nextState","concat/in1");
net = connectLayers(net,"action","concat/in2");
plot(net)
net = dlnetwork(net);
summary(net);
Initialized: true
Inputs:
1 'nextState' 4 features
2 'action' 1 features
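The reward function approximator construction is not shown. A sketch, assuming an rlContinuousDeterministicRewardFunction object built from this network, might be:
rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,obsInfo,actInfo, ...
    ActionInputNames="action", ...
    NextObservationInputNames="nextState");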
Using this reward function object, you can predict the next reward value based on the current action
and next observation. For example, predict the reward for a random action and next observation.
Since, for this example, only the action and the next observation influence the reward, use an empty
cell array for the current observation.
act = rand(actInfo.Dimension);
nxtobs = rand(obsInfo.Dimension);
reward = predict(rwdFcnAppx,{}, {act}, {nxtobs})
reward = single
0.1034
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the is-done function, use a deep neural network. The network has one input channel
for the next observations. The single output channel is for the predicted termination signal.
commonPath = [featureInputLayer(prod(obsInfo.Dimension),Name="nextState")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(2,Name="isdone0")
softmaxLayer(Name="isdone")];
net = layerGraph(commonPath);
plot(net)
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net);
Initialized: true
Inputs:
1 'nextState' 4 features
isDoneFcnAppx = rlIsDoneFunction(...
net,obsInfo,actInfo,...
NextObservationInputNames="nextState");
Using this is-done function approximator object, you can predict the termination signal based on the
next observation. For example, predict the termination signal for a random next observation. Since
for this example the termination signal only depends on the next observation, use empty cell arrays
for the current action and observation inputs.
nxtobs = rand(obsInfo.Dimension);
predIsDone = predict(isDoneFcnAppx,{},{},{nxtobs})
predIsDone = 0
To also return the probabilities of both termination statuses, use evaluate.
predIsDoneProb = evaluate(isDoneFcnAppx,{nxtobs});
predIsDoneProb{1}
0.5405
0.4595
The first number is the probability of obtaining a 0 (no termination predicted), the second one is the
probability of obtaining a 1 (termination predicted).
Input Arguments
tsnFcnAppx — Environment transition function approximator object
rlContinuousDeterministicTransitionFunction object |
rlContinuousGaussianTransitionFunction object
Environment transition function approximator object, specified as one of the following:
• rlContinuousDeterministicTransitionFunction object
• rlContinuousGaussianTransitionFunction object
rwdFcnAppx — Environment reward function approximator object
rlContinuousDeterministicRewardFunction object | rlContinuousGaussianRewardFunction object | function handle
Environment reward function approximator object, specified as one of the following:
• rlContinuousDeterministicRewardFunction object
• rlContinuousGaussianRewardFunction object
• Function handle object. For more information about function handle objects, see “What Is a
Function Handle?”.
obs — Observations
cell array
Observations, specified as a cell array with as many elements as there are observation input
channels. Each element of obs contains an array of observations for a single observation input
channel.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
act — Action
single-element cell array
Action, specified as a single-element cell array that contains an array of action values.
For more information on input and output formats for recurrent neural networks, see the Algorithms
section of lstmLayer.
nextObs — Next observations
cell array
Next observations, that is, the observation following the action act from the observation obs,
specified as a cell array of the same dimension as obs.
Output Arguments
predNextObs — Predicted next observation
cell array
Predicted next observation, that is, the observation predicted by the transition function approximator
tsnFcnAppx given the current observation obs and the action act, returned as a cell array of the
same dimension as obs.
predReward — Predicted reward
single
Predicted reward, that is, the reward predicted by the reward function approximator rwdFcnAppx
given the current observation obs, the action act, and the following observation nextObs, returned
as a single.
predIsDone — Predicted episode termination status
double
Predicted is-done episode status, that is, the episode termination status predicted by the is-done
function approximator idnFcnAppx given the current observation obs, the action act, and the
following observation nextObs, returned as a double.
Version History
Introduced in R2022a
See Also
Objects
rlNeuralNetworkEnvironment | rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | evaluate | accelerate | gradient
Topics
“Model-Based Policy Optimization Agents”
reset
Package: rl.policy
Syntax
initialObs = reset(env)
reset(agent)
agent = reset(agent)
resetPolicy = reset(policy)
reset(buffer)
Description
initialObs = reset(env) resets the specified MATLAB environment to an initial state and
returns the resulting initial observation value.
Do not use reset for Simulink environments, which are implicitly reset when running a new
simulation. Instead, customize the reset behavior using the ResetFcn property of the environment.
reset(agent) resets the specified agent. Resetting a built-in agent performs the following actions,
if applicable.
• Empty the experience buffer.
• Set the recurrent neural network states of the actor and critic networks to zero.
resetPolicy = reset(policy) returns the policy object resetPolicy in which any recurrent
neural network states are set to zero and any noise model states are set to their initial conditions.
This syntax has no effect if the policy object does not use a recurrent neural network and does not
have a noise model with state.
reset(buffer) resets the specified replay memory buffer by removing all the experiences.
Examples
Reset Environment
Create a reinforcement learning environment. For this example, create a cart-pole environment with a
continuous action space.
env = rlPredefinedEnv("CartPole-Continuous");
initialObs = reset(env)
initialObs = 4×1
0
0
0.0315
0
Reset Agent
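The observation and action specifications used by this example are not shown. As an assumption, specifications with finite limits (the dimensions and limit values below are illustrative) keep the limit-scaled random experiences used later in this example bounded:
obsInfo = rlNumericSpec([4 1],LowerLimit=-10,UpperLimit=10);   % assumed observation specification
actInfo = rlNumericSpec([1 1],LowerLimit=-1,UpperLimit=1);     % assumed continuous action specification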
initOptions = rlAgentInitializationOptions(UseRNN=true);
agent = rlDDPGAgent(obsInfo,actInfo,initOptions);
agent = reset(agent);
Reset Experience Buffer
Create a replay memory experience buffer.
buffer = rlReplayMemory(obsInfo,actInfo,10000);
Add experiences to the buffer. For this example, add 20 random experiences.
for i = 1:20
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(4,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(1,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(4,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(20).IsDone = 1;
append(buffer,expBatch);
reset(buffer)
Reset Policy
To approximate the Q-value function within the critic, use a deep neural network. Create each
network path as an array of layer objects.
% Create Paths
obsPath = [featureInputLayer(4)
fullyConnectedLayer(1,Name="obsout")];
actPath = [featureInputLayer(1)
fullyConnectedLayer(1,Name="actout")];
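% comPath (referenced below) is not shown here; a minimal definition
% consistent with the 9 learnable parameters reported below is:
comPath = [additionLayer(2,Name="add")
           fullyConnectedLayer(1)];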
% Add Layers
net = layerGraph;
net = addLayers(net,obsPath);
net = addLayers(net,actPath);
net = addLayers(net,comPath);
net = connectLayers(net,"obsout","add/in1");
net = connectLayers(net,"actout","add/in2");
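% Convert to a dlnetwork object and display a summary (assumed step that
% produces the output shown below).
net = dlnetwork(net);
summary(net)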
Initialized: true
Number of learnables: 9
Inputs:
1 'input' 4 features
2 'input_1' 1 features
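The specification objects and the critic and policy construction are not shown. A sketch consistent with the network above and with the rlEpsilonGreedyPolicy display below (the discrete action values are assumptions) might be:
obsInfo = rlNumericSpec([4 1]);       % four observation features
actInfo = rlFiniteSetSpec([-1 1]);    % scalar discrete action (assumed set)
critic = rlQValueFunction(net,obsInfo,actInfo, ...
    ObservationInputNames="input",ActionInputNames="input_1");
policy = rlEpsilonGreedyPolicy(critic)
You can then reset the policy with policy = reset(policy).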
policy =
rlEpsilonGreedyPolicy with properties:
Input Arguments
env — Reinforcement learning environment
environment object | ...
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• rlMBPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.
Note agent is a handle object, so it is reset whether it is returned as an output argument or not. For
more information about handle objects, see “Handle Object Behavior”.
• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy
Output Arguments
initialObs — Initial environment observation
array | cell array
• Array with dimensions matching the observation specification for an environment with a single
observation channel.
• Cell array with length equal to the number of observation channels for an environment with
multiple observation channels. Each element of the cell array contains an array with dimensions
matching the corresponding element of the environment observation specifications.
resetPolicy — Reset policy
policy object
Reset policy, returned as a policy object of the same type as the input policy, but with its recurrent
neural network states set to zero and any noise model states set to their initial conditions.
Reset agent, returned as an agent object. Note that agent is a handle object. Therefore, if it contains
any recurrent neural network, its state is reset whether agent is returned as an output argument or
not. For more information about handle objects, see “Handle Object Behavior”.
Version History
Introduced in R2022a
See Also
runEpisode | setup | cleanup
resize
Package: rl.replay
Syntax
resize(buffer,maxLength)
Description
resize(buffer,maxLength) resizes experience buffer buffer to have a maximum length of
maxLength.
• If maxLength is greater than or equal to the number of experiences stored in the buffer, then
buffer retains its stored experiences.
• If maxLength is less than the number of experiences stored in the buffer, then buffer retains
only the maxLength most recent experiences.
Examples
Create an environment for training the agent. For this example, load a predefined environment.
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
agent = rlDQNAgent(obsInfo,actInfo);
By default, the agent uses an experience buffer with a maximum size of 10,000.
agent.ExperienceBuffer
ans =
rlReplayMemory with properties:
MaxLength: 10000
Length: 0
Increase the maximum length of the experience buffer to 20,000.
resize(agent.ExperienceBuffer,20000)
agent.ExperienceBuffer
ans =
rlReplayMemory with properties:
MaxLength: 20000
Length: 0
Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object
Experience buffer, specified as an rlReplayMemory or rlPrioritizedReplayMemory object.
Version History
Introduced in R2022b
See Also
rlReplayMemory | rlPrioritizedReplayMemory
rlCreateEnvTemplate
Create custom reinforcement learning environment template
Syntax
rlCreateEnvTemplate(className)
Description
rlCreateEnvTemplate(className) creates and opens a MATLAB script that contains a template
class representing a reinforcement learning environment. The template class contains an
implementation of a simple cart-pole balancing environment. To define your custom environment,
modify this template class. For more information, see “Create Custom MATLAB Environment from
Template”.
Examples
This example shows how to create and open a template file for a reinforcement learning environment.
rlCreateEnvTemplate("myEnvClass")
This function opens a MATLAB® script that contains the class. By default, this template class
describes a simple cart-pole environment.
Input Arguments
className — Name of environment class
string | character vector
Name of environment class, specified as a string or character vector. This name defines the name of
the class and the name of the MATLAB script.
Version History
Introduced in R2019a
See Also
Topics
“Create MATLAB Reinforcement Learning Environments”
rlDataLogger
Creates either a file logger object or a monitor logger object to log training data
Syntax
fileLgr = rlDataLogger()
monLgr = rlDataLogger(tpm)
Description
fileLgr = rlDataLogger() creates the FileLogger object fileLgr for logging training data to
disk.
monLgr = rlDataLogger(tpm) creates the MonitorLogger object monLgr for logging training
data to the TrainingProgressMonitor object tpm, and its associated window.
Examples
This example shows how to log data to disk when using train.
logger = rlDataLogger();
logger.LoggingOptions.LoggingDirectory = "myDataLog";
Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object. For a related example, see “Log
Training Data To Disk”.
logger.EpisodeFinishedFcn = @myEpisodeFinishedFcn;
logger.AgentStepFinishedFcn = @myAgentStepFinishedFcn;
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;
To train the agent, you can now call train, passing logger as an argument such as in the following
command.
trainResult = train(agent, env, trainOpts, Logger=logger);
While the training progresses, data will be logged to the specified directory, according to the rule
specified in the FileNameRule property of logger.LoggingOptions.
logger.LoggingOptions.FileNameRule
ans =
"loggedData<id>"
This example shows how to log and visualize data to the window of a trainingProgressMonitor
object when using train.
Create a trainingProgressMonitor object. Creating the object also opens a window associated
with the object.
monitor = trainingProgressMonitor();
logger = rlDataLogger(monitor);
Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object.
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;
To train the agent, you can now call train, passing logger as an argument such as in the following
command.
trainResult = train(agent, env, trainOpts, Logger=logger);
While the training progresses, data will be logged to the training monitor object, and visualized in the
associated window.
Note that only scalar data can be logged with a monitor logger object.
Define a logging function that logs data periodically at the completion of the learning subroutine.
function dataToLog = myAgentLearnFinishedFcn(data)
if mod(data.AgentLearnCount, 2) == 0
dataToLog.ActorLoss = data.ActorLoss;
dataToLog.CriticLoss = data.CriticLoss;
else
dataToLog = [];
end
end
This example shows how to log data to disk when training an agent using a custom training loop.
Create a FileLogger object using rlDataLogger.
flgr = rlDataLogger();
Set up the logger object. This operation initializes the object, performing setup tasks such as, for
example, creating the directory to save the data files.
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10
    % Store random data in the logger memory under two context names
    store(flgr,"Context1",rand,iter);
    store(flgr,"Context2",rand(2,1),iter);
    % Write the data stored for this iteration to a MAT file
    write(flgr);
end
Clean up the logger object. This operation performs clean up tasks like for example writing to file any
data still in memory.
cleanup(flgr);
Input Arguments
tpm — Training progress monitor object
trainingProgressMonitor object
Output Arguments
fileLgr — File logger object
FileLogger object
Limitations
• Only scalar data is supported when logging data with a MonitorLogger object. The structure
returned by the callback functions must contain fields with scalar data.
• Resuming of training from a previous training result is not supported when logging data with a
MonitorLogger object.
• Logging data using the AgentStepFinishedFcn callback is not supported when training agents
in parallel with the train function.
Version History
Introduced in R2022b
See Also
Functions
train
Objects
FileLogger | MonitorLogger | trainingProgressMonitor
Topics
“Log Training Data To Disk”
“Monitor Custom Training Loop Progress”
rlOptimizer
Creates an optimizer object for actors and critics
Syntax
algobj = rlOptimizer
algobj = rlOptimizer(algOptions)
Description
Create an optimizer object that updates the learnable parameters of an actor or critic in a custom
training loop.
algobj = rlOptimizer creates a default optimizer object. You can modify the object properties
using dot notation.
algobj = rlOptimizer(algOptions) creates an optimizer object with the type and properties
specified by the optimizer options object algOptions.
Examples
Use rlOptimizer to create a default optimizer algorithm object to use for the training of an actor or
critic in a custom training loop.
myAlg = rlOptimizer
myAlg =
rlADAMOptimizer with properties:
GradientDecayFactor: 0.9000
SquaredGradientDecayFactor: 0.9990
Epsilon: 1.0000e-08
LearnRate: 0.0100
L2RegularizationFactor: 1.0000e-04
GradientThreshold: Inf
GradientThresholdMethod: "l2norm"
By default, the function returns an rlADAMOptimizer object with default options. You can use dot
notation to change some parameters.
myAlg.LearnRate = 0.1;
You can now create a structure and set its CriticOptimizer or ActorOptimizer field to myAlg.
When you call runEpisode, pass the structure as an input parameter. The runEpisode function can
then use the update method of myAlg to update the learnable parameters of your actor or critic.
Use rlOptimizer to create an optimizer algorithm object to use for the training of an actor or critic
in a custom training loop. Specify the optimizer option set myOptions as an input argument.
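The option set myOptions is not shown. A sketch consistent with the rlRMSPropOptimizer display below (the learning rate is taken from that display) is:
myOptions = rlOptimizerOptions( ...
    Algorithm="rmsprop", ...
    LearnRate=0.2);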
myAlg=rlOptimizer(myOptions)
myAlg =
rlRMSPropOptimizer with properties:
SquaredGradientDecayFactor: 0.9990
Epsilon: 1.0000e-08
LearnRate: 0.2000
L2RegularizationFactor: 1.0000e-04
GradientThreshold: Inf
GradientThresholdMethod: "l2norm"
The function returns an rlRMSPropOptimizer object configured with the specified options. You can use dot notation
to change some parameters.
myAlg.GradientThreshold = 2;
You can now create a structure and set its CriticOptimizer or ActorOptimizer field to myAlg.
When you call runEpisode, pass the structure as an input parameter. The runEpisode function can
then use the update method of myAlg to update the learnable parameters of your actor or critic.
Input Arguments
algOptions — Algorithm options object
default Adam option set (default) | rlOptimizerOptions object
Output Arguments
algobj — Algorithm optimizer object
rlADAMOptimizer object | rlSGDMOptimizer object | rlRMSPropOptimizer object
Version History
Introduced in R2022a
See Also
Functions
rlOptimizerOptions
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlPredefinedEnv
Create a predefined reinforcement learning environment
Syntax
env = rlPredefinedEnv(keyword)
Description
env = rlPredefinedEnv(keyword) takes a predefined keyword keyword representing the
environment name to create a MATLAB or Simulink reinforcement learning environment env. The
environment env models the dynamics with which the agent interacts, generating rewards and
observations in response to agent actions.
Examples
Use the predefined 'BasicGridWorld' keyword to create a basic grid world reinforcement learning
environment.
env = rlPredefinedEnv('BasicGridWorld')
env =
rlMDPEnv with properties:
Use the 'DoubleIntegrator-Continuous' keyword to create a continuous double integrator
environment.
env = rlPredefinedEnv('DoubleIntegrator-Continuous')
env =
DoubleIntegratorContinuousAction with properties:
Gain: 1
Ts: 0.1000
MaxDistance: 5
GoalThreshold: 0.0100
Q: [2x2 double]
R: 0.0100
MaxForce: Inf
State: [2x1 double]
You can visualize the environment using the plot function and interact with it using the reset and
step functions.
plot(env)
observation = reset(env)
observation = 2×1
4
0
[observation,reward,isDone] = step(env,16)
observation = 2×1
4.0800
1.6000
reward = -16.5559
isDone = logical
0
Use a Simulink keyword, such as 'SimplePendulumModel-Discrete', to create an environment
based on a Simulink model.
env = rlPredefinedEnv('SimplePendulumModel-Discrete')
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
Input Arguments
keyword — Predefined keyword representing the environment name
'BasicGridWorld' | 'CartPole-Discrete' | 'DoubleIntegrator-Continuous' |
'SimplePendulumWithImage-Discrete' | 'SimplePendulumModel-Discrete' |
'SimplePendulumModel-Continuous' | 'CartPoleSimscapeModel-Continuous' | ...
Predefined keyword representing the environment name, specified as one of the following:
MATLAB Environment
• 'BasicGridWorld'
• 'CartPole-Discrete'
• 'CartPole-Continuous'
• 'DoubleIntegrator-Discrete'
• 'DoubleIntegrator-Continuous'
• 'SimplePendulumWithImage-Discrete'
• 'SimplePendulumWithImage-Continuous'
• 'WaterFallGridWorld-Stochastic'
• 'WaterFallGridWorld-Deterministic'
Simulink Environment
• 'SimplePendulumModel-Discrete'
• 'SimplePendulumModel-Continuous'
• 'CartPoleSimscapeModel-Discrete'
• 'CartPoleSimscapeModel-Continuous'
Output Arguments
env — MATLAB or Simulink environment object
rlMDPEnv object | CartPoleDiscreteAction object | CartPoleContinuousAction object |
DoubleIntegratorDiscreteAction object | DoubleIntegratorContinuousAction object |
SimplePendlumWithImageDiscreteAction object |
SimplePendlumWithImageContinuousAction object | SimulinkEnvWithAgent object
• rlMDPEnv object, when you use one of the following keywords.
  • 'BasicGridWorld'
  • 'WaterFallGridWorld-Stochastic'
  • 'WaterFallGridWorld-Deterministic'
• CartPoleDiscreteAction object, when you use the 'CartPole-Discrete' keyword.
• CartPoleContinuousAction object, when you use the 'CartPole-Continuous' keyword.
• DoubleIntegratorDiscreteAction object, when you use the 'DoubleIntegrator-
Discrete' keyword.
• SimulinkEnvWithAgent object, when you use one of the following keywords.
  • 'SimplePendulumModel-Discrete'
  • 'SimplePendulumModel-Continuous'
  • 'CartPoleSimscapeModel-Discrete'
  • 'CartPoleSimscapeModel-Continuous'
Version History
Introduced in R2019a
See Also
Topics
“Create MATLAB Reinforcement Learning Environments”
“Create Simulink Reinforcement Learning Environments”
“Load Predefined Control System Environments”
“Load Predefined Simulink Environments”
rlRepresentation
(Not recommended) Model representation for reinforcement learning agents
Syntax
rep = rlRepresentation(net,obsInfo,'Observation',obsNames)
rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',
actNames)
tableCritic = rlRepresentation(tab)
critic = rlRepresentation(basisFcn,W0,obsInfo)
critic = rlRepresentation(basisFcn,W0,oaInfo)
actor = rlRepresentation(basisFcn,W0,obsInfo,actInfo)
Description
Use rlRepresentation to create a function approximator representation for the actor or critic of a
reinforcement learning agent. To do so, you specify the observation and action signals for the training
environment and options that affect the training of an agent that uses the representation. For more
information on creating representations, see “Create Policies and Value Functions”.
rep = rlRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',
actNames) creates a representation with action signals specified by the names actNames and
specification actInfo. Use this syntax to create a representation for any actor, or for a critic that
takes both observation and action as input, such as a critic for an rlDQNAgent or rlDDPGAgent
agent.
rep = rlRepresentation(net,obsInfo,'Observation',obsNames) creates a representation with
observation signals specified by the names obsNames and specification obsInfo. Use this syntax to
create a representation for a critic that does not require action inputs, such as a critic for an
rlACAgent or rlPGAgent agent.
rep = rlRepresentation( ___ ,repOpts) creates a representation using additional options that
specify learning parameters for the representation when you train an agent. Available options include
the optimizer used for training and the learning rate. Use rlRepresentationOptions to create the
options set repOpts. You can use this syntax with any of the previous input-argument combinations.
Examples
Create an actor representation and a critic representation that you can use to define a reinforcement
learning agent such as an Actor Critic (AC) agent.
For this example, create actor and critic representations for an agent that can be trained against the
cart-pole environment described in “Train AC Agent to Balance Cart-Pole System”. First, create the
environment. Then, extract the observation and action specifications from the environment. You need
these specifications to define the agent and critic representations.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For a state-value-function critic such as those used for AC or PG agents, the inputs are the
observations and the output should be a scalar value, the state value. For this example, create the
critic representation using a deep neural network with one output, and with observation signals
corresponding to x,xdot,theta,thetadot as described in “Train AC Agent to Balance Cart-Pole
System”. You can obtain the number of observations from the obsInfo specification. Name the
network layer input 'observation'.
numObservation = obsInfo.Dimension(1);
criticNetwork = [
imageInputLayer([numObservation 1 1],'Normalization','none','Name','observation')
fullyConnectedLayer(1,'Name','CriticFC')];
Specify options for the critic representation using rlRepresentationOptions. These options
control parameters of critic network learning, when you train an agent that incorporates the critic
representation. For this example, set the learning rate to 0.05 and the gradient threshold to 1.
repOpts = rlRepresentationOptions('LearnRate',5e-2,'GradientThreshold',1);
Create the critic representation using the specified neural network and options. Also, specify the
action and observation information for the critic. Set the observation name to 'observation',
which is the name you used when you created the network input layer for criticNetwork.
critic = rlRepresentation(criticNetwork,obsInfo,'Observation',{'observation'},repOpts)
critic =
rlValueRepresentation with properties:
Similarly, create a network for the actor. An AC agent decides which action to take given observations
using an actor representation. For an actor, the inputs are the observations, and the output depends
on whether the action space is discrete or continuous. For the actor of this example, there are two
possible discrete actions, –10 or 10. Thus, to create the actor, use a deep neural network with the
same observation input as the critic, that can output these two values. You can obtain the number of
actions from the actInfo specification. Name the output 'action'.
numAction = numel(actInfo.Elements);
actorNetwork = [
imageInputLayer([4 1 1], 'Normalization','none','Name','observation')
fullyConnectedLayer(numAction,'Name','action')];
Create the actor representation using the observation name and specification and the action name
and specification. Use the same representation options.
actor = rlRepresentation(actorNetwork,obsInfo,actInfo,...
'Observation',{'observation'},'Action',{'action'},repOpts)
actor =
rlStochasticActorRepresentation with properties:
You can now use the actor and critic representations to create an AC agent.
agentOpts = rlACAgentOptions(...
'NumStepsToLookAhead',32,...
'DiscountFactor',0.99);
agent = rlACAgent(actor,critic,agentOpts)
agent =
rlACAgent with properties:
Create a Q table using the action and observation specifications from the environment.
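The environment itself is not shown; assuming, for example, the predefined basic grid world environment:
env = rlPredefinedEnv("BasicGridWorld");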
qTable = rlTable(getObservationInfo(env),getActionInfo(env));
tableRep = rlRepresentation(qTable);
This example shows how to create a linear basis function critic representation.
Assume that you have an environment, env. For this example, load the environment used in the
“Train Custom LQR Agent” example.
load myLQREnv.mat
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a custom basis function. In this case, use the quadratic basis function from “Train Custom
LQR Agent”.
Set the dimensions and parameters required for your basis function.
n = 6;
w0 = 0.1*ones(0.5*(n+1)*n,1);
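The critic construction itself is not shown above. Following the rlRepresentation(basisFcn,W0,oaInfo) syntax for a critic that takes both observation and action inputs, a sketch is:
critic = rlRepresentation(@(x,u) computeQuadraticBasis(x,u,n), ...
    w0,{obsInfo,actInfo});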
Function to compute the quadratic basis from “Train Custom LQR Agent”.
function B = computeQuadraticBasis(x,u,n)
z = cat(1,x,u);
idx = 1;
for r = 1:n
for c = r:n
if idx == 1
B = z(r)*z(c);
else
B = cat(1,B,z(r)*z(c));
end
idx = idx + 1;
end
end
end
Input Arguments
net — Deep neural network for actor or critic
array of Layer objects | layerGraph object | DAGNetwork object | SeriesNetwork object
Deep neural network for actor or critic, specified as one of the following:
• Array of Layer objects
• layerGraph object
• DAGNetwork object
• SeriesNetwork object
For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Observation names, specified as a cell array of character vectors. The observation names are the
network input layer names you specify when you create net. The names in obsNames must be in the
same order as the observation specifications in obsInfo.
Example: {'observation'}
Action name, specified as a single-element cell array that contains a character vector. The action
name is the network layer name you specify when you create net. For critic networks, this layer is
the first layer of the action input path. For actors, this layer is the last layer of the action output path.
Example: {'action'}
Action specification, specified as a reinforcement learning spec object. You can extract actInfo from
an existing environment using getActionInfo. Or, you can construct the spec manually using a
spec command such as rlFiniteSetSpec or rlNumericSpec. This specification defines such
information about the action as the dimensions and name of the action signal.
For linear basis function representations, the action signal must be a scalar, a column vector, or a
discrete action.
Value table or Q table for critic, specified as an rlTable object. The learnable parameters of a table
representation are the elements of tab.
Custom basis function, specified as a function handle to a user-defined function. For a linear basis
function representation, the output of the representation is f = W'B, where W is a weight array and B
is the column vector returned by the custom basis function. The learnable parameters of a linear
basis function representation are the elements of W.
When creating:
• A critic representation with observation inputs only, your basis function must have the following
signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here obs1 to obsN are observations in the same order and with the same data type and
dimensions as the observation specifications in obsInfo.
• A critic representation with observation and action inputs, your basis function must have the
following signature.
B = myBasisFunction(obs1,obs2,...,obsN,act)
Here obs1 to obsN are observations in the same order and with the same data type and
dimensions as the observation specifications in the first element of oaInfo, and act has the same
data type and dimensions as the action specification in the second element of oaInfo.
• An actor representation, your basis function must have the following signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are observations in the same order and with the same data type and
dimensions as the observation specifications in obsInfo. The data types and dimensions of the
action specification in actInfo affect the data type and dimensions of f.
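For example, the following sketch is a valid basis function for a critic with one observation channel and one action channel (the function name and the choice of features are hypothetical):
function B = myBasisFunction(obs,act)
% Return a column vector of features built from the observation and action.
B = [obs(:); act(:); obs(:).^2; act(:).^2];
end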
Initial value for linear basis function weight array, W, specified as one of the following:
Observation and action specifications for creating linear basis function critic representations,
specified as the cell array {obsInfo,actInfo}.
Representation options, specified as an option set that you create with rlRepresentationOptions.
Available options include the optimizer used for training and the learning rate. See
rlRepresentationOptions for details.
Output Arguments
rep — Deep neural network representation
rlLayerRepresentation object
Version History
Introduced in R2019a
The following table shows some typical uses of the rlRepresentation function to create neural
network-based critics and actors, and how to update your code with one of the new objects instead.
The following table shows some typical uses of the rlRepresentation objects to express table-
based critics with discrete observation and action spaces, and how to update your code with one of
the new objects instead.
The following table shows some typical uses of the rlRepresentation function to create critics and
actors which use a custom basis function, and how to update your code with one of the new objects
instead. In the recommended function calls, the first input argument is a two-element cell array containing
both the handle to the custom basis function and the initial weight vector or matrix.
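As a hedged sketch only (the exact recommended call depends on the representation type you need), the quadratic-basis critic defined above might be recreated by passing such a two-element cell array as the first argument:
critic = rlQValueRepresentation({@(x,u) computeQuadraticBasis(x,u,n),w0}, ...
    obsInfo,actInfo);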
See Also
Functions
rlValueRepresentation | rlQValueRepresentation |
rlDeterministicActorRepresentation | rlStochasticActorRepresentation |
rlRepresentationOptions | getActionInfo | getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlSimulinkEnv
Create reinforcement learning environment using dynamic model implemented in Simulink
Syntax
env = rlSimulinkEnv(mdl,agentBlocks)
env = rlSimulinkEnv(mdl,agentBlocks,obsInfo,actInfo)
env = rlSimulinkEnv( ___ ,'UseFastRestart',fastRestartToggle)
Description
The rlSimulinkEnv function creates a reinforcement learning environment object from a Simulink
model. The environment object acts as an interface so that when you call sim or train, these functions
in turn call the Simulink model to generate experiences for the agents.
Examples
Create a Simulink environment using the trained agent and corresponding Simulink model from the
“Create Simulink Environment and Train Agent” example.
Create an environment for the rlwatertank model, which contains an RL Agent block. Since the
agent used by the block is already in the workspace, you do not need to pass the observation and
action specifications to create the environment.
env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent')
env =
SimulinkEnvWithAgent with properties:
Model : rlwatertank
Validate the environment by performing a short simulation for two sample times.
validateEnvironment(env)
You can now train and simulate the agent within the environment by using train and sim,
respectively.
For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.
mdl = 'rlSimplePendulumModel';
open_system(mdl)
Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.
The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.
obsInfo = rlNumericSpec([3 1])
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: [0×0 string]
Description: [0×0 string]
Dimension: [3 1]
DataType: "double"
The action is a scalar expressing the torque and can be one of three possible values, -2 Nm, 0 Nm and
2 Nm.
actInfo = rlFiniteSetSpec([-2 0 2])
actInfo =
rlFiniteSetSpec with properties:
You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';
Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
env.ResetFcn = @(in) setVariable(in,'theta0',randn,'Workspace',mdl)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on
Create an environment for the Simulink model from the example “Train Multiple Agents to Perform
Collaborative Task”.
Create an environment for the rlCollaborativeTask model, which has two agent blocks. Since the
agents used by the two blocks (agentA and agentB) are already in the workspace, you do not need
to pass their observation and action specifications to create the environment.
env = rlSimulinkEnv( ...
'rlCollaborativeTask', ...
["rlCollaborativeTask/Agent A","rlCollaborativeTask/Agent B"])
env =
SimulinkEnvWithAgent with properties:
Model : rlCollaborativeTask
AgentBlock : [
rlCollaborativeTask/Agent A
rlCollaborativeTask/Agent B
]
ResetFcn : []
UseFastRestart : on
You can now simulate or train the agents within the environment using sim or train, respectively.
Input Arguments
mdl — Simulink model name
string | character vector
Simulink model name, specified as a string or character vector. The model must contain at least one
RL Agent block.
If mdl contains a single RL Agent block, specify agentBlocks as a string or character vector
containing the block path.
If mdl contains multiple RL Agent blocks, specify agentBlocks as a string array, where each
element contains the path of one agent block.
mdl can contain RL Agent blocks whose path is not included in agentBlocks. Such agent blocks
behave as part of the environment, selecting actions based on their current policies. When you call
sim or train, the experiences of these agents are not returned and their policies are not updated.
The agent blocks can be inside of a model reference. For more information on configuring an agent
block for reinforcement learning, see RL Agent.
If mdl contains multiple agent blocks, specify obsInfo as a cell array, where each cell contains a
specification object or array of specification objects for the corresponding block in agentBlocks.
If mdl contains multiple agent blocks, specify actInfo as a cell array, where each cell contains a
specification object for the corresponding block in agentBlocks.
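For example, for a hypothetical model myModel with two agent blocks and per-agent specification objects obsInfoA, obsInfoB, actInfoA, and actInfoB:
blks = ["myModel/Agent A","myModel/Agent B"];
env = rlSimulinkEnv("myModel",blks,{obsInfoA,obsInfoB},{actInfoA,actInfoB});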
Option to toggle fast restart, specified as either 'on' or 'off'. Fast restart allows you to perform
iterative simulations without compiling a model or terminating the simulation each time.
For more information on fast restart, see “How Fast Restart Improves Iterative Simulations”
(Simulink).
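For example, to create the pendulum environment from the previous example with fast restart disabled:
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo,'UseFastRestart','off');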
Output Arguments
env — Reinforcement learning environment
SimulinkEnvWithAgent object
For more information on reinforcement learning environments, see “Create Simulink Reinforcement
Learning Environments”.
Version History
Introduced in R2019a
See Also
Functions
train | sim | getObservationInfo | getActionInfo | rlNumericSpec | rlFiniteSetSpec
Blocks
RL Agent
Topics
“Train DDPG Agent to Control Double Integrator System”
“Train DDPG Agent to Swing Up and Balance Pendulum”
“Train DDPG Agent to Swing Up and Balance Cart-Pole System”
“Train DDPG Agent to Swing Up and Balance Pendulum with Bus Signal”
“Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation”
“Train DDPG Agent for Adaptive Cruise Control”
“How Fast Restart Improves Iterative Simulations” (Simulink)
runEpisode
Package: rl.env
Syntax
output = runEpisode(env,policy)
output = runEpisode(env,agent)
output = runEpisode( ___ ,Name=Value)
Description
output = runEpisode(env,policy) runs a single simulation of the environment env against the
policy policy.
output = runEpisode(env,agent) runs a single simulation of the environment env against the
agent agent.
output = runEpisode( ___ ,Name=Value) specifies nondefault simulation options using one or
more name-value arguments.
Examples
Create a reinforcement learning environment and extract its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the Q-value function within the critic, use a neural network. Create a network as an
array of layer objects.
net = [...
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(2)
softmaxLayer];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'input' 4 features
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);
act = getAction(actor,{rand(obsInfo.Dimension)})
policy = rlStochasticActorPolicy(actor);
buffer = rlReplayMemory(obsInfo,actInfo);
Set up the environment for running multiple simulations. For this example, configure the training to
log any errors rather than send them to the command window.
setup(env,StopOnError="off")
Simulate multiple episodes using the environment and policy. After each episode, append the
experiences to the buffer. For this example, run 100 episodes.
for i = 1:100
output = runEpisode(env,policy,MaxSteps=300);
append(buffer,output.AgentData.Experiences)
end
cleanup(env)
Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.
batch = sample(buffer,10);
You can then learn from the sampled experiences and update the policy and actor.
Input Arguments
env — Reinforcement learning environment
environment object | ...
policy — Policy
policy object | array of policy objects
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlEpsilonGreedyPolicy
• rlMaxQPolicy
• rlStochasticActorPolicy
If env is a Simulink environment configured for multi-agent training, specify policy as an array of
policy objects. The order of the policies in the array must match the agent order used to create env.
For more information on a policy object, at the MATLAB command line, type help followed by the
policy object name.
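For example, to display help for the epsilon-greedy policy object:
help rlEpsilonGreedyPolicy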
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• rlMBPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.
If env is a Simulink environment configured for multi-agent training, specify agent as an array of
agent objects. The order of the agents in the array must match the agent order used to create env.
Function for processing experiences and updating the policy or agent based on each experience as it
occurs during the simulation, specified as a function handle with the following signature.
[updatedPolicy,updatedData] = myFcn(experience,episodeInfo,policy,data)
Here:
• experience is a structure that contains a single experience. For more information on the
structure fields, see output.Experiences.
• episodeInfo contains data about the current episode and corresponds to
output.EpisodeInfo.
• policy is the policy or agent object being simulated.
• data contains experience processing data. For more information, see ProcessExperienceData.
• updatedPolicy is the updated policy or agent.
• updatedData is the updated experience processing data, which is used as the data input when
processing the next experience.
Experience processing data, specified as any MATLAB data, such as an array or structure. Use this
data to pass additional parameters or information to the experience processing function.
You can also update this data within the experience processing function to use different parameters
when processing the next experience. The data values that you specify when you call runEpisode
are used to process the first experience in the simulation.
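As a minimal sketch, assuming the corresponding name-value argument of runEpisode is named ProcessExperienceFcn (the data argument, ProcessExperienceData, is described above), a processing function that appends each experience to a replay memory carried in data might look like the following. The function and field names are hypothetical.
function [policy,data] = myProcessExpFcn(experience,episodeInfo,policy,data)
% Append the experience to a replay memory stored in data and count steps.
append(data.Buffer,experience);
data.StepCount = data.StepCount + 1;
end
You would then pass the function and its initial data when running an episode:
out = runEpisode(env,policy, ...
    ProcessExperienceFcn=@myProcessExpFcn, ...
    ProcessExperienceData=struct("Buffer",buffer,"StepCount",0));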
Option to clean up the environment after the simulation, specified as true or false. When
CleanupPostSim is true, runEpisode calls cleanup(env) when the simulation ends.
To run multiple episodes without cleaning up the environment, set CleanupPostSim to false. You
can then call cleanup(env) after running your simulations.
If env is a SimulinkEnvWithAgent object and the associated Simulink model is configured to use
fast restart, then the model remains in a compiled state between simulations when CleanUpPostSim
is false.
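For example, the following sketch runs several episodes against a configured environment and cleans up only once at the end (policy and buffer as in the earlier example):
setup(env)
for i = 1:20
    out = runEpisode(env,policy,MaxSteps=200,CleanupPostSim=false);
    append(buffer,out.AgentData.Experiences)
end
cleanup(env)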
Option to log experiences for each policy or agent, specified as true or false. When
LogExperiences is true, the experiences of the policy or agent are logged in
output.Experiences.
Output Arguments
output — Simulation output
structure | structure array | rl.env.Future object
Simulation output, returned as a structure with the fields AgentData and SimulationInfo. When
you simulate multiple policies or agents, output is returned as a structure array.
Field Description
Experiences Logged experience of the policy or agent, returned as a
structure array. Each experience contains the following fields.
• Observation — Observation
• Action — Action taken
• NextObservation — Resulting next observation
• Reward — Corresponding reward
• IsDone — Termination signal
Time Simulation times of experiences, returned as a vector.
EpisodeInfo Episode information, returned as a structure with the following
fields.
• For MATLAB environments — Structure containing the field SimulationError. This structure
contains any errors that occurred during simulation.
• For Simulink environments — Simulink.SimulationOutput object containing simulation data.
Recorded data includes any signals and states that the model is configured to log, simulation
metadata, and any errors that occurred.
Tips
• You can speed up episode simulation by using parallel computing. To do so, use the setup
function and set the UseParallel argument to true.
setup(env,UseParallel=true)
Version History
Introduced in R2022a
See Also
setup | cleanup | reset
Topics
“Custom Training Loop with Simulink Action Noise”
sample
Package: rl.replay
Syntax
experience = sample(buffer,batchSize)
experience = sample(buffer,batchSize,Name=Value)
[experience,Mask] = sample(buffer,batchSize,Name=Value)
Description
experience = sample(buffer,batchSize) returns a mini-batch of N experiences from the
replay memory buffer, where N is specified using batchSize.
Examples
Define observation specifications for the environment. For this example, assume that the environment
has a single observation channel with three continuous signals in specified ranges.
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with two continuous signals in specified ranges.
buffer = rlReplayMemory(obsInfo,actInfo,20000);
Append a single experience to the buffer using a structure. Each experience contains the following
elements: current observation, action, next observation, reward, and is-done.
For this example, create an experience with random observation, action, and reward values. Indicate
that this experience is not a terminal condition by setting the IsDone value to 0.
exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;
append(buffer,exp);
You can also append a batch of experiences to the experience buffer using a structure array. For this
example, append a sequence of 100 random experiences, with the final experience representing a
terminal condition.
for i = 1:100
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;
append(buffer,expBatch);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.
miniBatch = sample(buffer,50);
You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive
experiences with a discount factor of 0.95.
horizonSample = sample(buffer,1,...
NStepHorizon=10,...
DiscountFactor=0.95);
• Observation and Action are the observation and action from the first experience in the
horizon.
• NextObservation and IsDone are the next observation and termination signal from the final
experience in the horizon.
• Reward is the cumulative reward across the horizon using the specified discount factor.
You can also sample a sequence of consecutive experiences. In this case, the structure fields contain
arrays with values for all sampled experiences.
sequenceSample = sample(buffer,1,...
SequenceLength=20);
Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.
buffer = rlReplayMemory(obsInfo,actInfo,5000);
for i = 1:50
exp(i).Observation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Action = {actInfo.UpperLimit.*rand(2,1)};
exp(i).NextObservation = ...
{obsInfo(1).UpperLimit.*rand(2,1) randi(3)};
exp(i).Reward = 10*rand(1);
exp(i).IsDone = 0;
end
append(buffer,exp);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.
miniBatch = sample(buffer,10);
Input Arguments
buffer — Experience buffer
rlReplayMemory object | rlPrioritizedReplayMemory object
If batchSize is greater than the current length of the buffer, then sample returns no experiences.
Sequence length, specified as a positive integer. For each batch element, sample up to
SequenceLength consecutive experiences. If a sampled experience has a nonzero IsDone value,
stop the sequence at that experience.
N-step horizon length, specified as a positive integer. For each batch element, sample up to
NStepHorizon consecutive experiences. If a sampled experience has a nonzero IsDone value, stop
the horizon at that experience. Return the following experience information based on the sampled
horizon.
• Observation and Action values from the first experience in the horizon
• NextObservation and IsDone values from the final experience in the horizon.
• Cumulative reward across the horizon using the specified discount factor, DiscountFactor.
Discount factor, specified as a nonnegative scalar less than or equal to one. When you sample a
horizon of experiences (NStepHorizon > 1), sample returns the cumulative reward R computed as
follows.
R = Σ_{i=1}^{N} γ^i R_i
Here, γ is the specified discount factor, R_i is the reward of the ith experience in the sampled horizon, and N is the number of experiences in the horizon.
Output Arguments
experience — Experiences sampled from the buffer
structure
Experiences sampled from the buffer, returned as a structure with the following fields.
Observation — Observation
cell array
Observation, returned as a cell array with length equal to the number of observation specifications
specified when creating the buffer. Each element of Observation contains a DO-by-batchSize-by-
SequenceLength array, where DO is the dimension of the corresponding observation specification.
Agent action, returned as a cell array with length equal to the number of action specifications
specified when creating the buffer. Each element of Action contains a DA-by-batchSize-by-
SequenceLength array, where DA is the dimension of the corresponding action specification.
Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-
SequenceLength array.
Next observation reached by taking the specified action from the observation, returned as a cell array
with the same format as Observation.
Sequence padding mask, returned as a logical array with length equal to SequenceLength. When
the sampled sequence length is less than SequenceLength, the data returned in experience is
padded. Each element of Mask is true for a real experience and false for a padded experience.
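For example, to discard padded steps when summing rewards over a sampled sequence (a sketch using the buffer from the example above):
[seq,mask] = sample(buffer,1,SequenceLength=20);
realRewards = seq.Reward(:,:,mask);   % keep rewards from real (non-padded) steps
seqReturn = sum(realRewards,"all");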
Version History
Introduced in R2022a
See Also
rlReplayMemory | append
setActor
Package: rl.agent
Syntax
agent = setActor(agent,actor)
Description
agent = setActor(agent,actor) updates the reinforcement learning agent, agent, to use the
specified actor object, actor.
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
actor = getActor(agent);
params = getLearnableParameters(actor)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
Set the parameter values of the actor to the new modified values.
actor = setLearnableParameters(actor,modifiedParams);
setActor(agent,actor);
getLearnableParameters(getActor(agent))
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
plot(layerGraph(modifiedActorNet))
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the actor and critic objects.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
agent — Reinforcement learning agent
rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent | rlSACAgent | rlPPOAgent |
rlTRPOAgent
Reinforcement learning agent that contains an actor, specified as one of the following:
• rlPGAgent object
• rlDDPGAgent object
• rlTD3Agent object
• rlACAgent object
• rlSACAgent object
• rlPPOAgent object
• rlTRPOAgent object
Note: agent is a handle object. Therefore, setActor updates agent whether or not you return agent as an output argument. For more information about handle objects, see “Handle Object Behavior”.
actor — Actor
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
The inputs and outputs of the approximation model in the actor (typically, a neural network) must
match the observation and action specifications of the original agent.
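For example, one way to obtain a compatible actor is to rebuild it from the agent's own model and the environment specification objects (a sketch assuming a DDPG agent and that obsInfo and actInfo are in the workspace):
oldActor = getActor(agent);
newActor = rlContinuousDeterministicActor(getModel(oldActor),obsInfo,actInfo);
agent = setActor(agent,newActor);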
Output Arguments
agent — Updated reinforcement learning agent
rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent | rlSACAgent | rlPPOAgent |
rlTRPOAgent
Updated agent, returned as an agent object. Note that agent is a handle object, so setActor updates its actor whether or not you return agent as an output argument. For more information about handle objects, see “Handle Object Behavior”.
Version History
Introduced in R2019a
See Also
getActor | getCritic | setCritic | getModel | setModel | getLearnableParameters |
setLearnableParameters
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
setCritic
Package: rl.agent
Syntax
agent = setCritic(agent,critic)
Description
agent = setCritic(agent,critic) updates the reinforcement learning agent, agent, to use the
specified critic object, critic.
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
critic = getCritic(agent);
params = getLearnableParameters(critic)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
Set the parameter values of the critic to the new modified values.
critic = setLearnableParameters(critic,modifiedParams);
setCritic(agent,critic);
getLearnableParameters(getCritic(agent))
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
plot(layerGraph(modifiedActorNet))
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the actor and critic objects.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
agent — Reinforcement learning agent
rlQAgent | rlSARSAAgent | rlDQNAgent | rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent
| rlSACAgent | rlPPOAgent | rlTRPOAgent
Reinforcement learning agent that contains a critic, specified as one of the following:
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent (when using a critic to estimate a baseline value function)
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
Note: agent is a handle object. Therefore, setCritic updates agent whether or not you return agent as an output argument. For more information about handle objects, see “Handle Object Behavior”.
critic — Critic
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object | two-
element row vector of rlQValueFunction objects
Output Arguments
agent — Updated reinforcement learning agent
rlQAgent | rlSARSAAgent | rlDQNAgent | rlPGAgent | rlDDPGAgent | rlTD3Agent | rlACAgent
| rlSACAgent | rlPPOAgent | rlTRPOAgent
Updated agent, returned as an agent object. Note that agent is a handle object, so setCritic updates its critic whether or not you return agent as an output argument. For more information about handle objects, see “Handle Object Behavior”.
Version History
Introduced in R2019a
See Also
getActor | getCritic | setActor | getModel | setModel | getLearnableParameters |
setLearnableParameters
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
setLearnableParameters
Package: rl.policy
Syntax
setLearnableParameters(agent,pars)
agent = setLearnableParameters(agent,pars)
newFcn = setLearnableParameters(oldFcn,pars)
newPol = setLearnableParameters(oldPol,pars)
Description
Agent
Examples
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
critic = getCritic(agent);
params = getLearnableParameters(critic)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
Set the parameter values of the critic to the new modified values.
critic = setLearnableParameters(critic,modifiedParams);
setCritic(agent,critic);
getLearnableParameters(getCritic(agent))
Assume that you have an existing trained reinforcement learning agent. For this example, load the
trained agent from “Train DDPG Agent to Control Double Integrator System”.
load('DoubleIntegDDPG.mat','agent')
actor = getActor(agent);
params = getLearnableParameters(actor)
Modify the parameter values. For this example, simply multiply all of the parameters by 2.
Set the parameter values of the actor to the new modified values.
actor = setLearnableParameters(actor,modifiedParams);
setActor(agent,actor);
getLearnableParameters(getActor(agent))
Input Arguments
agent — Reinforcement learning agent
reinforcement learning agent object
• rlQAgent
• rlSARSAAgent
• rlDQNAgent
• rlPGAgent
• rlDDPGAgent
• rlTD3Agent
• rlACAgent
• rlSACAgent
• rlPPOAgent
• rlTRPOAgent
• Custom agent — For more information, see “Create Custom Reinforcement Learning Agents”.
Reinforcement learning policy, specified as one of the following policy objects.
• rlMaxQPolicy
• rlEpsilonGreedyPolicy
• rlDeterministicActorPolicy
• rlAdditiveNoisePolicy
• rlStochasticActorPolicy
Learnable parameter values, specified as a cell array. The parameters in
pars must be compatible with the structure and parameterization of the agent, function
approximator, or policy object passed as a first argument.
To obtain a cell array of learnable parameter values from an existing agent, function approximator, or
policy object , which you can then modify, use the getLearnableParameters function.
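For example, to scale every learnable parameter of a critic by 0.5 (a sketch using the critic from the earlier example):
vals = getLearnableParameters(critic);
vals = cellfun(@(w) 0.5*w,vals,"UniformOutput",false);  % modify each parameter array
critic = setLearnableParameters(critic,vals);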
Output Arguments
newFcn — New actor or critic object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
New actor or critic object, returned as a function object of the same type as oldFcn. Apart from the
learnable parameter values, newFcn is the same as oldFcn.
New reinforcement learning policy, returned as a policy object of the same type as oldPol. Apart
from the learnable parameter values, newPol is the same as oldPol.
Updated agent, returned as an agent object. Note that agent is a handle object, so setLearnableParameters updates its parameters whether or not you return agent as an output argument. For more information about handle objects, see “Handle Object Behavior”.
Version History
Introduced in R2019a
Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, setLearnableParameters now uses function approximator
objects instead.
See Also
getLearnableParameters | getActor | getCritic | setActor | setCritic
Topics
“Create Policies and Value Functions”
“Import Neural Network Models”
setModel
Package: rl.function
Syntax
newFcnAppx = setModel(oldFcnAppx,model)
Description
newFcnAppx = setModel(oldFcnAppx,model) returns a new actor or critic function object,
newFcnAppx, with the same configuration as the original function object, oldFcnAppx, and the
computational model specified in model.
Examples
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a PPO agent from the environment observation and action specifications. This agent uses
default deep neural networks for its actor and critic.
agent = rlPPOAgent(obsInfo,actInfo);
To modify the deep neural networks within a reinforcement learning agent, you must first extract the
actor and critic function approximators.
actor = getActor(agent);
critic = getCritic(agent);
Extract the deep neural networks from both the actor and critic function approximators.
actorNet = getModel(actor);
criticNet = getModel(critic);
The networks are dlnetwork objects. To view them using the plot function, you must convert them
to layerGraph objects.
plot(layerGraph(actorNet))
To validate a network, use analyzeNetwork. For example, validate the critic network.
analyzeNetwork(criticNet)
You can modify the actor and critic networks and save them back to the agent. To modify the
networks, you can use the Deep Network Designer app. To open the app for each network, use the
following commands.
deepNetworkDesigner(layerGraph(criticNet))
deepNetworkDesigner(layerGraph(actorNet))
In Deep Network Designer, modify the networks. For example, you can add additional layers to
your network. When you modify the networks, do not change the input and output layers of the
networks returned by getModel. For more information on building networks, see “Build Networks
with Deep Network Designer”.
To validate the modified network in Deep Network Designer, you must click on Analyze for
dlnetwork, under the Analysis section. To export the modified network structures to the MATLAB®
workspace, generate code for creating the new networks and run this code from the command line.
Do not use the exporting option in Deep Network Designer. For an example that shows how to
generate and run code, see “Create Agent Using Deep Network Designer and Train Using Image
Observations”.
For this example, the code for creating the modified actor and critic networks is in the
createModifiedNetworks helper script.
createModifiedNetworks
After exporting the networks, insert the networks into the actor and critic function approximators.
actor = setModel(actor,modifiedActorNet);
critic = setModel(critic,modifiedCriticNet);
Finally, insert the modified actor and critic function approximators into the actor and critic objects.
agent = setActor(agent,actor);
agent = setCritic(agent,critic);
Input Arguments
oldFcnAppx — Original actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
Computational model, specified as one of the following.
• Deep neural network defined as an array of Layer objects, a layerGraph object, a DAGNetwork
object, or a dlnetwork object. The input and output layers of model must have the same names
and dimensions as the network returned by getModel for the same function object. Here, the
output layer is the layer immediately before the output loss layer.
• rlTable object with the same dimensions as the table model defined in oldFcnAppx.
• 1-by-2 cell array that contains the function handle for a custom basis function and the basis
function parameters.
When specifying a new model, you must use the same type of model as the one already defined in oldFcnAppx.
Note For agents with more than one critic, such as TD3 and SAC agents, you must call setModel for
each critic individually, rather than calling setModel for the array returned by
getCritic.
critics = getCritic(myTD3Agent);
% Modify critic networks.
critics(1) = setModel(critics(1),criticNet1);
critics(2) = setModel(critics(2),criticNet2);
myTD3Agent = setCritic(myTD3Agent,critics);
Output Arguments
newFcnAppx — New actor or critic function object
rlValueFunction object | rlQValueFunction object | rlVectorQValueFunction object |
rlContinuousDeterministicActor object | rlDiscreteCategoricalActor object |
rlContinuousGaussianActor object
New actor or critic function object, returned as a function object of the same type as oldFcnAppx.
Apart from the new computational model, newFcnAppx is the same as oldFcnAppx.
Version History
Introduced in R2020b
Using representation objects to create actors and critics for reinforcement learning agents is no
longer recommended. Therefore, setModel now uses function approximator objects instead.
See Also
getActor | setActor | getCritic | setCritic | getModel
Topics
“Create Policies and Value Functions”
setup
Package: rl.env
Syntax
setup(env)
setup(env,Name=Value)
setup(lgr)
Description
When you define a custom training loop for reinforcement learning, you can simulate an agent or
policy against an environment using the runEpisode function. Use the setup function to configure
the environment for running simulations using multiple calls to runEpisode.
Also use setup to initialize a FileLogger or MonitorLogger object before logging data within a
custom training loop.
Environment Objects
setup(env) sets up the specified reinforcement learning environment for running multiple
simulations using runEpisode.
setup(lgr) sets up the specified data logger object. Setup tasks may include setting up a
visualization, or creating directories for logging to file.
Examples
Create a reinforcement learning environment and extract its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the Q-value function within the critic, use a neural network. Create a network as an
array of layer objects.
net = [...
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(2)
softmaxLayer];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'input' 4 features
Create a discrete categorical actor, a stochastic policy based on the actor, and a replay memory buffer in which to store the experiences.
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);
policy = rlStochasticActorPolicy(actor);
buffer = rlReplayMemory(obsInfo,actInfo);
Set up the environment for running multiple simulations. For this example, configure the training to
log any errors rather than send them to the command window.
setup(env,StopOnError="off")
Simulate multiple episodes using the environment and policy. After each episode, append the
experiences to the buffer. For this example, run 100 episodes.
for i = 1:100
output = runEpisode(env,policy,MaxSteps=300);
append(buffer,output.AgentData.Experiences)
end
Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.
batch = sample(buffer,10);
You can then learn from the sampled experiences and update the policy and actor.
This example shows how to log data to disk when training an agent using a custom training loop.
Create a file logger object.
flgr = rlDataLogger();
Set up the logger object. This operation initializes the object, performing setup tasks such as, for
example, creating the directory to save the data files.
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10
    % Store random numbers in memory, grouped in two context variables.
    store(flgr,"Context1",rand,iter);
    store(flgr,"Context2",rand,iter);
    % Write the data for this iteration to a MAT file.
    write(flgr);
end
Clean up the logger object. This operation performs cleanup tasks, such as writing to file any data still in memory.
cleanup(flgr);
Input Arguments
env — Reinforcement learning environment
rlFunctionEnv object | SimulinkEnvWithAgent object | rlNeuralNetworkEnvironment object
| rlMDPEnv object | ...
Option to stop an episode when an error occurs, specified as one of the following:
• "on" — Stop the episode when an error occurs and generate an error message in the MATLAB
command window.
• "off" — Log errors in the SimulationInfo output of runEpisode.
Option for using parallel simulations, specified as a logical value. Using parallel computing allows
the usage of multiple cores, processors, computer clusters, or cloud resources to speed up simulation.
When you set UseParallel to true, the output of a subsequent call to runEpisode is an
rl.env.Future object, which supports deferred evaluation of the simulation.
Function to run on each worker before running an episode, specified as a handle to a function
with no input arguments. Use this function to perform any preprocessing required before running an
episode.
Function to run on each worker when cleaning up the environment, specified as a handle to a
function with no input arguments. Use this function to clean up the workspace or perform other
processing after calling runEpisode.
Option to send model and workspace variables to parallel workers, specified as "on" or "off". When
the option is "on", the client sends variables used in models and defined in the base MATLAB
workspace to the workers.
Additional files to attach to the parallel pool before running an episode, specified as a string or string
array.
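Putting these options together, a parallel setup call might look like the following sketch; the option names SetupFcn, TransferBaseWorkspaceVariables, and AttachedFiles are assumed here, and the helper values are hypothetical.
setup(env, ...
    UseParallel=true, ...
    SetupFcn=@() rng(0,"twister"), ...          % hypothetical per-worker setup
    TransferBaseWorkspaceVariables="on", ...
    AttachedFiles="myHelperFunctions.m")        % hypothetical attached file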
Version History
Introduced in R2022a
See Also
Functions
runEpisode | cleanup | reset | store | write
Objects
rlFunctionEnv | rlMDPEnv | SimulinkEnvWithAgent | rlNeuralNetworkEnvironment |
FileLogger | MonitorLogger
Topics
“Custom Training Loop with Simulink Action Noise”
sim
Package: rl.env
Syntax
experience = sim(env,agents)
experience = sim(agents,env)
experience = sim( ___ ,simOpts)
Description
experience = sim(env,agents) simulates one or more reinforcement learning agents within an
environment, using default simulation options.
experience = sim( ___ ,simOpts) uses the simulation options object simOpts. Use simulation options to
specify parameters such as the number of steps per simulation or the number of simulations to run.
Use this syntax after any of the input arguments in the previous syntaxes.
Examples
Simulate a reinforcement learning environment with an agent configured for that environment. For
this example, load an environment and agent that are already configured. The environment is a
discrete cart-pole environment created with rlPredefinedEnv. The agent is a policy gradient
(rlPGAgent) agent. For more information about the environment and agent used in this example, see
“Train PG Agent to Balance Cart-Pole System”.
env =
CartPoleDiscreteAction with properties:
Gravity: 9.8000
MassCart: 1
MassPole: 0.1000
Length: 0.5000
MaxForce: 10
Ts: 0.0200
ThetaThresholdRadians: 0.2094
XThreshold: 2.4000
RewardForNotFalling: 1
PenaltyForFalling: -5
agent
agent =
rlPGAgent with properties:
Typically, you train the agent using train and simulate the environment to test the performance of
the trained agent. For this example, simulate the environment using the agent you loaded. Configure
simulation options, specifying that the simulation run for 100 steps.
simOpts = rlSimulationOptions('MaxSteps',100);
For the predefined cart-pole environment used in this example, you can use plot to generate a
visualization of the cart-pole system. When you simulate the environment, this plot updates
automatically so that you can watch the system evolve during the simulation.
plot(env)
experience = sim(env,agent,simOpts)
The output structure experience records the observations collected from the environment, the
action and reward, and other data collected during the simulation. Each field contains a timeseries
object or a structure of timeseries data objects. For instance, experience.Action is a
timeseries containing the action imposed on the cart-pole system by the agent at each step of the
simulation.
experience.Action
Simulate an environment created for the Simulink® model used in the example “Train Multiple
Agents to Perform Collaborative Task”, using the agents trained in that example.
Create an environment for the rlCollaborativeTask Simulink® model, which has two agent
blocks. Since the agents used by the two blocks (agentA and agentB) are already in the workspace,
you do not need to pass their observation and action specifications to create the environment.
env = rlSimulinkEnv( ...
'rlCollaborativeTask', ...
["rlCollaborativeTask/Agent A","rlCollaborativeTask/Agent B"]);
Load the parameters that are needed by the rlCollaborativeTask Simulink® model to run.
rlCollaborativeTaskParams
Simulate the agents against the environment, saving the experiences in xpr.
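Assuming the trained agents agentA and agentB are in the workspace, as in the referenced example, simulate them with sim:
xpr = sim(env,[agentA agentB]);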
subplot(2,1,1); plot(xpr(1).Action.forces)
subplot(2,1,2); plot(xpr(2).Action.forces)
Input Arguments
env — Environment
reinforcement learning environment object
Environment in which the agents act, specified as one of the following kinds of reinforcement
learning environment object:
• A custom Simulink environment you create using rlSimulinkEnv. This kind of environment
supports training multiple agents at the same time.
When env is a Simulink environment, calling sim compiles and simulates the model associated with
the environment.
agents — Agents
reinforcement learning agent object | array of agent objects
If env is a multi-agent environment created with rlSimulinkEnv, specify agents as an array. The
order of the agents in the array must match the agent order used to create env. Multi-agent
simulation is not supported for MATLAB environments.
For more information about how to create and configure agents for reinforcement learning, see
“Reinforcement Learning Agents”.
Output Arguments
experience — Simulation results
structure | structure array
Simulation results, returned as a structure or structure array. The number of rows in the array is
equal to the number of simulations specified by the NumSimulations option of
rlSimulationOptions. The number of columns in the array is the number of agents. The fields of
each experience structure are as follows.
Observation — Observations
structure
Observations collected from the environment, returned as a structure with fields corresponding to the
observations specified in the environment. Each field contains a timeseries of length N + 1, where
N is the number of simulation steps.
To obtain the current observation and the next observation for a given simulation step, use code such
as the following, assuming one of the fields of Observation is obs1.
Obs = getSamples(experience.Observation.obs1,1:N);
NextObs = getSamples(experience.Observation.obs1,2:N+1);
These values can be useful if you are writing your own training algorithm using sim to generate
experiences for training.
Action — Actions
structure
Actions computed by the agent, returned as a structure with fields corresponding to the action
signals specified in the environment. Each field contains a timeseries of length N, where N is the
number of simulation steps.
Reward — Rewards
timeseries
Reward at each step in the simulation, returned as a timeseries of length N, where N is the
number of simulation steps.
Flag indicating termination of the episode, returned as a timeseries of a scalar logical signal. This
flag is set at each step by the environment, according to conditions you specify for episode
termination when you configure the environment. When the environment sets this flag to 1,
simulation terminates.
• For MATLAB environments, a structure containing the field SimulationError. This structure
contains any errors that occurred during simulation.
• For Simulink environments, a Simulink.SimulationOutput object containing simulation data.
Recorded data includes any signals and states that the model is configured to log, simulation
metadata, and any errors that occurred.
Version History
Introduced in R2019a
See Also
train | rlSimulationOptions
Topics
“Train Reinforcement Learning Agents”
store
Package: rl.logging
Syntax
store(lgr,context,data,iter)
Description
store(lgr,context,data,iter) stores data into lgr internal memory, grouped in a variable
named context and associated with iteration iter.
Examples
This example shows how to log data to disk when training an agent using a custom training loop.
flgr = rlDataLogger();
Set up the logger object. This operation initializes the object, performing setup tasks such as, for
example, creating the directory to save the data files.
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10
    % Store random numbers in memory, grouped in two context variables.
    store(flgr,"Context1",rand,iter);
    store(flgr,"Context2",rand,iter);
    % Write the data for this iteration to a MAT file.
    write(flgr);
end
Clean up the logger object. This operation performs cleanup tasks, such as writing to file any data still in memory.
cleanup(flgr);
Input Arguments
lgr — Data logger object
FileLogger object | MonitorLogger object | ...
Name of the saved variable, specified as either a string or character array. All data associated with
iteration iter and the name context is vertically concatenated in a single MATLAB variable named
context. This variable is then written to the logger target (either a MAT file or a
trainingProgressMonitor object) when write is invoked for lgr.
Data to be saved, specified as any fundamental MATLAB datatype. Data associated with the same
iteration and the same context name is grouped in a single variable.
Iteration number, specified as a positive integer. When store is executed multiple times with the
same iteration number, data is appended to the memory entry for that iteration. This memory entry,
which can contain several context variables, is then written to the logger target as a single unit (for
example as a single MAT file) when write is invoked for lgr.
Version History
Introduced in R2022b
See Also
Functions
rlDataLogger | train | write | setup | cleanup
Objects
FileLogger | MonitorLogger
Topics
“Log Training Data To Disk”
“Function Handles”
“Handle Object Behavior”
train
Package: rl.agent
Syntax
trainStats = train(env,agents)
trainStats = train(agents,env)
trainStats = train(agents,env,prevTrainStats)
trainStats = train( ___ ,trainOpts)
Description
trainStats = train(env,agents) trains one or more reinforcement learning agents within a
specified environment, using default training options. Although agents is an input argument, after
each training episode, train updates the parameters of each agent specified in agents to maximize
their expected long-term reward from the environment. This is possible because each agent is a
handle object. When training terminates, agents reflects the state of each agent at the end of the
final training episode.
trainStats = train( ___ ,trainOpts) trains agents within env, using the training options
object trainOpts. Use training options to specify training parameters such as the criteria for
terminating training, when to save agents, the maximum number of episodes to train, and the
maximum number of steps per episode. Use this syntax after any of the input arguments in the
previous syntaxes.
Examples
Train the agent configured in the “Train PG Agent to Balance Cart-Pole System” example, within the
corresponding environment. The observation from the environment is a vector containing the position
and velocity of a cart, as well as the angular position and velocity of the pole. The action is a scalar
with two possible elements (a force of either -10 or 10 Newtons applied to a cart).
Load the file containing the environment and a PG agent already configured for it.
load RLTrainExample.mat
Specify some training parameters using rlTrainingOptions. These parameters include the
maximum number of episodes to train, the maximum steps per episode, and the conditions for
terminating training. For this example, use a maximum of 1000 episodes and 500 steps per episode.
Instruct the training to stop when the average reward over the previous five episodes reaches 500.
Create a default options set and use dot notation to change some of the parameter values.
trainOpts = rlTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 500;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 500;
trainOpts.ScoreAveragingWindowLength = 5;
During training, the train command can save candidate agents that give good results. Further
configure the training options to save an agent when the episode reward exceeds 500. Save the agent
to a folder called savedAgents.
trainOpts.SaveAgentCriteria = "EpisodeReward";
trainOpts.SaveAgentValue = 500;
trainOpts.SaveAgentDirectory = "savedAgents";
Finally, turn off the command-line display. Turn on the Reinforcement Learning Episode Manager so
you can observe the training progress visually.
trainOpts.Verbose = false;
trainOpts.Plots = "training-progress";
You are now ready to train the PG agent. For the predefined cart-pole environment used in this
example, you can use plot to generate a visualization of the cart-pole system.
plot(env)
When you run this example, both this visualization and the Reinforcement Learning Episode Manager
update with each training episode. Place them side by side on your screen to observe the progress,
and train the agent. (This computation can take 20 minutes or more.)
trainingInfo = train(agent,env,trainOpts);
Episode Manager shows that the training successfully reaches the termination condition of a reward
of 500 averaged over the previous five episodes. At each training episode, train updates agent with
the parameters learned in the previous episode. When training terminates, you can simulate the
environment with the trained agent to evaluate its performance. The environment plot updates during
simulation as it did during training.
simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
During training, train saves to disk any agents that meet the condition specified with trainOpts.SaveAgentCriteria and trainOpts.SaveAgentValue. To test the performance of any of those agents, you can load the data from the data files in the folder you specified using trainOpts.SaveAgentDirectory, and simulate the environment with that agent.
This example shows how to set up a multi-agent training session on a Simulink® environment. In the
example, you train two agents to collaboratively perform the task of moving an object.
The environment in this example is a frictionless two-dimensional surface containing elements represented by circles. A target object C is represented by the blue circle with a radius of 2 m, and robots A (red) and B (green) are represented by smaller circles with radii of 1 m each. The robots attempt to move object C outside a circular ring of radius 8 m by applying forces through collision.
All elements within the environment have mass and obey Newton's laws of motion. In addition,
contact forces between the elements and the environment boundaries are modeled as spring and
mass damper systems. The elements can move on the surface through the application of externally
applied forces in the X and Y directions. There is no motion in the third dimension and the total
energy of the system is conserved.
Set the random seed and create the set of parameters required for this example.
rng(10)
rlCollaborativeTaskParams
mdl = "rlCollaborativeTask";
open_system(mdl)
• The 2-dimensional space is bounded from –12 m to 12 m in both the X and Y directions.
• The contact spring stiffness and damping values are 100 N/m and 0.1 N/m/s, respectively.
• The agents share the same observations, which include the positions and velocities of A, B, and C, and the action values from the last time step.
• The simulation terminates when object C moves outside the circular ring.
• At each time step, the agents receive the following reward:
r_A = r_{global} + r_{local,A}
r_B = r_{global} + r_{local,B}
r_{global} = 0.001 d_C
r_{local,A} = -0.005 d_{AC} - 0.008 u_A^2
r_{local,B} = -0.005 d_{BC} - 0.008 u_B^2
Here, d_C is the distance of object C from the ring center, d_AC and d_BC are the distances between object C and robots A and B, respectively, and u_A and u_B are the action values of robots A and B.
Environment
To create a multi-agent environment, specify the block paths of the agents using a string array. Also,
specify the observation and action specification objects using cell arrays. The order of the
specification objects in the cell array must match the order specified in the block path array. When
agents are available in the MATLAB workspace at the time of environment creation, the observation
and action specification arrays are optional. For more information on creating multi-agent
environments, see rlSimulinkEnv.
Create the I/O specifications for the environment. In this example, the agents are homogeneous and
have the same I/O specifications.
% Number of observations
numObs = 16;
% Number of actions
numAct = 2;
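A minimal sketch of how the specifications and the multi-agent environment can be created from these counts follows; the action limits and the agent block paths are assumptions shown only for illustration.

% Observation and action specifications (action limits are assumptions)
obsInfo = rlNumericSpec([numObs 1]);
actInfo = rlNumericSpec([numAct 1],LowerLimit=-1,UpperLimit=1);

% Block paths of the two RL Agent blocks (paths are assumptions)
blks = mdl + ["/Agent A","/Agent B"];

% Create the multi-agent Simulink environment
env = rlSimulinkEnv(mdl,blks,{obsInfo,obsInfo},{actInfo,actInfo});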
Specify a reset function for the environment. The reset function resetRobots ensures that the
robots start from random initial positions at the beginning of each episode.
Agents
This example uses two Proximal Policy Optimization (PPO) agents with continuous action spaces. The
agents apply external forces on the robots that result in motion. To learn more about PPO agents, see
“Proximal Policy Optimization Agents”.
The agents collect experiences until the experience horizon (600 steps) is reached. After trajectory
completion, the agents learn from mini-batches of 300 experiences. An objective function clip factor
of 0.2 is used to improve training stability and a discount factor of 0.99 is used to encourage long-
term rewards.
agentOptions = rlPPOAgentOptions(...
"ExperienceHorizon",600,...
"ClipFactor",0.2,...
"EntropyLossWeight",0.01,...
"MiniBatchSize",300,...
"NumEpoch",4,...
"AdvantageEstimateMethod","gae",...
"GAEFactor",0.95,...
"SampleTime",Ts,...
"DiscountFactor",0.99);
agentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
agentOptions.CriticOptimizerOptions.LearnRate = 1e-4;
Create the agents using the default agent creation syntax. For more information see rlPPOAgent.
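A minimal sketch of this step is shown below; it assumes the obsInfo and actInfo specification objects and the agentOptions object defined above, and the variable names agentA and agentB match the training commands used later.

% Create two PPO agents with default networks built from the specifications
agentA = rlPPOAgent(obsInfo,actInfo,agentOptions);
agentB = rlPPOAgent(obsInfo,actInfo,agentOptions);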
Training
To train multiple agents, you can pass an array of agents to the train function. The order of agents
in the array must match the order of agent block paths specified during environment creation. Doing
so ensures that the agent objects are linked to their appropriate I/O interfaces in the environment.
You can train multiple agents in a decentralized or centralized manner. In decentralized training,
agents collect their own set of experiences during the episodes and learn independently from those
experiences. In centralized training, the agents share the collected experiences and learn from them
together. The actor and critic functions are synchronized between the agents after trajectory
completion.
To configure a multi-agent training, you can create agent groups and specify a learning strategy for
each group through the rlMultiAgentTrainingOptions object. Each agent group may contain
unique agent indices, and the learning strategy can be "centralized" or "decentralized". For example, you can use the command shown below to configure training for three agent groups with different learning strategies, where the agents with indices [1,2] and [3,4] learn in a centralized manner while agent 5 learns in a decentralized manner.
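A configuration consistent with that description might look like the following sketch; the exact group composition is illustrative.

trainOpts = rlMultiAgentTrainingOptions(...
    "AgentGroups",{[1,2],[3,4],5},...
    "LearningStrategy",["centralized","centralized","decentralized"]);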
You can perform decentralized or centralized training by running one of the following sections using
the Run Section button.
1. Decentralized Training
• Automatically assign agent groups using the AgentGroups=auto option. This allocates each
agent in a separate group.
• Specify the "decentralized" learning strategy.
• Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.
• Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or
more.
trainOpts = rlMultiAgentTrainingOptions(...
"AgentGroups","auto",...
"LearningStrategy","decentralized",...
"MaxEpisodes",1000,...
"MaxStepsPerEpisode",600,...
"ScoreAveragingWindowLength",30,...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",-10);
Train the agents using the train function. Training can take several hours to complete depending on
the available computational power. To save time, load the MAT file decentralizedAgents.mat
which contains a set of pretrained agents. To train the agents yourself, set doTraining to true.
doTraining = false;
if doTraining
decentralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
load("decentralizedAgents.mat");
end
The following figure shows a snapshot of decentralized training progress. You can expect different
results due to randomness in the training process.
2. Centralized Training
• Allocate both agents (with indices 1 and 2) in a single group. You can do this by specifying the
agent indices in the "AgentGroups" option.
• Specify the "centralized" learning strategy.
• Run the training for at most 1000 episodes, with each episode lasting at most 600 time steps.
• Stop the training of an agent when its average reward over 30 consecutive episodes is –10 or
more.
trainOpts = rlMultiAgentTrainingOptions(...
"AgentGroups",{[1,2]},...
"LearningStrategy","centralized",...
"MaxEpisodes",1000,...
"MaxStepsPerEpisode",600,...
"ScoreAveragingWindowLength",30,...
"StopTrainingCriteria","AverageReward",...
"StopTrainingValue",-10);
Train the agents using the train function. Training can take several hours to complete depending on
the available computational power. To save time, load the MAT file centralizedAgents.mat which
contains a set of pretrained agents. To train the agents yourself, set doTraining to true.
doTraining = false;
if doTraining
centralizedTrainResults = train([agentA,agentB],env,trainOpts);
else
load("centralizedAgents.mat");
end
The following figure shows a snapshot of centralized training progress. You can expect different
results due to randomness in the training process.
Simulation
Once the training is finished, simulate the trained agents with the environment.
simOptions = rlSimulationOptions("MaxSteps",300);
exp = sim(env,[agentA agentB],simOptions);
This example shows how to resume training of a Q-learning agent using existing training data. For more information on these agents, see “Q-Learning Agents” and “SARSA Agents”.
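A minimal sketch of creating such a grid world environment, assuming the predefined basic grid world (an assumption for illustration):

% Create the predefined grid world environment (assumed environment)
env = rlPredefinedEnv("BasicGridWorld");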
To specify that the initial state of the agent is selected at random from the available states at the beginning of each episode, create a reset function that returns a random initial state number.
x0 = [1:12 15:17 19:22 24];
env.ResetFcn = @() x0(randi(numel(x0)));
To create a Q-learning agent, first create a Q table using the observation and action specifications from the grid world environment, then use the table to create a Q-value function critic.
qTable = rlTable(getObservationInfo(env),getActionInfo(env));
qVf = rlQValueFunction(qTable,getObservationInfo(env),getActionInfo(env));
Next, create a Q-learning agent using this critic and configure the epsilon-greedy exploration, the critic learning rate, and the discount factor. For more information on creating Q-learning agents, see rlQAgent and rlQAgentOptions.
agentOpts = rlQAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon = 0.2;
agentOpts.CriticOptimizerOptions.LearnRate = 0.2;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-3;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 1e-3;
agentOpts.DiscountFactor = 1;
qAgent = rlQAgent(qVf,agentOpts);
To train the agent, first specify the training options. For more information, see rlTrainingOptions.
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 200;
trainOpts.MaxEpisodes = 1e6;
trainOpts.Plots = "none";
trainOpts.Verbose = false;
trainOpts.StopTrainingCriteria = "EpisodeCount";
trainOpts.StopTrainingValue = 100;
trainOpts.ScoreAveragingWindowLength = 30;
Train the Q-learning agent using the train function. Training can take several minutes to complete.
trainingStats = train(qAgent,env,trainOpts);
trainingStats.EpisodeIndex(end)
ans = 100
To continue training past the original stopping criterion, increase the episode count at which training stops by modifying the training options stored in the result object.

trainingStats.TrainingOptions.StopTrainingValue = 300;
Resume the training using the training data that exists in trainingStats.
trainingStats = train(qAgent,env,trainingStats);
trainingStats.EpisodeIndex(end)
ans = 300
figure()
plot(trainingStats.EpisodeIndex,trainingStats.EpisodeReward)
title('Episode Reward')
xlabel('EpisodeIndex')
ylabel('EpisodeReward')
Display the learned Q-value table from the agent critic.

qAgentFinalQ = getLearnableParameters(getCritic(qAgent));
qAgentFinalQ{1}
To validate the training results, simulate the agent in the training environment.
Before running the simulation, visualize the environment and configure the visualization to maintain a
trace of the agent states.
plot(env)
env.ResetFcn = @() 2;
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;
sim(qAgent,env)
Input Arguments
agents — Agents
agent object | array of agent objects
Agents to train, specified as a reinforcement learning agent object or as an array of agent objects.

If env is a multi-agent environment created with rlSimulinkEnv, specify agents as an array. The order of the agents in the array must match the agent order used to create env. Multi-agent training is not supported for MATLAB environments.
Note train updates the agents as training progresses. This is possible because each agent is a handle object. To preserve the original agent parameters for later use, save the agent to a MAT-file (if you copy the agent into a new variable, the new variable will also always point to the most recent agent version with updated parameters). For more information about handle objects, see “Handle Object Behavior”.
Note When training terminates, agents reflects the state of each agent at the end of the final
training episode. The rewards obtained by the final agents are not necessarily the highest achieved
during the training process, due to continuous exploration. To save agents during training, create an
rlTrainingOptions object specifying the SaveAgentCriteria and SaveAgentValue properties and
pass it to train as a trainOpts argument.
For more information about how to create and configure agents for reinforcement learning, see
“Reinforcement Learning Agents”.
env — Environment
reinforcement learning environment object
Environment in which the agents act, specified as a reinforcement learning environment object.
When env is a Simulink environment, calling train compiles and simulates the model associated
with the environment.
prevTrainStats — Training episode data from previous training session
rlTrainingResult object | array of rlTrainingResult objects

Use this argument to resume training from the exact point at which it stopped. This starts the training from the last values of the agent parameters and training results object obtained after the previous train function call. prevTrainStats contains, as one of its properties, the rlTrainingOptions object or the rlMultiAgentTrainingOptions object specifying the training option set. Therefore, to restart the training with updated training options, first change the training options in prevTrainStats using dot notation. If the maximum number of episodes was already reached in the previous training session, you must increase the maximum number of episodes.

For details about the rlTrainingResult object properties, see the trainStats output argument.
Output Arguments
trainStats — Training episode data
rlTrainingResult object | array of rlTrainingResult objects
Training episode data, returned as an rlTrainingResult object or an array of rlTrainingResult objects (one per agent), with the following properties.

EpisodeIndex — Episode numbers, returned as the column vector [1;2;…;N], where N is the number of episodes in the training run. This vector is useful if you want to plot the evolution of other quantities from episode to episode.

EpisodeReward — Reward for each episode, returned in a column vector of length N. Each entry contains the reward for the corresponding episode.

EpisodeSteps — Number of steps in each episode, returned in a column vector of length N. Each entry contains the number of steps in the corresponding episode.

AverageReward — Average reward over the averaging window specified in trainOpts, returned as a column vector of length N. Each entry contains the average reward computed at the end of the corresponding episode.
TotalAgentSteps — Total number of agent steps in training, returned as a column vector of length N. Each entry contains the cumulative sum of the entries in EpisodeSteps up to that point.

EpisodeQ0 — Critic estimate of the long-term reward using the current agent and the environment initial conditions, returned as a column vector of length N. Each entry is the critic estimate (Q0) for the agent of the corresponding episode. This field is present only for agents that have critics, such as rlDDPGAgent and rlDQNAgent.

SimulationInfo — Information collected during the simulations performed for training, returned as:

• For training in MATLAB environments, a structure containing the field SimulationError. This field is a column vector with one entry per episode. When the StopOnError option of rlTrainingOptions is "off", each entry contains any errors that occurred during the corresponding episode.
• For training in Simulink environments, a vector of Simulink.SimulationOutput objects containing simulation data recorded during the corresponding episode. Recorded data for an episode includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred during the corresponding episode.
TrainingOptions — Training options used for the training session, returned as:

• For a single agent, an rlTrainingOptions object. For more information, see the rlTrainingOptions reference page.
• For multiple agents, an rlMultiAgentTrainingOptions object. For more information, see the rlMultiAgentTrainingOptions reference page.
Tips
• train updates the agents as training progresses. To preserve the original agent parameters for
later use, save the agents to a MAT-file.
• By default, calling train opens the Reinforcement Learning Episode Manager, which lets you
visualize the progress of the training. The Episode Manager plot shows the reward for each
episode, a running average reward value, and the critic estimate Q0 (for agents that have critics).
The Episode Manager also displays various episode and training statistics. To turn off the
Reinforcement Learning Episode Manager, set the Plots option of trainOpts to "none".
• If you use a predefined environment for which there is a visualization, you can use plot(env) to
visualize the environment. If you call plot(env) before training, then the visualization updates
during training to allow you to visualize the progress of each episode. (For custom environments,
you must implement your own plot method.)
• Training terminates when the conditions specified in trainOpts are satisfied. To terminate
training in progress, in the Reinforcement Learning Episode Manager, click Stop Training.
Because train updates the agent at each episode, you can resume training by calling
train(agent,env,trainOpts) again, without losing the trained parameters learned during the
first call to train.
• During training, you can save candidate agents that meet conditions you specify with trainOpts.
For instance, you can save any agent whose episode reward exceeds a certain value, even if the
overall condition for terminating training is not yet satisfied. train stores saved agents in a MAT-
file in the folder you specify with trainOpts. Saved agents can be useful, for instance, to allow
you to test candidate agents generated during a long-running training process. For details about
saving criteria and saving location, see rlTrainingOptions.
Algorithms
In general, train performs the following iterative steps:

1 Initialize the agent.
2 For each episode:
  a Reset the environment and obtain the initial observation s.
  b Compute the initial action a = μ(s), and set the current observation and action accordingly.
  c While the episode is not finished:
    i Step the environment with action a to obtain the next observation s' and the reward r.
    ii Learn from the experience set (s,a,r,s').
    iii Compute the next action a' = μ(s').
    iv Update the current action with the next action (a←a') and update the current observation with the next observation (s←s').
    v Break if the episode termination conditions defined in the environment are met.
3 If the training termination condition defined by trainOpts is met, terminate training. Otherwise, begin the next episode.

The specifics of how train performs these computations depend on your configuration of the agent and environment. For instance, resetting the environment at the start of each episode can include randomizing initial state values, if you configure your environment to do so.
Version History
Introduced in R2019a
Starting in R2022a, train returns an object or an array of objects instead of a structure. The properties of the object match the fields of the structure returned in previous versions, so code that relies on dot notation continues to work. For example, consider the following training command.
trainStats = train(agent,env,trainOptions);
When training terminates, either because a termination condition is reached or because you click
Stop Training in the Reinforcement Learning Episode Manager, trainStats is returned as an
rlTrainingResult object.
The rlTrainingResult object contains the same training statistics previously returned in a structure, along with data to correctly recreate the training scenario and update the Episode Manager. You can use trainStats as the third argument for another train call, which (when executed with the same agents and environment) causes training to resume from the exact point at which it stopped.
Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.
To train in parallel, set the UseParallel and ParallelizationOptions options in the option set
trainOpts. Parallel training is not supported for multi-agent environments. For more information,
see rlTrainingOptions.
See Also
rlTrainingOptions | sim | rlMultiAgentTrainingOptions
Topics
“Train Reinforcement Learning Agents”
validateEnvironment
Package: rl.env
Syntax
validateEnvironment(env)
Description
validateEnvironment(env) validates a reinforcement learning environment. This function is
useful when:
• You are using a custom environment for which you supplied your own step and reset functions,
such as an environment created using rlCreateEnvTemplate.
• You are using an environment you created from a Simulink model using rlSimulinkEnv.
validateEnvironment resets the environment, generates an initial observation and action, and
simulates the environment for one or two steps (see “Algorithms”). If there are no
errors during these operations, validation is successful, and validateEnvironment returns no
result. If errors occur, these errors appear in the MATLAB command window. Use the errors to
determine what to change in your observation specification, action specification, custom functions, or
Simulink model.
Examples
Create and validate an environment for the rlwatertank model, which represents a control system containing a reinforcement learning agent. (For details about this model, see “Create Simulink Environment and Train Agent”.)
open_system('rlwatertank')
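The commands that create the environment object validated below do not appear here; a minimal sketch follows, in which the specification dimensions are assumptions.

% Observation and action specifications (dimensions are assumptions)
obsInfo = rlNumericSpec([3 1]);
actInfo = rlNumericSpec([1 1]);

% Create the Simulink environment interface for the RL Agent block
env = rlSimulinkEnv("rlwatertank","rlwatertank/RL Agent",obsInfo,actInfo);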
Now use validateEnvironment to check whether the model is configured correctly.
validateEnvironment(env)
validateEnvironment attempts to compile the model, initialize the environment and the agent,
and simulate the model. In this case, the RL Agent block is configured to use an agent called agent,
but no such variable exists in the MATLAB® workspace. Thus, the function returns an error
indicating the problem.
Create an appropriate agent for this system using the commands detailed in the “Create Simulink
Environment and Train Agent” example. In this case, load the agent from the
rlWaterTankDDPGAgent.mat file.
load rlWaterTankDDPGAgent
validateEnvironment(env)
Input Arguments
env — Environment to validate
environment object

Environment to validate, specified as a reinforcement learning environment object, such as an environment created using rlCreateEnvTemplate or rlSimulinkEnv.
Algorithms
validateEnvironment works by running a brief simulation of the environment and making sure
that the generated signals match the observation and action specifications you provided when you
created the environment.
MATLAB Environments
1 Reset the environment using the reset function associated with the environment.
2 Obtain the first observation and check whether it is consistent with the dimension, data type, and
range of values in the observation specification.
3 Generate a test action based on the dimension, data type, and range of values in the action
specification.
4 Simulate the environment for one step using the generated action and the step function
associated with the environment.
5 Obtain the new observation signal and check whether it is consistent with the dimension, data
type, and range of values in the observation specification.
Simulink Environments
For Simulink environments, validateEnvironment performs analogous checks by compiling and briefly simulating the model, without dirtying the model, and leaves all model parameters in the state they were in when you called the function.
Version History
Introduced in R2019a
See Also
rlCreateEnvTemplate | rlSimulinkEnv | rlFunctionEnv
Topics
“Create Simulink Environment and Train Agent”
“Create Custom MATLAB Environment from Template”
write
Package: rl.logging
Transfer stored data from the internal logger memory to the logging target
Syntax
write(lgr)
Description
write(lgr) transfers all stored logging contexts into the logger's target (either a MAT file or a
trainingProgressMonitor object).
Examples
This example shows how to log data to disk when training an agent using a custom training loop.

Create a FileLogger object using rlDataLogger.

flgr = rlDataLogger();
Set up the logger object. This operation initializes the object, performing setup tasks such as creating the directory in which to save the data files.
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
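A minimal sketch of such a loop follows; the stored values and grouping are illustrative.

for iter = 1:10
    % Store sample data in the logger memory under two context names,
    % associated with the current iteration number (illustrative values)
    store(flgr,"Context1",rand,iter);
    store(flgr,"Context2",rand,iter);
    % Transfer the stored data from memory to a MAT file
    write(flgr);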
end
Clean up the logger object. This operation performs cleanup tasks such as writing to file any data still in memory.
cleanup(flgr);
Input Arguments
lgr — Data logger object
FileLogger object | MonitorLogger object | ...
Version History
Introduced in R2022b
See Also
Functions
rlDataLogger | train | setup | store | cleanup
Objects
FileLogger | MonitorLogger
Topics
“Log Training Data To Disk”
“Function Handles”
“Handle Object Behavior”
3
Objects
quadraticLayer
Quadratic layer for actor or critic network
Description
A quadratic layer takes an input vector and outputs a vector of quadratic monomials constructed from the input elements. This layer is useful when you need a layer whose output is a quadratic function of its inputs, for example, to recreate the structure of quadratic value functions such as those used in LQR controller design.
For example, consider an input vector U = [u1 u2 u3]. For this input, a quadratic layer gives the
output Y = [u1*u1 u1*u2 u2*u2 u1*u3 u2*u3 u3*u3]. For an example that uses a
QuadraticLayer, see “Train DDPG Agent to Control Double Integrator System”.
Note The QuadraticLayer layer does not support inputs coming directly or indirectly from a
featureInputLayer or sequenceInputLayer.
Creation
Syntax
qLayer = quadraticLayer
qLayer = quadraticLayer(Name,Value)
Description

qLayer = quadraticLayer creates a quadratic layer with default property values.

qLayer = quadraticLayer(Name,Value) sets properties using name-value pairs. For example, quadraticLayer('Name','quadlayer') creates a quadratic layer and sets its Name property.
Properties
Name — Name of layer
'quadratic' (default) | character vector
Name of layer, specified as a character vector. To include a layer in a layer graph, you must specify a
nonempty unique layer name. If you train a series network with this layer and Name is set to '', then
the software automatically assigns a name to the layer at training time.
Description — Description of layer
character vector

Description of layer, specified as a character vector. When you create the quadratic layer, you can use this property to give it a description that helps you identify its purpose.
Examples
Create a quadratic layer that converts an input vector U into a vector of quadratic monomials
constructed from binary combinations of the elements of U.
qLayer = quadraticLayer
qLayer =
QuadraticLayer with properties:
Name: 'quadratic'
Learnable Parameters
No properties.
State Parameters
No properties.
Confirm that the layer produces the expected output. For instance, for U = [u1 u2 u3], the expected output is [u1*u1 u1*u2 u2*u2 u1*u3 u2*u3 u3*u3].

predict(qLayer,[1 2 3])

ans = 1×6

     1     2     4     3     6     9
You can incorporate qLayer into an actor network or critic network for reinforcement learning.
Version History
Introduced in R2019a
See Also
scalingLayer | softplusLayer
Topics
“Train DDPG Agent to Control Double Integrator System”
“Create Policies and Value Functions”
FileLogger
Log reinforcement learning training data to MAT files
Description
Use a FileLogger object to log data to MAT files, within the train function or inside a custom
training loop. To log data when using the train function, specify appropriate callback functions in
FileLogger, as shown in the examples. These callbacks are executed at different stages of training,
for example, EpisodeFinishedFcn is executed after the completion of an episode. The output of a
callback function is a structure containing the data to log at that stage of training.
Note FileLogger is a handle object. If you assign an existing FileLogger object to a new
FileLogger object, both the new object and the original one refer to the same underlying object in
memory. To preserve the original object parameters for later use, save the object to a MAT-file. For
more information about handle objects, see “Handle Object Behavior”.
Creation
Create a FileLogger object using rlDataLogger without any input arguments.
Properties
LoggingOptions — Object containing a set of logging options
MATFileLoggingOptions object (default)
Name or fully qualified path of the logging directory, specified as a string or a character array. This is
the name of the directory where the MAT files containing the logged data are saved. As a default, a
subdirectory called logs is created in the current folder during setup and files are saved there
during training.
Rule to name the MAT files, specified as a string or a character array. For example, the naming rule
"episode<id>" results in the file names episode001.mat, episode002.mat and so on.
MAT file version used to save the data, specified as a string or character array. The default is "-v7". For more information, see “MAT-File Versions”.
Option to use compression when saving data to a MAT file, specified as a logical variable. The default
is true. For more information, see “MAT-File Versions”.
Frequency for writing data to disk, specified as a positive integer. It is the number of episodes after
which data is saved to disk. The default is 1.
Maximum number of episodes, specified as a positive integer. When using train, the value is
automatically initialized. Set this value when using the logger object in a custom training loop. The
default is 500.
Callback to log data after episode completion, specified as a function handle object. The specified
function must return a structure containing the data to log, such as experiences, simulation
information, or initial observations.
Example: @myEpisodeLoggingFcn
Callback to log data after training step completion within an episode, specified as a function handle object. The specified function must return a structure containing the data to log, such as the state of the agent's exploration policy.

For multi-agent training, AgentStepFinishedFcn can be a cell array of function handles with as many elements as the number of agent groups.
Note Logging data using the AgentStepFinishedFcn callback is not supported when training
agents in parallel with the train function.
Example: @myAgentStepLoggingFcn
Callback to log data after completion of the learn subroutine, specified as a function handle object.
The specified function must return a structure containing the data to log, such as the training losses
of the actor and critic networks, or, for a model-based agent, the environment model training losses.
For multi-agent training, AgentLearnFinishedFcn can be a cell array of function handles with as many elements as the number of agent groups.
Example: @myLearnLoggingFcn
Object Functions
setup Set up reinforcement learning environment or initialize data logger object
cleanup Clean up reinforcement learning environment or data logger object
Examples
This example shows how to log data to disk when using train.
logger = rlDataLogger();
logger.LoggingOptions.LoggingDirectory = "myDataLog";
Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object. For a related example, see “Log
Training Data To Disk”.
logger.EpisodeFinishedFcn = @myEpisodeFinishedFcn;
logger.AgentStepFinishedFcn = @myAgentStepFinishedFcn;
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;
To train the agent, you can now call train, passing logger as an argument such as in the following
command.
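A sketch of such a command is shown below; it assumes an agent, an environment, and a training options object already exist in the workspace.

% Pass the logger object to train (agent, env, and trainOpts are assumed to exist)
trainResult = train(agent,env,trainOpts,Logger=logger);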
While the training progresses, data will be logged to the specified directory, according to the rule
specified in the FileNameRule property of logger.LoggingOptions.
logger.LoggingOptions.FileNameRule
ans =
"loggedData<id>"
This example shows how to log data to disk when training an agent using a custom training loop.
Create a FileLogger object using rlDataLogger, then set up the logger object. The setup operation initializes the object, performing setup tasks such as creating the directory in which to save the data files.

flgr = rlDataLogger();
setup(flgr);
Within a custom training loop, you can now store data to the logger object memory and write data to
file.
For this example, store random numbers to the file logger object, grouping them in the variables
Context1 and Context2. When you issue a write command, a MAT file corresponding to an iteration
and containing both variables is saved with the name specified in
flgr.LoggingOptions.FileNameRule, in the folder specified by
flgr.LoggingOptions.LoggingDirectory.
for iter = 1:10
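    % Illustrative loop body (reconstruction sketch): store sample data
    % under two context names for the current iteration, then transfer
    % it to a MAT file
    store(flgr,"Context1",rand,iter);
    store(flgr,"Context2",rand,iter);
    write(flgr);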
end
Clean up the logger object. This operation performs cleanup tasks such as writing to file any data still in memory.
cleanup(flgr);
Limitations
• Logging data using the AgentStepFinishedFcn callback is not supported when training agents
in parallel with the train function.
Version History
Introduced in R2022b
See Also
Functions
rlDataLogger | train | setup | store | write | cleanup
Objects
MonitorLogger
Topics
“Log Training Data To Disk”
“Function Handles”
“Handle Object Behavior”
MonitorLogger
Log reinforcement learning training data to monitor window
Description
Use a MonitorLogger object to log data to a monitor window, within the train function or inside a
custom training loop. To log data when using the train function, specify appropriate callback
functions in MonitorLogger, as shown in the examples. These callbacks are executed at different
stages of training, for example, EpisodeFinishedFcn is executed after the completion of an
episode. The output of a callback function is a structure containing the data to log at that stage of
training.
Note MonitorLogger is a handle object. If you assign an existing MonitorLogger object to a new
MonitorLogger object, both the new object and the original one refer to the same underlying object
in memory. To preserve the original object parameters for later use, save the object to a MAT-file. For
more information about handle objects, see “Handle Object Behavior”.
Creation
Create a MonitorLogger object using rlDataLogger specifying a trainingProgressMonitor
object as input argument.
Properties
LoggingOptions — Object containing a set of logging options
MonitorLoggingOptions object (default)
Frequency for writing data to the monitor window, specified as a positive integer. It is the number of
episodes after which data is transmitted to the trainingProgressMonitor object. The default is 1.
Maximum number of episodes, specified as a positive integer. When using train, the value is
automatically initialized. Set this value when using the logger object in a custom training loop. The
default is 500.
Callback to log data after episode completion, specified as a function handle object. The specified
function must return a structure containing the data to log, such as experiences, simulation
information, or initial observations.
Example: @myEpisodeLoggingFcn
Callback to log data after training step completion within an episode, specified as a function handle object. The specified function must return a structure containing the data to log, such as the state of the agent's exploration policy.

For multi-agent training, AgentStepFinishedFcn can be a cell array of function handles with as many elements as the number of agent groups.
Note Logging data using the AgentStepFinishedFcn callback is not supported when training
agents in parallel with the train function.
Example: @myAgentStepLoggingFcn
Callback to log data after completion of the learn subroutine, specified as a function handle object.
The specified function must return a structure containing the data to log, such as the training losses
of the actor and critic networks, or, for a model-based agent, the environment model training losses.
For multi-agent training, AgentLearnFinishedFcn can be a cell array of function handles with as many elements as the number of agent groups.
Example: @myLearnLoggingFcn
Object Functions
setup Set up reinforcement learning environment or initialize data logger object
cleanup Clean up reinforcement learning environment or data logger object
Examples
This example shows how to log and visualize data to the window of a trainingProgressMonitor
object when using train.
Create a trainingProgressMonitor object. Creating the object also opens a window associated
with the object.
monitor = trainingProgressMonitor();
logger = rlDataLogger(monitor);
Create callback functions to log the data (for this example, see the helper function section), and
specify the appropriate callback functions in the logger object.
logger.AgentLearnFinishedFcn = @myAgentLearnFinishedFcn;
To train the agent, you can now call train, passing logger as an argument such as in the following
command.
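A sketch of such a command is shown below; it assumes an agent, an environment, and a training options object already exist in the workspace.

% Pass the logger object to train (agent, env, and trainOpts are assumed to exist)
trainResult = train(agent,env,trainOpts,Logger=logger);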
While the training progresses, data will be logged to the training monitor object, and visualized in the
associated window.
Note that only scalar data can be logged with a monitor logger object.
Define a logging function that logs data periodically at the completion of the learning subroutine.

function dataToLog = myAgentLearnFinishedFcn(data)

    if mod(data.AgentLearnCount, 2) == 0
        dataToLog.ActorLoss = data.ActorLoss;
        dataToLog.CriticLoss = data.CriticLoss;
    else
        dataToLog = [];
    end
end
Limitations
• Only scalar data is supported when logging data with a MonitorLogger object. The structure
returned by the callback functions must contain fields with scalar data.
• Resuming of training from a previous training result is not supported when logging data with a
MonitorLogger object.
Version History
Introduced in R2022b
See Also
Functions
rlDataLogger | train | setup | store | write | cleanup
Objects
FileLogger | trainingProgressMonitor
Topics
“Log Training Data To Disk”
“Monitor Custom Training Loop Progress”
“Function Handles”
“Handle Object Behavior”
rlACAgent
Actor-critic reinforcement learning agent
Description
Actor-critic (AC) agents implement actor-critic algorithms such as A2C and A3C, which are model-
free, online, on-policy reinforcement learning methods. The actor-critic agent optimizes the policy
(actor) directly and uses a critic to estimate the return or future rewards. The action space can be
either discrete or continuous.
For more information, see “Actor-Critic Agents”. For more information on the different types of
reinforcement learning agents, see “Reinforcement Learning Agents”.
Creation
Syntax
agent = rlACAgent(observationInfo,actionInfo)
agent = rlACAgent(observationInfo,actionInfo,initOpts)
agent = rlACAgent(actor,critic)
agent = rlACAgent( ___ ,agentOptions)
Description

agent = rlACAgent(observationInfo,actionInfo) creates an actor-critic agent for an environment with the given observation and action specifications, using default initialization options. The actor and critic in the agent use default deep neural networks built from the specifications.

agent = rlACAgent(observationInfo,actionInfo,initOpts) creates an actor-critic agent using the specified agent initialization options object, initOpts.

agent = rlACAgent(actor,critic) creates an actor-critic agent with the specified actor and critic, using the default options for the agent.
agent = rlACAgent( ___ ,agentOptions) creates an actor-critic agent and sets the
AgentOptions property to the agentOptions input argument. Use this syntax after any of the input
arguments in the previous syntaxes.
Input Arguments
actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object

Actor that implements the policy, specified as an rlDiscreteCategoricalActor object (for a discrete action space) or an rlContinuousGaussianActor object (for a continuous action space). For more information on creating actor approximators, see “Create Policies and Value Functions”.
critic — Critic
rlValueFunction object
Critic that estimates the discounted long-term reward, specified as an rlValueFunction object. For
more information on creating critic approximators, see “Create Policies and Value Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
ActionInfo — Action specifications
rlFiniteSetSpec object | rlNumericSpec object

For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.
For a continuous action space, you must specify actionInfo as an rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
Option to use exploration policy when selecting actions, specified as one of the following logical values.
• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.
Note This option affects only simulation and deployment; it does not affect training.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create an actor-critic agent from the environment observation and action specifications.
agent = rlACAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from random observations.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
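The commands that load the environment and extract its specifications are sketched below, assuming the corresponding predefined environment keyword.

% Create the predefined environment (keyword is an assumption)
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);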
Create an agent initialization option object, specifying that each hidden fully connected layer in the network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.
rng(0)
Create an actor-critic agent from the environment observation and action specifications.
agent = rlACAgent(obsInfo,actInfo,initOpts);
Extract the deep neural networks from both the agent actor and critic.
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
Display the layers of the critic network, and verify that each hidden fully connected layer has 128 neurons.
criticNet.Layers
ans =
11x1 Layer array with layers:
plot(layerGraph(actorNet))
plot(layerGraph(criticNet))
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a discrete action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DQN Agent to
Balance Cart-Pole System”. This environment has a four-dimensional observation vector (cart position
and velocity, pole angle, and pole angle derivative), and a scalar action with two possible elements (a
force of either -10 or +10 N applied on the cart).
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env)

obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "CartPole States"
Description: "x, dx, theta, dtheta"
Dimension: [4 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.
rng(0)
For actor-critic agents, the critic estimates a value function, therefore it must take the observation
signal as input and return a scalar value.
To approximate the value function within the critic, use a deep neural network. Define the network as
an array of layer objects. Get the dimensions of the observation space from the environment
specification objects.
cnet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object, and display the number of weights.
cnet = dlnetwork(cnet);
summary(cnet)
Initialized: true
Inputs:
1 'input' 4 features
Create the critic. Actor-critic agents use an rlValueFunction object to implement the critic.
critic = rlValueFunction(cnet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.1411
Create a deep neural network to be used as approximation model within the actor. For actor-critic
agents, the actor executes a stochastic policy, which for discrete action spaces is implemented by a
discrete categorical actor. In this case the network must take the observation signal as input and
return a probability for each action. Therefore the output layer must have as many elements as the
number of possible actions.
anet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
Convert the network to a dlnetwork object, and display the number of weights.
anet = dlnetwork(anet);
summary(anet)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor. Actor-critic agents use an rlDiscreteCategoricalActor object to implement the actor for discrete action spaces.

actor = rlDiscreteCategoricalActor(anet,obsInfo,actInfo);

Create the AC agent using the actor and the critic.

agent = rlACAgent(actor,critic)

agent =
  rlACAgent with properties:
Specify some options for the agent, including training options for the actor and critic.
agent.AgentOptions.NumStepsToLookAhead=32;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold=1;
To check your agent, use getAction to return the action from a random observation.

getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Store these limits so you can easily access them later.
% Make sure action space upper and lower limits are finite
actInfo.LowerLimit=-2;
actInfo.UpperLimit=2;
The actor and critic networks are initialized randomly. You can ensure reproducibility by fixing the
seed of the random generator.
rng(0)
For actor-critic agents, the critic estimates a value function, therefore it must take the observation
signal as input and return a scalar value. To approximate the value function within the critic, use a
deep neural network.
Define the network as an array of layer objects, and get the dimensions of the observation space from
the environment specification object.
cNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object and display the number of weights.
cNet = dlnetwork(cNet);
summary(cNet)
Initialized: true
Inputs:
1 'input' 2 features
Create the critic using cNet. Actor-critic agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(cNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.0969
To approximate the policy within the actor, use a deep neural network. For actor-critic agents, the
actor executes a stochastic policy, which for continuous action spaces is implemented by a continuous
Gaussian actor. In this case the network must take the observation signal as input and return both a
mean value and a standard deviation value for each action. Therefore it must have two output layers
(one for the mean values, the other for the standard deviation values), each having as many elements
as the dimension of the action space.
Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input and output
layers, so you can later explicitly associate them with the appropriate channel.
% Input path
inPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netObsIn")
fullyConnectedLayer(prod(actInfo.Dimension),Name="infc")
];
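% Mean and standard-deviation output paths and layer graph assembly
% (sketch: the layer names match the connectLayers calls below; other
% choices, such as scaling to the action range, are illustrative
% assumptions)
meanPath = [
    tanhLayer(Name="tanhMean")
    fullyConnectedLayer(prod(actInfo.Dimension))
    scalingLayer(Name="scale",Scale=actInfo.UpperLimit)
    ];
sdevPath = [
    tanhLayer(Name="tanhStdv")
    fullyConnectedLayer(prod(actInfo.Dimension))
    softplusLayer(Name="splus")
    ];
aNet = layerGraph(inPath);
aNet = addLayers(aNet,meanPath);
aNet = addLayers(aNet,sdevPath);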
% Connect layers
aNet = connectLayers(aNet,"infc","tanhMean/in");
aNet = connectLayers(aNet,"infc","tanhStdv/in");
% Plot network
plot(aNet)
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
aNet = dlnetwork(aNet);
summary(aNet)
Initialized: true
Inputs:
1 'netObsIn' 2 features
Create the actor. Actor-critic agents use an rlContinuousGaussianActor object to implement the
actor for continuous action spaces.
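A sketch of this step follows, using the input and output layer names from the network sketch above.

% Create the Gaussian actor (layer names follow the sketch above)
actor = rlContinuousGaussianActor(aNet,obsInfo,actInfo, ...
    ActionMeanOutputNames="scale", ...
    ActionStandardDeviationOutputNames="splus", ...
    ObservationInputNames="netObsIn");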
getAction(actor,{rand(obsInfo.Dimension)})
Create the AC agent using the actor and the critic.

agent = rlACAgent(actor,critic);
Specify agent options, including training options for its actor and critic.
agent.AgentOptions.NumStepsToLookAhead = 32;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate=8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold=1;
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
For this example, load the predefined environment used for the “Train DQN Agent to Balance Cart-
Pole System” example. This environment has a four-dimensional observation vector (cart position and
velocity, pole angle, and pole angle derivative), and a scalar action with two possible elements (a
force of either -10 or +10 N applied on the cart).
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
For actor-critic agents, the critic estimates a value function, therefore it must take the observation
signal as input and return a scalar value.
To approximate the value function within the critic, use a recurrent deep neural network. Define the
network as an array of layer objects, and get the dimensions of the observation space from the
environment specification object. To create a recurrent network, use a sequenceInputLayer as the
input layer and include an lstmLayer as one of the other network layers.
cNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(10)
reluLayer
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
cNet = dlnetwork(cNet);
summary(cNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
Create the critic using cNet. Actor-critic agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(cNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.0344
Since the critic has a recurrent network, the actor must also use a recurrent network. For actor-critic agents, the actor executes a stochastic policy, which for discrete action spaces is implemented
by a discrete categorical actor. In this case the network must take the observation signal as input and
return a probability for each action. Therefore the output layer must have as many elements as the
number of possible actions.
aNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(20)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
Convert the network to a dlnetwork object and display the number of weights.
aNet = dlnetwork(aNet);
summary(aNet)
Initialized: true
Number of learnables: 2k
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
Create the actor using aNet. Actor-critic agents use an rlDiscreteCategoricalActor object to
implement the actor for discrete action spaces.
actor = rlDiscreteCategoricalActor(aNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options, and create an AC agent using the actor, the critic, and the agent options
object. Since the agent uses recurrent neural networks, NumStepsToLookAhead is treated as the
training trajectory length.
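A sketch of this step follows; the specific NumStepsToLookAhead value is illustrative.

agentOpts = rlACAgentOptions(NumStepsToLookAhead=32);
agent = rlACAgent(actor,critic,agentOpts);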
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
To train an agent using the asynchronous advantage actor-critic (A3C) method, you must set the
agent and parallel training options appropriately.
When creating the AC agent, set the NumStepsToLookAhead value to be greater than 1. Common
values are 64 and 128.
agentOpts = rlACAgentOptions(NumStepsToLookAhead=64);
Use agentOpts when creating your agent. Alternatively, create your agent first and then modify its
options, including the actor and critic options later using dot notation.
Create a training options object with parallel training enabled, and set the parallelization mode to asynchronous.

trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.Mode = "async";
Configure the workers to return gradient data to the host. Also, set the number of steps before the
workers send data back to the host to match the number of steps to look ahead.
trainOpts.ParallelizationOptions.DataToSendFromWorkers = ...
"gradients";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = ...
agentOpts.NumStepsToLookAhead;
For an example on asynchronous advantage actor-critic agent training, see “Train AC Agent to
Balance Cart-Pole System Using Parallel Computing”.
Tips
• For continuous action spaces, the rlACAgent object does not enforce the constraints set by the
action specification, so you must enforce action space constraints within the environment.
Version History
Introduced in R2019a
See Also
rlAgentInitializationOptions | rlACAgentOptions | rlValueFunction |
rlDiscreteCategoricalActor | rlContinuousGaussianActor | Deep Network Designer
Topics
“Actor-Critic Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlACAgentOptions
Options for AC agent
Description
Use an rlACAgentOptions object to specify options for creating actor-critic (AC) agents. To create
an actor-critic agent, use rlACAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlACAgentOptions
opt = rlACAgentOptions(Name,Value)
Description
opt = rlACAgentOptions creates a default option set for an AC agent. You can modify the object
properties using dot notation.

opt = rlACAgentOptions(Name,Value) creates the option set opt and sets its properties using
one or more name-value arguments.
Properties
NumStepsToLookAhead — Number of steps ahead
32 (default) | positive integer
Number of steps the agent interacts with the environment before learning from its experience,
specified as a positive integer. When the agent uses a recurrent neural network,
NumStepsToLookAhead is treated as the training trajectory length.
Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.
When gradients are computed during training, an additional gradient component is computed for
minimizing this loss function.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlACAgent Actor-critic reinforcement learning agent
Examples
Create an AC agent options object, specifying the discount factor.

opt = rlACAgentOptions('DiscountFactor',0.95)
opt =
rlACAgentOptions with properties:
NumStepsToLookAhead: 32
EntropyLossWeight: 0
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Version History
Introduced in R2019a
• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.AgentOptions.UseDeterministicExploitation = true;

• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.AgentOptions.UseDeterministicExploitation = false;

You can obtain the same behaviors using the UseExplorationPolicy property of the agent instead.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.

agent.UseExplorationPolicy = false;

• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.

agent.UseExplorationPolicy = true;

A value of 32 for the NumStepsToLookAhead property should work better than 1 for most
environments. If you have MATLAB R2020b or a later version and you want to reproduce how
rlACAgent behaved in versions prior to R2020b, set this value to 1.
See Also
Topics
“Actor-Critic Agents”
rlAdditiveNoisePolicy
Policy object to generate continuous noisy actions for custom training loops
Description
This object implements an additive noise policy, which returns continuous deterministic actions with
added noise, given an input observation. You can create an rlAdditiveNoisePolicy object from an
rlContinuousDeterministicActor or extract it from an rlDDPGAgent or rlTD3Agent. You can
then train the policy object using a custom training loop. If UseNoisyAction is set to 0 the policy
does not explore. This object is not compatible with generatePolicyBlock and
generatePolicyFunction. For more information on policies and value functions, see “Create
Policies and Value Functions”.
Creation
Syntax
policy = rlAdditiveNoisePolicy(actor)
policy = rlAdditiveNoisePolicy(actor,NoiseType=noiseType)
Description
policy = rlAdditiveNoisePolicy(actor) creates the additive noise policy object policy from
the continuous deterministic actor actor. It also sets the Actor property of policy to the input
argument actor.

policy = rlAdditiveNoisePolicy(actor,NoiseType=noiseType) also specifies the type of
noise distribution for the policy.
Properties
Actor — Continuous deterministic actor
rlContinuousDeterministicActor object
Noise type, specified as either "gaussian" (default, Gaussian noise) or "ou" (Ornstein-Uhlenbeck
noise). For more information on noise models, see “Noise Models” on page 3-365.
Example: "ou"
Option to enable noise decay, specified as a logical value: either true (default, enabling noise decay)
or false (disabling noise decay).
Example: false
Option to enable noisy actions, specified as a logical value: either true (default, adding noise to
actions, which helps exploration) or false (no noise is added to the actions). When noise is not
added to the actions the policy is deterministic and therefore it does not explore.
Example: false
Action specifications, specified as an rlNumericSpec object. This object defines the properties of the
environment action channel, such as its dimensions, data type, and name. Note that the name of the
action channel specified in actionInfo (if any) is not used.
Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.
Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2
Object Functions
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
Examples
Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.
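A minimal sketch of these specification objects, consistent with the four- and two-element dimensions used in the rest of this example:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);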
Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.
To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation and action spaces from
the environment specification objects.
layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
To check your actor, use getAction to return the action from a random observation, using the
current network weights.

act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
0.4013
0.0578
Create an additive noise policy object from actor.

policy = rlAdditiveNoisePolicy(actor)
policy =
rlAdditiveNoisePolicy with properties:
You can access the policy options using dot notation. For example, change the upper and lower limits
of the distribution.
policy.NoiseOptions.LowerLimit = -3;
policy.NoiseOptions.UpperLimit = 3;
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
0.1878
-0.1645
You can now train the policy with a custom training loop and then deploy it to your application.
Create observation and action specification objects. For this example, define the observation and
action spaces as continuous three- and one-dimensional spaces, respectively.
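A minimal sketch of these specification objects, consistent with the three- and one-element dimensions used in the rest of this example:

obsInfo = rlNumericSpec([3 1]);
actInfo = rlNumericSpec([1 1]);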
Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.
To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation and action spaces from
the environment specification objects.
layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(9)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Number of learnables: 46
Inputs:
1 'input' 3 features
Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
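This extract omits the actor check that produces the following output; a sketch of that call, using a random observation:

act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}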
ans = single
-0.2535
Create a policy object from actor, specifying an Ornstein-Uhlenbeck probability distribution for the
noise.
policy = rlAdditiveNoisePolicy(actor,NoiseType="ou")
policy =
rlAdditiveNoisePolicy with properties:
You can access the policy options using dot notation. For example, change the standard deviation of
the distribution.
policy.NoiseOptions.StandardDeviation = 0.6;
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = -0.1625
You can now train the policy with a custom training loop and then deploy it to your application.
Version History
Introduced in R2022a
See Also
Functions
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlStochasticActorPolicy | rlTD3Agent | rlDDPGAgent
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”
rlAgentInitializationOptions
Options for initializing reinforcement learning agents
Description
Use the rlAgentInitializationOptions object to specify initialization options for an agent. To
create an agent, use the specific agent creation function, such as rlACAgent.
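For instance, a minimal sketch of creating a default agent from environment specifications using an initialization options object (the environment env and the option value are assumptions for illustration):

% Get the observation and action specifications from an existing environment.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Initialize all agent networks with 128 units in each hidden layer.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
agent = rlACAgent(obsInfo,actInfo,initOpts);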
Creation
Syntax
initOpts = rlAgentInitializationOptions
initOpts = rlAgentInitializationOptions(Name,Value)
Description

initOpts = rlAgentInitializationOptions returns a default options object for initializing a
reinforcement learning agent. You can modify the object properties using dot notation.

initOpts = rlAgentInitializationOptions(Name,Value) creates the initialization options
object initOpts and sets its properties using one or more name-value arguments.
Properties
NumHiddenUnit — Number of units in each hidden fully connected layer
256 (default) | positive integer
Number of units in each hidden fully connected layer of the agent networks, except for the fully
connected layer just before the network output, specified as a positive integer. The value you set also
applies to any LSTM layers.
Example: 'NumHiddenUnit',64
If you set UseRNN to true, during agent creation the software inserts a recurrent LSTM layer with
the output mode set to sequence in the output path of the agent networks. Policy gradient and actor-
critic agents do not support recurrent neural networks. For more information on LSTM, see “Long
Short-Term Memory Networks”.
Example: 'UseRNN',true
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
rlTRPOAgent Trust region policy optimization reinforcement learning agent
Examples
Create an agent initialization options object, specifying the number of hidden neurons and use of a
recurrent neural network.
initOpts = rlAgentInitializationOptions('NumHiddenUnit',64,'UseRNN',true)
initOpts =
rlAgentInitializationOptions with properties:
NumHiddenUnit: 64
UseRNN: 1
You can modify the options using dot notation. For example, set the number of hidden units to 128.
initOpts.NumHiddenUnit = 128
initOpts =
rlAgentInitializationOptions with properties:
NumHiddenUnit: 128
UseRNN: 1
Version History
Introduced in R2020b
See Also
getActionInfo | getObservationInfo
Topics
“Reinforcement Learning Agents”
rlContinuousDeterministicActor
Deterministic actor with a continuous action space for reinforcement learning agents
Description
This object implements a function approximator to be used as a deterministic actor within a
reinforcement learning agent with a continuous action space. A continuous deterministic actor takes
an environment state as input and returns as output the action that maximizes the expected
discounted cumulative long-term reward, thereby implementing a deterministic policy. After you
create an rlContinuousDeterministicActor object, use it to create a suitable agent, such as
rlDDPGAgent. For more information on creating representations, see “Create Policies and Value
Functions”.
Creation
Syntax
actor = rlContinuousDeterministicActor(net,observationInfo,actionInfo)
actor = rlContinuousDeterministicActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames)
actor = rlContinuousDeterministicActor({basisFcn,W0},observationInfo,
actionInfo)
Description

actor = rlContinuousDeterministicActor(net,observationInfo,actionInfo) creates a
continuous deterministic actor object using the deep neural network net as underlying approximator,
and sets the ObservationInfo and ActionInfo properties of actor to the observationInfo and
actionInfo input arguments, respectively.

Note actor does not enforce constraints set by the action specification; therefore, when using this
actor, you must enforce action space constraints within the environment.
actor = rlContinuousDeterministicActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames) specifies the names of the network input
layers to be associated with the environment observation channels. The function assigns, in
sequential order, each environment observation channel specified in observationInfo to the layer
specified by the corresponding name in the string array netObsNames. Therefore, the network input
layers, ordered as the names in netObsNames, must have the same data type and dimensions as the
observation specifications, as ordered in observationInfo.
actor = rlContinuousDeterministicActor({basisFcn,W0},observationInfo,
actionInfo) creates a continuous deterministic actor object using a custom basis function as
underlying approximator. The first input argument is a two-element cell array whose first element is
the handle basisFcn to a custom basis function and whose second element is the initial weight
vector W0. This function sets the ObservationInfo and ActionInfo properties of actor to the
observationInfo and actionInfo input arguments, respectively.
Input Arguments
Deep neural network used as the underlying approximator within the actor, specified as a dlnetwork
object or as another Deep Learning Toolbox neural network object (which is converted internally, as
described in the following note).
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.
The network must have the environment observation channels as inputs and a single output layer
representing the action.
The learnable parameters of the actor are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Network input layer names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. When you use the ObservationInputNames
name-value argument with netObsNames, the function assigns, in sequential order, each environment
observation channel specified in observationInfo to the network input layer specified by the
corresponding name in the string array netObsNames.
Therefore, the network input layers, ordered as the names in netObsNames, must have the same
data type and dimensions as the observation specifications, as ordered in observationInfo.
Note Of the information specified in observationInfo, the function only uses the data type and
dimension of each channel, but not its (optional) name or description.
Example: {"NetInput1_airspeed","NetInput2_altitude"}
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The action
to be taken based on the current observation, which is the output of the actor, is the vector a =
W'*B, where W is a weight matrix containing the learnable parameters and B is the column vector
returned by the custom basis function.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo.
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]
Initial value of the basis function weights W, specified as a matrix having as many rows as the length
of the vector returned by the basis function and as many columns as the dimension of the action
space.
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
Action specifications for a continuous action space, specified as an rlNumericSpec object defining
properties such as dimensions, data type and name of the action signals.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually.
For custom basis function representations, the action signal must be a scalar, a column vector, or a
discrete action.
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox™ software and a CUDA® enabled
NVIDIA® GPU. For more information on supported GPUs see “GPU Computing Requirements”
(Parallel Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning
agent
getAction Obtain action from agent, actor, or policy object given environment
observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
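A minimal sketch of these specification objects, matching the dimensions used in the rest of this example:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);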
To approximate the policy within the actor, use a deep neural network. The input of the network must
accept a four-element vector (the observation vector just defined by obsInfo), and its output must
be the action and be a two-element vector, as defined by actInfo.
net = [featureInputLayer(4)
fullyConnectedLayer(2)];
Convert the network to a dlnetwork object and display the number of learnable parameters.
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 10
Inputs:
1 'input' 4 features
Create the actor object with rlContinuousDeterministicActor, using the network and the
observation and action specification objects as input arguments. The network input layer is
automatically associated with the environment observation channel according to the dimension
specifications in obsInfo.
actor = rlContinuousDeterministicActor(net,obsInfo,actInfo)

actor =
rlContinuousDeterministicActor with properties:
To check your actor, use getAction to return the action from a random observation, using the
current network weights.
act = getAction(actor, ...
{rand(obsInfo.Dimension)});
act{1}
-0.5054
1.5390
You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
actInfo = rlNumericSpec([2 1]);
To approximate the policy within the actor, use a deep neural network. The input of the network (here
called myobs) must accept a four-element vector (the observation vector just defined by obsInfo),
and its output must be the action (here called myact) and be a two-element vector, as defined by
actInfo.
Create the network as an array of layer objects. Name the network input layer netObsIn so you can
later explicitly associate it to the observation input channel.
net = [
featureInputLayer(4,Name="netObsIn")
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(2)];
Convert the network to a dlnetwork object, and display the number of learnable parameters.
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'netObsIn' 4 features
Create the actor object with rlContinuousDeterministicActor, using the network, the
observation and action specification objects, and the name of the network input layer to be associated
with the environment observation channel.
actor = rlContinuousDeterministicActor(net,obsInfo,actInfo, ...
ObservationInputNames="netObsIn")

actor =
rlContinuousDeterministicActor with properties:
To check your actor, use getAction to return the action from a random observation, using the
current network weights.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
0.4013
0.0578
You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels, the first is a two-by-two continuous matrix and the second is a scalar that
can assume only two values, 0 and 1.
Create a continuous action space specification object (or alternatively use getActionInfo to extract
the specification object from an environment). For this example, define the action space as a
continuous three-dimensional space, so that a single action is a column vector containing three
doubles.
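A minimal sketch of these specification objects, matching the channel dimensions and discrete values described above:

obsInfo = [rlNumericSpec([2 2]) rlFiniteSetSpec([0 1])];
actInfo = rlNumericSpec([3 1]);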
Create a custom basis function with two input arguments in which each output element is a function
of the observations defined by obsInfo.
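The original basis function is not shown in this extract. An illustrative example that returns a four-element column vector (matching the four rows of the weight matrix W0 defined below); the specific elements are arbitrary:

myBasisFcn = @(obsA,obsB) [obsA(1,1); obsA(2,1)*obsA(1,2); obsA(2,2)^2; obsB];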
The output of the actor is the vector W'*myBasisFcn(obsA,obsB), which is the action taken as a
result of the given observation. The weight matrix W contains the learnable parameters and must have
as many rows as the length of the basis function output and as many columns as the dimension of the
action space.
W0 = rand(4,3);
Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight matrix. The second and third arguments are, respectively, the
observation and action specification objects.
actor = rlContinuousDeterministicActor({myBasisFcn,W0},obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
To check your actor, use the getAction function to return the action from a given observation, using
the current parameter matrix.
a = getAction(actor,{rand(2,2),0})
a{1}
ans = 3×1
1.3192
0.8420
1.5053
Note that the actor does not enforce the set constraint for the discrete set elements.
a = getAction(actor,{rand(2,2),-1});
a{1}
ans = 3×1
2.7890
1.8375
3.0855
You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).
Create observation and action information. You can also obtain these specifications from an
environment. For this example, define the observation space as a continuous four-dimensional space,
so that a single observation is a column vector containing four doubles, and the action space as a
continuous two-dimensional space, so that a single action is a column vector containing two doubles.
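A minimal sketch of these specification objects:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);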
To approximate the policy within the actor, use a recurrent deep neural network. You can obtain the
dimension of the observation and action spaces from the environment specification objects.
Create a neural network as an array of layer objects. Since this network is recurrent, use a
sequenceInputLayer as the input layer and at least one lstmLayer.
net = [sequenceInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(10)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(20)
fullyConnectedLayer(actInfo.Dimension(1))
tanhLayer];
Convert the network to a dlnetwork object and display the number of learnable parameters.
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
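The actor creation call is not shown in this extract; a sketch consistent with the surrounding text:

actor = rlContinuousDeterministicActor(net,obsInfo,actInfo);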
To check your actor, use getAction to return the action from a random observation, given the
current network weights.
a = getAction(actor, ...
{rand(obsInfo.Dimension)});
a{1}
-0.0742
0.0158
You can now use the actor to create a suitable agent (such as rlDDPGAgent or rlTD3Agent).
Version History
Introduced in R2022a
See Also
Functions
rlDiscreteCategoricalActor | rlContinuousGaussianActor | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlContinuousDeterministicRewardFunction
Deterministic reward function approximator object for neural network-based environment
Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify the reward function approximator using an
rlContinuousDeterministicRewardFunction object. Do so when you do not know a ground-
truth reward signal for your environment but you expect the reward signal to be deterministic.
The reward function approximator object uses a deep neural network as internal approximation
model to predict the reward signal for the environment, given some combination of the current
observations, current actions, and next observations (depending on which input names you specify).
Creation
Syntax
rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,observationInfo,
actionInfo,Name=Value)
Description
rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,observationInfo,
actionInfo,Name=Value) creates the deterministic reward function approximator object
rwdFcnAppx using the deep neural network net and sets the ObservationInfo and ActionInfo
properties.
When creating a reward function you must specify the names of the deep neural network inputs using
a combination of the ObservationInputNames, ActionInputNames, and
NextObservationInputNames name-value arguments, as described in the input argument
descriptions.

You can also specify the UseDevice property using an optional name-value argument. For
example, to use a GPU for prediction, specify UseDevice="gpu".
Input Arguments
Deep neural network with a scalar output value, specified as a dlnetwork object.
The input layer names for this network must match the input names specified using the
ObservationInputNames, ActionInputNames, and NextObservationInputNames. The
dimensions of the input layers must match the dimensions of the corresponding observation and
action specifications in ObservationInfo and ActionInfo, respectively.
The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.
Action input layer names, specified as a string or string array. Specify ActionInputNames when you
expect the reward signal to depend on the current action value.
The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.
Next observation input layer names, specified as a string or string array. Specify
NextObservationInputNames when you expect the reward signal to depend on the next
environment observation.
The number of next observation input names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Training or simulating a network on a GPU involves device-specific numerical round-off errors. These
errors can produce different results compared to performing the same operations using a CPU.
Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the reward function, create a deep neural network. For this example, the network has
two input channels, one for the current action and one for the next observations. The single output
channel contains a scalar, which represents the value of the predicted reward.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specifications, and specify a name for the input layers, so you can
later explicitly associate them with the appropriate environment channel.
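The definitions of the next-observation and action input paths are not shown in this extract. A sketch that uses the layer names expected by the connectLayers calls below (the layer sizes are taken from the specification objects):

% Input path for the next observation
nextStatePath = featureInputLayer(obsInfo.Dimension(1),Name="nextState");

% Input path for the action
actionPath = featureInputLayer(actInfo.Dimension(1),Name="action");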
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC2")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu3")
fullyConnectedLayer(1,Name="reward")];
net = layerGraph(nextStatePath);
net = addLayers(net,actionPath);
net = addLayers(net,commonPath);
net = connectLayers(net,"nextState","concat/in1");
net = connectLayers(net,"action","concat/in2");
plot(net)
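This extract omits the conversion to a dlnetwork object whose summary appears below; a sketch of those calls:

net = dlnetwork(net);
summary(net)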
Initialized: true
Inputs:
1 'nextState' 4 features
2 'action' 1 features
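The creation of the reward function approximator object is not shown in this extract. A sketch using the input layer names defined above (no observation input names are specified, since the reward here depends only on the action and the next observation):

rwdFcnAppx = rlContinuousDeterministicRewardFunction(net,obsInfo,actInfo, ...
    ActionInputNames="action", ...
    NextObservationInputNames="nextState");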
Using this reward function object, you can predict the next reward value based on the current action
and next observation. For example, predict the reward for a random action and next observation.
Since, for this example, only the action and the next observation influence the reward, use an empty
cell array for the current observation.
act = rand(actInfo.Dimension);
nxtobs = rand(obsInfo.Dimension);
reward = predict(rwdFcnAppx,{}, {act}, {nxtobs})
reward = single
0.1034
Version History
Introduced in R2022a
See Also
Objects
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction | rlContinuousGaussianRewardFunction |
rlNeuralNetworkEnvironment | rlIsDoneFunction | evaluate | gradient | accelerate
Topics
“Model-Based Policy Optimization Agents”
rlContinuousDeterministicTransitionFunction
Deterministic transition function approximator object for neural network-based environment
Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify deterministic transition function approximators using
rlContinuousDeterministicTransitionFunction objects.
A transition function approximator object uses a deep neural network to predict the next observations
based on the current observations and actions.
Creation
Syntax
tsnFcnAppx = rlContinuousDeterministicTransitionFunction(net,observationInfo,
actionInfo,Name=Value)
Description
tsnFcnAppx = rlContinuousDeterministicTransitionFunction(net,observationInfo,
actionInfo,Name=Value) creates a deterministic transition function approximator object using
the deep neural network net and sets the ObservationInfo and ActionInfo properties.
When creating a deterministic transition function approximator you must specify the names of the
deep neural network inputs and outputs using the ObservationInputNames, ActionInputNames,
and NextObservationOutputNames name-value pair arguments.
You can also specify the PredictDiff and UseDevice properties using optional name-value pair
arguments. For example, to use a GPU for prediction, specify UseDevice="gpu".
Input Arguments
The input layer names for this network must match the input names specified using
ObservationInputNames and ActionInputNames. The dimensions of the input layers must match
the dimensions of the corresponding observation and action specifications in ObservationInfo and
ActionInfo, respectively.
The output layer names for this network must match the output names specified using
NextObservationOutputNames. The dimensions of the input layers must match the dimensions of
the corresponding observation specifications in ObservationInfo.
The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.
The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.
The number of next observation output names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
PredictDiff — Option to predict the difference between the current observation and the
next observation
false (default) | true
Option to predict the difference between the current observation and the next observation, specified
as one of the following logical values.
• false — Select this option if net outputs the value of the next observation.
• true — Select this option if net outputs the difference between the next observation and the
current observation. In this case, the predict function computes the next observation by adding
the current observation to the output of net.
Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a deep neural network. The network has two input channels, one for the current observations
and one for the current actions. The single output channel is for the predicted next observation.
statePath = featureInputLayer(obsInfo.Dimension(1),...
Normalization="none",Name="state");
actionPath = featureInputLayer(actInfo.Dimension(1),...
Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64, Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObservation")];
tsnNet = layerGraph(statePath);
tsnNet = addLayers(tsnNet,actionPath);
tsnNet = addLayers(tsnNet,commonPath);
tsnNet = connectLayers(tsnNet,"state","concat/in1");
tsnNet = connectLayers(tsnNet,"action","concat/in2");
plot(tsnNet)
Convert the network to a dlnetwork object, then create the deterministic transition function object
using the layer names defined above.

tsnNet = dlnetwork(tsnNet);

tsnFcnAppx = rlContinuousDeterministicTransitionFunction(tsnNet,obsInfo,actInfo, ...
ObservationInputNames="state", ...
ActionInputNames="action", ...
NextObservationOutputNames="nextObservation");
Using this transition function object, you can predict the next observation based on the current
observation and action. For example, predict the next observation for a random observation and
action.
obs = rand(obsInfo.Dimension);
act = rand(actInfo.Dimension);
nextObsP = predict(tsnFcnAppx,{obs},{act})
nextObsP{1}
-0.1172
0.1168
0.0493
-0.0155
nextObsE = evaluate(tsnFcnAppx,{obs,act})
nextObsE{1}
-0.1172
0.1168
0.0493
-0.0155
Version History
Introduced in R2022a
See Also
Objects
rlNeuralNetworkEnvironment | rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | evaluate | gradient | accelerate
Topics
“Model-Based Policy Optimization Agents”
rlContinuousGaussianActor
Stochastic Gaussian actor with a continuous action space for reinforcement learning agents
Description
This object implements a function approximator to be used as a stochastic actor within a
reinforcement learning agent with a continuous action space. A continuous Gaussian actor takes an
environment state as input and returns as output a random action sampled from a parametrized
Gaussian probability distribution, thereby implementing a stochastic policy. After you create an
rlContinuousGaussianActor object, use it to create a suitable agent, such as an rlACAgent or
rlPGAgent agent. For more information on creating representations, see “Create Policies and Value
Functions”.
Creation
Syntax
actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=netStdvActName)
actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=
netStdActName,ObservationInputNames=netObsNames)
Description
actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=netStdvActName) creates a
Gaussian stochastic actor with a continuous action space using the deep neural network net as
function approximator. Here, net must have two differently named output layers, each with as many
elements as the number of dimensions of the action space, as specified in actionInfo. The two
output layers calculate the mean and standard deviation of each component of the action. The actor
uses these layers, according to the names specified in the strings netMeanActName and
netStdActName, to represent the Gaussian probability distribution from which the action is
sampled. The function sets the ObservationInfo and ActionInfo properties of actor to the input
arguments observationInfo and actionInfo, respectively.
Note actor does not enforce constraints set by the action specification, therefore, when using this
actor, you must enforce action space constraints within the environment.
actor = rlContinuousGaussianActor(net,observationInfo,
actionInfo,ActionMeanOutputNames=
netMeanActName,ActionStandardDeviationOutputNames=
netStdActName,ObservationInputNames=netObsNames) specifies the names of the network
input layers to be associated with the environment observation channels. The function assigns, in
sequential order, each environment observation channel specified in observationInfo to the layer
specified by the corresponding name in the string array netObsNames. Therefore, the network input
layers, ordered as the names in netObsNames, must have the same data type and dimensions as the
observation specifications, as ordered in observationInfo.
Input Arguments
Deep neural network used as the underlying approximator within the actor. The network must have
two differently named output layers each with as many elements as the number of dimensions of the
action space, as specified in actionInfo. The two output layers calculate the mean and standard
deviation of each component of the action. The actor uses these layers, according to the names
specified in the strings netMeanActName and netStdActName, to represent the Gaussian
probability distribution from which the action is sampled.
Note Standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore, the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, and the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
neural network object from the Deep Learning Toolbox. The resulting dlnet is the dlnetwork object
that you use for your critic or actor. This practice allows a greater level of insight and control for
cases in which the conversion is not straightforward and might require additional specifications.
The learnable parameters of the actor are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
netMeanActName — Names of the network output layers corresponding to the mean values
of the action channel
string | character vector
Names of the network output layers corresponding to the mean values of the action channel,
specified as a string or character vector. The actor uses this name to select the network output layer
that returns the mean values of each elements of the action channel. Therefore, this network output
layer must be named as indicated in netMeanActName. Furthermore, it must be a scaling layer that
scales the returned mean values to the desired action range.
Note Of the information specified in actionInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: "myNetOut_Force_Mean_Values"
Names of the network output layers corresponding to the standard deviations of the action channel,
specified as a string or character vector. The actor uses this name to select the network output layer
that returns the standard deviations of each elements of the action channel. Therefore, this network
output layer must be named as indicated in netStdvActName. Furthermore, it must be a softplus or
ReLU layer, to enforce nonnegativity of the returned standard deviations.
Note Of the information specified in actionInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: "myNetOut_Force_Standard_Deviations"
Network input layer names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. When you use the ObservationInputNames
name-value argument with netObsNames, the function assigns, in sequential order, each environment
observation channel specified in observationInfo to the network input layer specified by the
corresponding name in the string array netObsNames. Therefore, the network input layers, ordered
as the names in netObsNames, must have the same data type and dimensions as the observation
specifications, as ordered in observationInfo.
Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: {"NetInput1_airspeed","NetInput2_altitude"}
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
Action specifications, specified as an rlNumericSpec object. This object defines the properties of the
environment action channel, such as its dimensions, data type, and name. Note that the function does
not use the name of the action channel specified in actionInfo.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: 'UseDevice',"gpu"
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment
observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous five-dimensional space, so that a single observation is a column vector containing five
doubles.
obsInfo = rlNumericSpec([5 1]);
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
three-dimensional space, so that a single action is a column vector containing three doubles, each
between -10 and 10.
actInfo = rlNumericSpec([3 1], ...
LowerLimit=-10, ...
UpperLimit=10);
To approximate the policy within the actor, use a deep neural network.
For a continuous Gaussian actor, the network must take the observation signal as input and return
both a mean value and a standard deviation value for each action. Therefore it must have two output
layers (one for the mean values, the other for the standard deviation values), each having as many
elements as the dimension of the action space. You can obtain the dimensions of the observation and
action spaces from the environment specification objects (for example, prod(obsInfo.Dimension)
returns the number of observation elements regardless of whether the observation space is a column
vector, row vector, or matrix).
Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Create each network path as an array of layer objects. Specify a name for the input and output layers,
so you can later explicitly associate them with the correct channels.
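The path definitions themselves are not shown in this extract. A sketch consistent with the layer names used by the connectLayers calls and the actor creation below (the layer sizes are assumptions):

% Input path: observation input followed by a common fully connected layer
inPath = [ featureInputLayer(prod(obsInfo.Dimension),Name="netOin")
           fullyConnectedLayer(prod(actInfo.Dimension),Name="infc") ];

% Path for the mean values: scale the output to the action range
meanPath = [ tanhLayer(Name="tanhMean")
             fullyConnectedLayer(prod(actInfo.Dimension),Name="meanFC")
             scalingLayer(Name="scale",Scale=actInfo.UpperLimit) ];

% Path for the standard deviations: softplus enforces nonnegativity
stdPath = [ tanhLayer(Name="tanhStdv")
            fullyConnectedLayer(prod(actInfo.Dimension),Name="stdFC")
            softplusLayer(Name="splus") ];

% Assemble the layer graph
net = layerGraph(inPath);
net = addLayers(net,meanPath);
net = addLayers(net,stdPath);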
% Connect layers
net = connectLayers(net,"infc","tanhMean/in");
net = connectLayers(net,"infc","tanhStdv/in");
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 42
Inputs:
1 'netOin' 5 features
Create the actor with rlContinuousGaussianActor, using the network, the observation and action
specification objects, and the names of the network input and output layers.
actor = rlContinuousGaussianActor(net, obsInfo, actInfo, ...
ActionMeanOutputNames="scale",...
ActionStandardDeviationOutputNames="splus",...
ObservationInputNames="netOin");
To check your actor, use getAction to return an action from a random observation vector, using the
current network weights. Each of the three elements of the action vector is a random sample from the
Gaussian distribution with mean and standard deviation calculated, as a function of the current
observation, by the neural network.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
-12.0285
1.7628
10.8733
To return the Gaussian distribution of the action, given an observation, use evaluate.
dist = evaluate(actor,{rand(obsInfo.Dimension)});
dist{1}
-5.6127
3.9449
9.6213
dist{2}
0.8516
0.8366
0.7004
You can now use the actor to create a suitable agent (such as rlACAgent, rlPGAgent,
rlSACAgent, rlPPOAgent, or rlTRPOAgent).
Version History
Introduced in R2022a
See Also
Functions
rlContinuousDeterministicActor | rlDiscreteCategoricalActor | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlContinuousGaussianRewardFunction
Stochastic Gaussian reward function approximator object for neural network-based environment
Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify the reward function approximator using an rlContinuousGaussianRewardFunction
object. Do so when you do not know a ground-truth reward signal for your environment and you
expect the reward signal to be stochastic.

The reward function object uses a deep neural network as internal approximation model to predict
the reward signal for the environment, given some combination of the current observations, current
actions, and next observations (depending on which input names you specify).
Creation
Syntax
rwdFcnAppx = rlContinuousGaussianRewardFunction(net,observationInfo,
actionInfo,Name=Value)
Description
rwdFcnAppx = rlContinuousGaussianRewardFunction(net,observationInfo,
actionInfo,Name=Value) creates a stochastic reward function using the deep neural network net
and sets the ObservationInfo and ActionInfo properties.
When creating a reward function you must specify the names of the deep neural network inputs using
a combination of the ObservationInputNames, ActionInputNames, and
NextObservationInputNames name-value arguments, as described in the input argument
descriptions.
You must also specify the names of the deep neural network outputs using the
RewardMeanOutputName and RewardStandardDeviationOutputName name-value pair
arguments.
You can also specify the UseDevice property using an optional name-value pair argument. For
example, to use a GPU for prediction, specify UseDevice="gpu".
Input Arguments
Deep neural network with a scalar output value, specified as a dlnetwork object.
The input layer names for this network must match the input names specified using the
ObservationInputNames, ActionInputNames, and NextObservationInputNames. The
dimensions of the input layers must match the dimensions of the corresponding observation and
action specifications in ObservationInfo and ActionInfo, respectively.
The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.
Action input layer names, specified as a string or string array. Specify ActionInputNames when you
expect the reward signal to depend on the current action value.
The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.
Next observation input layer names, specified as a string or string array. Specify
NextObservationInputNames when you expect the reward signal to depend on the next
environment observation.
The number of next observation input names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a deep neural network. The network has three input channels: one for the current observations,
one for the current action, and one for the next observations. The single output channel is for the
predicted reward value.
statePath = featureInputLayer(obsInfo.Dimension(1),Name="obs");
actionPath = featureInputLayer(actInfo.Dimension(1),Name="action");
nextStatePath = featureInputLayer(obsInfo.Dimension(1),Name="nextObs");
commonPath = [concatenationLayer(1,3,Name="concat")
fullyConnectedLayer(32,Name="fc")
reluLayer(Name="relu1")
fullyConnectedLayer(32,Name="fc2")];
meanPath = [reluLayer(Name="rewardMeanRelu")
fullyConnectedLayer(1,Name="rewardMean")];
stdPath = [reluLayer(Name="rewardStdRelu")
fullyConnectedLayer(1,Name="rewardStdFc")
softplusLayer(Name="rewardStd")];
rwdNet = layerGraph(statePath);
rwdNet = addLayers(rwdNet,actionPath);
rwdNet = addLayers(rwdNet,nextStatePath);
rwdNet = addLayers(rwdNet,commonPath);
rwdNet = addLayers(rwdNet,meanPath);
rwdNet = addLayers(rwdNet,stdPath);
rwdNet = connectLayers(rwdNet,"nextObs","concat/in1");
rwdNet = connectLayers(rwdNet,"action","concat/in2");
rwdNet = connectLayers(rwdNet,"obs",'concat/in3');
rwdNet = connectLayers(rwdNet,"fc2","rewardMeanRelu");
rwdNet = connectLayers(rwdNet,"fc2","rewardStdRelu");
plot(rwdNet)
rwdNet = dlnetwork(rwdNet);
rwdFncAppx = rlContinuousGaussianRewardFunction(...
rwdNet,obsInfo,actInfo,...
ObservationInputNames="obs",...
ActionInputNames="action", ...
NextObservationInputNames="nextObs", ...
RewardMeanOutputNames="rewardMean", ...
RewardStandardDeviationOutputNames="rewardStd");
Using this reward function object, you can predict the reward value based on the current observation,
action, and next observation. For example, predict the reward for a random observation, action, and next observation. The
reward value is sampled from a Gaussian distribution with the mean and standard deviation output by
the reward network.
obs = rand(obsInfo.Dimension);
act = rand(actInfo.Dimension);
nextObs = rand(obsInfo.Dimension(1),1);
predRwd = predict(rwdFncAppx,{obs},{act},{nextObs})
predRwd = single
-0.1308
You can obtain the mean value and standard deviation of the Gaussian distribution for the predicted
reward using evaluate.
predRwdDist = evaluate(rwdFncAppx,{obs,act,nextObs})
Version History
Introduced in R2022a
See Also
Objects
rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlNeuralNetworkEnvironment |
rlIsDoneFunction | evaluate | gradient | accelerate
Topics
“Model-Based Policy Optimization Agents”
rlContinuousGaussianTransitionFunction
Stochastic Gaussian transition function approximator object for neural network-based environment
Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify stochastic transition function approximators using
rlContinuousGaussianTransitionFunction objects.
A transition function approximator object uses a deep neural network as its internal approximation
model to predict the next observations based on the current observations and actions.
Creation
Syntax
tsnFcnAppx = rlContinuousGaussianTransitionFunction(net,observationInfo,
actionInfo,Name=Value)
Description
tsnFcnAppx = rlContinuousGaussianTransitionFunction(net,observationInfo,
actionInfo,Name=Value) creates the stochastic transition function approximator object
tsnFcnAppx using the deep neural network net and sets the ObservationInfo and ActionInfo
properties.
When creating a stochastic transition function approximator, you must specify the names of the deep
neural network inputs and outputs using the ObservationInputNames, ActionInputNames,
NextObservationMeanOutputNames, and NextObservationStandardDeviationOutputNames
name-value pair arguments.
You can also specify the PredictDiff and UseDevice properties using optional name-value pair
arguments. For example, to use a GPU for prediction, specify UseDevice="gpu".
Input Arguments
The input layer names for this network must match the input names specified using
ObservationInputNames and ActionInputNames. The dimensions of the input layers must match
the dimensions of the corresponding observation and action specifications in ObservationInfo and
ActionInfo, respectively.
The output layer names for this network must match the output names specified using
NextObservationMeanOutputNames and NextObservationStandardDeviationOutputNames.
The dimensions of the output layers must match the dimensions of the corresponding observation
specifications in ObservationInfo.
Name-Value Pair Arguments
The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.
The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.
Next observation mean output layer names, specified as a string or string array.
The number of next observation mean output names must match the length of ObservationInfo
and the order of the names must match the order of the specifications in ObservationInfo.
Next observation standard deviation output layer names, specified as a string or string array.
The number of next observation standard deviation output names must match the length of
ObservationInfo and the order of the names must match the order of the specifications in
ObservationInfo.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
PredictDiff — Option to predict the difference between the current observation and the
next observation
false (default) | true
Option to predict the difference between the current observation and the next observation, specified
as one of the following logical values.
• false — Select this option if net outputs the value of the next observation.
• true — Select this option if net outputs the difference between the next observation and the
current observation. In this case, the predict function computes the next observation by adding
the current observation to the output of net.
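As an informal sketch of the difference between the two settings (this is not the toolbox implementation; predDiff, obs, and dlNetOut are hypothetical names standing for the PredictDiff value, the current observation, and the raw network output):
% Sketch only: how PredictDiff changes the post-processing of the network output
if predDiff
    nextObs = obs + dlNetOut;   % network predicts the change in the observation
else
    nextObs = dlNetOut;         % network predicts the next observation directly
end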
Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Define the layers for the deep neural network. The network has two input channels, one for the
current observations and one for the current actions. The output of the network is the predicted
Gaussian distribution for each next observation. The two output channels correspond to the means
and standard deviations of these distributions.
statePath = featureInputLayer(obsInfo.Dimension(1),Name="obs");
actionPath = featureInputLayer(actInfo.Dimension(1),Name="act");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(32,Name="fc")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(32,Name="fc2")];
meanPath = [reluLayer(Name="nextObsMeanRelu")
fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObsMean")];
stdPath = [reluLayer(Name="nextObsStdRelu")
fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObsStdReluFull")
softplusLayer(Name="nextObsStd")];
tsnNet = layerGraph(statePath);
tsnNet = addLayers(tsnNet,actionPath);
tsnNet = addLayers(tsnNet,commonPath);
tsnNet = addLayers(tsnNet,meanPath);
tsnNet = addLayers(tsnNet,stdPath);
tsnNet = connectLayers(tsnNet,"obs","concat/in1");
tsnNet = connectLayers(tsnNet,"act","concat/in2");
tsnNet = connectLayers(tsnNet,"fc2","nextObsMeanRelu");
tsnNet = connectLayers(tsnNet,"fc2","nextObsStdRelu");
plot(tsnNet)
Convert the network to a dlnetwork object and create the transition function approximator object,
specifying the names of the network input and output layers.
tsnNet = dlnetwork(tsnNet);
tsnFcnAppx = rlContinuousGaussianTransitionFunction(...
tsnNet,obsInfo,actInfo,...
ObservationInputNames="obs", ...
ActionInputNames="act", ...
NextObservationMeanOutputNames="nextObsMean", ...
NextObservationStandardDeviationOutputNames="nextObsStd");
Using this transition function object, you can predict the next observation based on the current
observation and action. For example, predict the next observation for a random observation and
action. The next observation values are sampled from Gaussian distributions with the means and
standard deviations output by the transition network.
observation = rand(obsInfo.Dimension);
action = rand(actInfo.Dimension);
nextObs = predict(tsnFcnAppx,{observation},{action})
nextObs{1}
1.2414
0.7307
-0.5588
-0.9567
You can also obtain the mean value and standard deviation of the Gaussian distribution of the
predicted next observation using evaluate.
nextObsDist = evaluate(tsnFcnAppx,{observation,action})
Version History
Introduced in R2022a
See Also
Objects
rlContinuousDeterministicTransitionFunction | rlNeuralNetworkEnvironment
Topics
“Model-Based Policy Optimization Agents”
rlDDPGAgent
Deep deterministic policy gradient (DDPG) reinforcement learning agent
Description
The deep deterministic policy gradient (DDPG) algorithm is an actor-critic, model-free, online, off-
policy reinforcement learning method which computes an optimal policy that maximizes the long-
term reward. The action space can only be continuous.
For more information, see “Deep Deterministic Policy Gradient (DDPG) Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.
Creation
Syntax
agent = rlDDPGAgent(observationInfo,actionInfo)
agent = rlDDPGAgent(observationInfo,actionInfo,initOpts)
agent = rlDDPGAgent(actor,critic,agentOptions)
Description
Create Agent from Observation and Action Specifications
agent = rlDDPGAgent(observationInfo,actionInfo) creates a DDPG agent for an environment
with the given observation and action specifications, using default initialization options. The actor and
critic in the agent use default deep neural networks built from the observation specification
observationInfo and the action specification actionInfo.
agent = rlDDPGAgent( ___ ,agentOptions) creates a DDPG agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
Input Arguments
actor — Actor
rlContinuousDeterministicActor object
Actor that implements the policy, specified as an rlContinuousDeterministicActor object. For
more information on creating actors, see “Create Policies and Value Functions”.
critic — Critic
rlQValueFunction object
Critic, specified as an rlQValueFunction object. For more information on creating critics, see
“Create Policies and Value Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
Since a DDPG agent operates in a continuous action space, you must specify actionInfo as an
rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlNumericSpec.
If you create a DDPG agent with default actor and critic that use recurrent neural networks, the
default value of AgentOptions.SequenceLength is 32.
Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer. Here:
• S is the current observation of the environment.
• A is the action taken by the agent.
• R is the reward for taking action A.
• S' is the next observation after taking action A.
• D is the is-done signal after taking action A.
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force, applied to the mass,
ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a DDPG agent from the environment observation and action specifications.
agent = rlDDPGAgent(obsInfo,actInfo)
agent =
rlDDPGAgent with properties:
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension)})
You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.
rng(0)
Create a DDPG agent from the environment observation and action specifications.
agent = rlDDPGAgent(obsInfo,actInfo,initOpts);
Extract the deep neural networks from both the agent actor and critic.
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons.
criticNet.Layers
ans =
13x1 Layer array with layers:
plot(layerGraph(actorNet))
plot(layerGraph(criticNet))
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force ranging continuously from
-2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The actor and critic networks are initialized randomly. Ensure reproducibility by fixing the seed of the
random generator.
rng(0)
For DDPG agents, the critic estimates a Q-value function, therefore it must take both the observation
and action signals as inputs and return a scalar value.
To approximate the Q-value function within the critic, use a deep neural network. Define each
network path as an array of layer objects. Get the dimensions of the observation and action spaces
from the environment specification objects, and specify a name for the input layers, so you can later
explicitly associate them with the appropriate environment channel.
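The path-definition code itself does not survive here. A minimal sketch that is consistent with the connection commands and the network summary that follow (input layers named "netOin" and "netAin", a concatenation layer named "concat"; the hidden layer size is an arbitrary choice for illustration) is:
% Observation and action input paths (layer names assumed from the connection code below)
obsPath = featureInputLayer(prod(obsInfo.Dimension),Name="netOin");
actPath = featureInputLayer(prod(actInfo.Dimension),Name="netAin");

% Common path: concatenate both inputs and map them to a scalar Q-value
commonPath = [
    concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(32)
    reluLayer
    fullyConnectedLayer(1)
    ];

% Assemble the layer graph
cNet = layerGraph(obsPath);
cNet = addLayers(cNet,actPath);
cNet = addLayers(cNet,commonPath);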
% Connect paths
cNet = connectLayers(cNet,"netOin","concat/in1");
cNet = connectLayers(cNet,"netAin","concat/in2");
Convert the network to a dlnetwork object and summarize its properties.
cNet = dlnetwork(cNet);
summary(cNet)
Initialized: true
Inputs:
1 'netOin' 2 features
2 'netAin' 1 features
Create the critic using cNet, and the names of the input layers. DDPG agents use an
rlQValueFunction object to implement the critic.
critic = rlQValueFunction(cNet,obsInfo,actInfo,...
ObservationInputNames="netOin", ...
ActionInputNames="netAin");
getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.4260
To approximate the policy within the actor, use a neural network. For DDPG agents, the actor
executes a deterministic policy, which is implemented by a continuous deterministic actor. In this case
the network must take the observation signal as input and return an action. Therefore the output
layer must have as many elements as the number of dimensions of the action space.
% create a network to be used as underlying actor approximator
aNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))];
Convert the network to a dlnetwork object and summarize its properties.
aNet = dlnetwork(aNet);
summary(aNet)
Initialized: true
Inputs:
1 'input' 2 features
Create the actor using aNet. DDPG agents use an rlContinuousDeterministicActor object to
implement the actor.
actor = rlContinuousDeterministicActor(aNet,obsInfo,actInfo);
Create the DDPG agent using the actor and the critic.
agent = rlDDPGAgent(actor,critic);
Specify agent options, including training options for the actor and critic, using dot notation.
agent.AgentOptions.SampleTime=env.Ts;
agent.AgentOptions.TargetSmoothFactor=1e-3;
agent.AgentOptions.ExperienceBufferLength=1e6;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.MiniBatchSize=32;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate=1e-4;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold=1;
For this example, load the environment used in the example “Train DDPG Agent to Control Double
Integrator System”. The observation from the environment is a vector containing the position and
velocity of a mass. The action is a scalar representing a force ranging continuously from -2 to 2
Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For DDPG agents, the critic estimates a Q-value function, therefore it must take both the observation
and action signals as inputs and return a scalar value.
To approximate the Q-value function within the critic, use a recurrent neural network. Define each
network path as an array of layer objects. Get the dimensions of the observation spaces from the
environment specification object, and specify a name for the input layers, so you can later explicitly
associate them with the correct environment channel. To create a recurrent neural network, use
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.
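The path-definition code is again omitted here. A minimal sketch consistent with the connection commands and the network summary below (sequence input layers named "netOin" and "netAin", a concatenation layer named "cat"; the LSTM size is an arbitrary choice for illustration) is:
% Observation and action input paths (layer names assumed from the connection code below)
obsPath = sequenceInputLayer(prod(obsInfo.Dimension),Name="netOin");
actPath = sequenceInputLayer(prod(actInfo.Dimension),Name="netAin");

% Common path: concatenate, pass through an LSTM, and return a scalar Q-value
commonPath = [
    concatenationLayer(1,2,Name="cat")
    lstmLayer(16)
    reluLayer
    fullyConnectedLayer(1)
    ];

% Assemble the layer graph
cNet = layerGraph(obsPath);
cNet = addLayers(cNet,actPath);
cNet = addLayers(cNet,commonPath);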
% Connect paths
cNet = connectLayers(cNet,"netOin","cat/in1");
cNet = connectLayers(cNet,"netAin","cat/in2");
Convert the network to a dlnetwork object and summarize its properties.
cNet = dlnetwork(cNet);
summary(cNet)
Initialized: true
Inputs:
1 'netOin' Sequence input with 2 dimensions
2 'netAin' Sequence input with 1 dimensions
Create the critic using cNet, specifying the names of the input layers. DDPG agents use an
rlQValueFunction object to implement the critic.
critic = rlQValueFunction(cNet,obsInfo,actInfo,...
ObservationInputNames="netOin",ActionInputNames="netAin");
To check your critic, use getValue to return the value of a random observation and action, given the
current network weights.
getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.0074
Since the critic has a recurrent network, the actor must have a recurrent network too. For DDPG
agents, the actor executes a deterministic policy, which is implemented by a continuous deterministic
actor. In this case the network must take the observation signal as input and return an action.
Therefore the output layer must have as many elements as the number of dimensions of the action space.
Define the network as an array of layer objects, and get the dimensions of the observation and action
spaces from the environment specification objects.
aNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(10)
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension)) ];
Convert the network to a dlnetwork object and summarize its properties.
aNet = dlnetwork(aNet);
summary(aNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 2 dimensions
Create the actor using aNet. DDPG agents use an rlContinuousDeterministicActor object to
implement the actor.
actor = rlContinuousDeterministicActor(aNet,obsInfo,actInfo);
Specify optimizer options for the actor and the critic.
criticOpts = rlOptimizerOptions('LearnRate',5e-3,'GradientThreshold',1);
actorOpts = rlOptimizerOptions('LearnRate',1e-4,'GradientThreshold',1);
Specify agent options, including the optimizer options. To use a DDPG agent with recurrent neural
networks, you must specify a SequenceLength greater than 1.
agentOpts = rlDDPGAgentOptions(...
'SampleTime',env.Ts,...
'TargetSmoothFactor',1e-3,...
'ExperienceBufferLength',1e6,...
'DiscountFactor',0.99,...
'SequenceLength',20,...
'MiniBatchSize',32, ...
'CriticOptimizerOptions',criticOpts, ...
'ActorOptimizerOptions',actorOpts);
agent = rlDDPGAgent(actor,critic,agentOpts);
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Version History
Introduced in R2019a
See Also
rlAgentInitializationOptions | rlDDPGAgentOptions | rlQValueFunction |
rlContinuousDeterministicActor | Deep Network Designer
Topics
“Deep Deterministic Policy Gradient (DDPG) Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlDDPGAgentOptions
Options for DDPG agent
Description
Use an rlDDPGAgentOptions object to specify options for deep deterministic policy gradient
(DDPG) agents. To create a DDPG agent, use rlDDPGAgent.
For more information, see “Deep Deterministic Policy Gradient (DDPG) Agents”.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlDDPGAgentOptions
opt = rlDDPGAgentOptions(Name,Value)
Description
opt = rlDDPGAgentOptions creates an options object for use as an argument when creating a
DDPG agent using all default options. You can modify the object properties using dot notation.
opt = rlDDPGAgentOptions(Name,Value) creates the options set and sets its properties using
one or more name-value arguments.
Properties
NoiseOptions — Noise model options
OrnsteinUhlenbeckActionNoise object
For an agent with multiple actions, if the actions have different ranges and units, it is likely that each
action requires different noise model parameters. If the actions have similar ranges and units, you
can set the noise parameters for all actions to the same value.
For example, for an agent with two actions, set the standard deviation of each action to a different
value while using the same decay rate for both standard deviations.
opt = rlDDPGAgentOptions;
opt.NoiseOptions.StandardDeviation = [0.1 0.2];
opt.NoiseOptions.StandardDeviationDecayRate = 1e-4;
Smoothing factor for target actor and critic updates, specified as a positive scalar less than or equal
to 1. For more information, see “Target Update Methods”.
Number of steps between target actor and critic updates, specified as a positive integer. For more
information, see “Target Update Methods”.
Option for clearing the experience buffer before training, specified as a logical value.
Maximum batch-training trajectory length when using a recurrent neural network, specified as a
positive integer. This value must be greater than 1 when using a recurrent neural network and 1
otherwise.
Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the critic properties. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.
NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer
Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see [1], Chapter 7.
Note that if parallel training is enabled (that is if an rlTrainingOptions option object in which the
UseParallel property is set to true is passed to train) then NumStepsToLookAhead must be set
to 1, otherwise an error is generated. This guarantees that experiences are stored contiguously.
Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
Examples
Create an rlDDPGAgentOptions object that specifies the mini-batch size.
opt = rlDDPGAgentOptions('MiniBatchSize',48)
opt =
rlDDPGAgentOptions with properties:
TargetUpdateFrequency: 1
ResetExperienceBufferBeforeTraining: 1
SequenceLength: 1
MiniBatchSize: 48
NumStepsToLookAhead: 1
ExperienceBufferLength: 10000
SampleTime: 1
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Algorithms
Noise Model
At each sample time step k, the noise value v(k) is updated using the following formula, where Ts is
the agent sample time, and the initial value v(1) is defined by the InitialAction parameter.
v(k+1) = v(k) + MeanAttractionConstant.*(Mean - v(k)).*Ts + StandardDeviation(k).*randn(size(Mean)).*sqrt(Ts)
At each sample time step, the standard deviation decays as shown in the following code.
decayedStandardDeviation = CurrentStandardDeviation.*(1-StandardDeviationDecayRate);
CurrentStandardDeviation = max(decayedStandardDeviation,StandardDeviationMin);
You can calculate how many samples it will take for the standard deviation to be halved using this
simple formula.
halflife = log(0.5)/log(1-StandardDeviationDecayRate);
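For example, with a StandardDeviationDecayRate of 1e-4 (an illustrative value, not a default), log(0.5)/log(1-1e-4) is approximately 6931, so the standard deviation halves after roughly 6931 samples.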
For continuous action signals, it is important to set the noise standard deviation appropriately to
encourage exploration. It is common to set StandardDeviation*sqrt(Ts) to a value between 1%
and 10% of your action range.
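As a rough illustration with assumed numbers (not taken from a specific example in this page): for an action that ranges from -2 to 2 (a range of 4) and a sample time of 0.1 s, targeting 5% of the range gives
% Illustrative sizing only; assign the result to opt.NoiseOptions.StandardDeviation
sigma = 0.05*4/sqrt(0.1)   % approximately 0.63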
If your agent converges on local optima too quickly, promote agent exploration by increasing the
amount of noise; that is, by increasing the standard deviation. Also, to increase exploration, you can
reduce the StandardDeviationDecayRate.
Version History
Introduced in R2019a
The properties defining the probability distribution of the Ornstein-Uhlenbeck (OU) noise model have
been renamed. DDPG agents use OU noise for exploration.
The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the OU noise model, use the new property
names instead.
Update Code
This table shows how to update your code to use the new property names for rlDDPGAgentOptions
object ddpgopt.
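The table itself does not survive here. A hedged sketch of the kind of substitution it describes, using the relationship StandardDeviation = sqrt(Variance), is:
% Not recommended
% ddpgopt.NoiseOptions.Variance = 0.01;
% Recommended
ddpgopt.NoiseOptions.StandardDeviation = sqrt(0.01);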
Target update method settings for DDPG agents have changed. The following changes require
updates to your code:
• The TargetUpdateMethod option has been removed. Now, DDPG agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.
To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.
The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
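As an illustration of how these two properties combine (the values here are examples, not defaults), a purely periodic update copies the actor and critic parameters to their targets every few learning steps:
opt = rlDDPGAgentOptions;
opt.TargetUpdateFrequency = 4;   % update the targets every 4 learning steps
opt.TargetSmoothFactor = 1;      % copy the parameters directly, with no smoothing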
Update Code
This table shows some typical uses of rlDDPGAgentOptions and how to update your code to use the
new option configuration.
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.
See Also
Topics
“Deep Deterministic Policy Gradient (DDPG) Agents”
rlDeterministicActorPolicy
Policy object to generate continuous deterministic actions for custom training loops and application
deployment
Description
This object implements a deterministic policy, which returns continuous deterministic actions given
an input observation. You can create an rlDeterministicActorPolicy object from an
rlContinuousDeterministicActor or extract it from an rlDDPGAgent or rlTD3Agent. You can
then train the policy object using a custom training loop or deploy it for your application using
generatePolicyBlock or generatePolicyFunction. This policy is always deterministic and
does not perform any exploration. For more information on policies and value functions, see “Create
Policies and Value Functions”.
Creation
Syntax
policy = rlDeterministicActorPolicy(actor)
Description
policy = rlDeterministicActorPolicy(actor) creates the deterministic actor policy object
policy from the continuous deterministic actor actor. It also sets the Actor property of the policy
object to the input argument actor.
Properties
Actor — Continuous deterministic actor
rlContinuousDeterministicActor object
Continuous deterministic actor, specified as an rlContinuousDeterministicActor object.
Action specifications, specified as an rlNumericSpec object. This object defines the properties of the
environment action channel, such as its dimensions, data type, and name. Note that the name of the
action channel specified in actionInfo (if any) is not used.
Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.
Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2
Object Functions
generatePolicyBlock Generate Simulink block that evaluates policy of an agent or policy object
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
Examples
Create observation and action specification objects. For this example, define the observation and
action spaces as continuous four- and two-dimensional spaces, respectively.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);
Create a continuous deterministic actor. This actor must accept an observation as input and return an
action as output.
To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation and action spaces from
the environment specification objects.
layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(actInfo.Dimension(1))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor using model, and the observation and action specifications.
actor = rlContinuousDeterministicActor(model,obsInfo,actInfo)
actor =
rlContinuousDeterministicActor with properties:
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
0.4013
0.0578
Create a policy object from actor.
policy = rlDeterministicActorPolicy(actor)
policy =
rlDeterministicActorPolicy with properties:
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = 2×1
0.4313
-0.3002
You can now train the policy with a custom training loop and then deploy it to your application.
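For example, a minimal call that generates a deployable policy evaluation function from the policy object (using the default file names chosen by the toolbox) is:
generatePolicyFunction(policy);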
Version History
Introduced in R2022a
See Also
Functions
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlAdditiveNoisePolicy |
rlStochasticActorPolicy | rlTD3Agent | rlDDPGAgent | generatePolicyBlock |
generatePolicyFunction
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”
rlDeterministicActorRepresentation
(Not recommended) Deterministic actor representation for reinforcement learning agents
Description
This object implements a function approximator to be used as a deterministic actor within a
reinforcement learning agent with a continuous action space. A deterministic actor takes
observations as inputs and returns as outputs the action that maximizes the expected cumulative
long-term reward, thereby implementing a deterministic policy. After you create an
rlDeterministicActorRepresentation object, use it to create a suitable agent, such as an
rlDDPGAgent agent. For more information on creating representations, see “Create Policies and
Value Functions”.
Creation
Syntax
actor = rlDeterministicActorRepresentation(net,observationInfo,
actionInfo,'Observation',obsName,'Action',actName)
actor = rlDeterministicActorRepresentation({basisFcn,W0},observationInfo,
actionInfo)
actor = rlDeterministicActorRepresentation( ___ ,options)
Description
actor = rlDeterministicActorRepresentation(net,observationInfo,
actionInfo,'Observation',obsName,'Action',actName) creates a deterministic actor using
the deep neural network net as approximator. This syntax sets the ObservationInfo and ActionInfo
properties of actor to the inputs observationInfo and actionInfo, containing the specifications
for observations and actions, respectively. actionInfo must specify a continuous action space,
discrete action spaces are not supported. obsName must contain the names of the input layers of net
that are associated with the observation specifications. The action names actName must be the
names of the output layers of net that are associated with the action specifications.
actor = rlDeterministicActorRepresentation({basisFcn,W0},observationInfo,
actionInfo) creates a deterministic actor using a custom basis function as underlying
approximator. The first input argument is a two-elements cell in which the first element contains the
handle basisFcn to a custom basis function, and the second element contains the initial weight
matrix W0. This syntax sets the ObservationInfo and ActionInfo properties of actor respectively to
the inputs observationInfo and actionInfo.
This syntax sets the Options property of actor to the options input argument. You can use this
syntax with any of the previous input-argument combinations.
Input Arguments
Deep neural network used as the underlying approximator within the actor, specified as one of the
following:
The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names listed in obsName.
The network output layer must have the same data type and dimension as the signal defined in
ActionInfo. Its name must be the action name specified in actName.
For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the input layers in net.
Example: {'my_obs'}
Action name, specified as a single-element cell array that contains a character vector. It must be the
name of the output layer of net.
Example: {'my_act'}
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The action
to be taken based on the current observation, which is the output of the actor, is the vector a =
W'*B, where W is a weight matrix containing the learnable parameters and B is the column vector
returned by the custom basis function.
When creating a deterministic actor representation, your basis function must have the following
signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here obs1 to obsN are observations in the same order and with the same data type and dimensions
as the signals defined in observationInfo
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]
Initial value of the basis function weights, W, specified as a matrix having as many rows as the length
of the vector returned by the basis function and as many columns as the dimension of the action
space.
Properties
Options — Representation options
rlRepresentationOptions object
Action specifications for a continuous action space, specified as an rlNumericSpec object defining
properties such as dimensions, data type and name of the action signals. The deterministic actor
representation does not support discrete actions.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually.
For custom basis function representations, the action signal must be a scalar or a column vector.
Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment observations
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
actInfo = rlNumericSpec([2 1]);
Create a deep neural network approximator for the actor. The input of the network (here called
myobs) must accept a four-element vector (the observation vector just defined by obsInfo), and its
output must be the action (here called myact) and be a two-element vector, as defined by actInfo.
net = [featureInputLayer(4,'Normalization','none','Name','myobs')
fullyConnectedLayer(2,'Name','myact')];
Create the actor with rlDeterministicActorRepresentation, using the network, the observation
and action specification objects, as well as the names of the network input and output layers.
actor = rlDeterministicActorRepresentation(net,obsInfo,actInfo, ...
'Observation',{'myobs'},'Action',{'myact'})
actor =
rlDeterministicActorRepresentation with properties:
To check your actor, use getAction to return the action from a random observation, using the
current network weights.
act = getAction(actor,{rand(4,1)}); act{1}
-0.5054
1.5390
You can now use the actor to create a suitable agent (such as an rlACAgent, rlPGAgent, or
rlDDPGAgent agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles.
obsInfo = rlNumericSpec([3 1]);
The deterministic actor does not support discrete action spaces. Therefore, create a continuous
action space specification object (or alternatively use getActionInfo to extract the specification
object from an environment). For this example, define the action space as a continuous two-
dimensional space, so that a single action is a column vector containing 2 doubles.
actInfo = rlNumericSpec([2 1]);
Create a custom basis function. Each element is a function of the observations defined by obsInfo.
The output of the actor is the vector W'*myBasisFcn(myobs), which is the action taken as a result
of the given observation. The weight matrix W contains the learnable parameters and must have as
many rows as the length of the basis function output and as many columns as the dimension of the
action space.
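The definition of the basis function itself does not survive here. One possible sketch, consistent with W0 being 4-by-2 (four basis outputs, two action dimensions) and with a three-element observation, is the following hypothetical choice:
% Hypothetical basis function: four features of the three-element observation
myBasisFcn = @(myobs) [myobs(1); myobs(2); myobs(3); myobs(1)*myobs(2)];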
W0 = rand(4,2);
Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight matrix. The second and third arguments are, respectively, the
observation and action specification objects.
actor = rlDeterministicActorRepresentation({myBasisFcn,W0},obsInfo,actInfo)
actor =
rlDeterministicActorRepresentation with properties:
To check your actor, use the getAction function to return the action from a given observation, using
the current parameter matrix.
a = getAction(actor,{[1 2 3]'});
a{1}
ans =
2x1 dlarray
2.0595
2.3788
You can now use the actor (along with a critic) to create a suitable continuous action space agent.
Create observation and action information. You can also obtain these specifications from an
environment.
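The specification code is not shown here; one consistent possibility (the dimensions are assumed purely for illustration) is:
% Assumed observation and action dimensions for this sketch
obsinfo = rlNumericSpec([4 1]);
actinfo = rlNumericSpec([2 1]);
numObs = obsinfo.Dimension(1);
numAct = actinfo.Dimension(1);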
Create a recurrent deep neural network for the actor. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.
net = [sequenceInputLayer(numObs,'Normalization','none','Name','state')
fullyConnectedLayer(10,'Name','fc1')
reluLayer('Name','relu1')
lstmLayer(8,'OutputMode','sequence','Name','ActorLSTM')
fullyConnectedLayer(20,'Name','CriticStateFC2')
fullyConnectedLayer(numAct,'Name','action')
tanhLayer('Name','tanh1')];
actorOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(net,obsinfo,actinfo,...
'Observation',{'state'},'Action',{'tanh1'},actorOptions);
Version History
Introduced in R2020a
rlDeterministicActorRepresentation is not recommended. Use rlContinuousDeterministicActor
instead.
The following table shows some typical uses of rlDeterministicActorRepresentation, and how
to update your code to use rlContinuousDeterministicActor instead. The first table entry uses a
neural network, the second one uses a basis function.
rlDeterministicActorRepresentation (not recommended):
myActor = rlDeterministicActorRepresentation(net,obsInfo,actInfo,'Observation',obsNames,'Action',actNames), with actInfo defining a continuous action space and net having observations as inputs and a single output layer with as many elements as the number of dimensions of the continuous action space.
rlContinuousDeterministicActor (recommended):
myActor = rlContinuousDeterministicActor(net,obsInfo,actInfo,'ObservationInputNames',obsNames). Use this syntax to create a deterministic actor object with a continuous action space.

rlDeterministicActorRepresentation (not recommended):
rep = rlDeterministicActorRepresentation({basisFcn,W0},obsInfo,actInfo), where the basis function has observations as inputs and actions as outputs, W0 is a matrix with as many columns as the number of dimensions of the action space, and actInfo defines a continuous action space.
rlContinuousDeterministicActor (recommended):
rep = rlContinuousDeterministicActor({basisFcn,W0},obsInfo,actInfo). Use this syntax to create a deterministic actor object with a continuous action space.
See Also
Functions
rlContinuousDeterministicActor | rlRepresentationOptions | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlDiscreteCategoricalActor
Stochastic categorical actor with a discrete action space for reinforcement learning agents
Description
This object implements a function approximator to be used as a stochastic actor within a
reinforcement learning agent with a discrete action space. A discrete categorical actor takes an
environment state as input and returns as output a random action sampled from a categorical (also
known as Multinoulli) probability distribution of the expected cumulative long term reward, thereby
implementing a stochastic policy. After you create an rlDiscreteCategoricalActor object, use it
to create a suitable agent, such as rlACAgent or rlPGAgent. For more information on creating
representations, see “Create Policies and Value Functions”.
Creation
Syntax
actor = rlDiscreteCategoricalActor(net,observationInfo,actionInfo)
actor = rlDiscreteCategoricalActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames)
actor = rlDiscreteCategoricalActor({basisFcn,W0},observationInfo,actionInfo)
Description
actor = rlDiscreteCategoricalActor(net,observationInfo,actionInfo) creates a
stochastic actor with a discrete action space, using the deep neural network net as underlying
approximator. This syntax sets the ObservationInfo and ActionInfo properties of actor to the
inputs observationInfo and actionInfo, respectively.
Note actor does not enforce constraints set by the action specification; therefore, when using this
actor, you must enforce action space constraints within the environment.
actor = rlDiscreteCategoricalActor(net,observationInfo,
actionInfo,ObservationInputNames=netObsNames) specifies the names of the network input
layers to be associated with the environment observation channels. The function assigns, in
sequential order, each environment observation channel specified in observationInfo to the layer
specified by the corresponding name in the string array netObsNames. Therefore, the network input
layers, ordered as the names in netObsNames, must have the same data type and dimensions as the
observation specifications, as ordered in observationInfo.
actor = rlDiscreteCategoricalActor({basisFcn,W0},observationInfo,actionInfo)
creates a discrete space stochastic actor using a custom basis function as underlying approximator.
The first input argument is a two-element cell array whose first element is the handle basisFcn to a
custom basis function and whose second element is the initial weight matrix W0. This function sets
the ObservationInfo and ActionInfo properties of actor to the inputs observationInfo and
actionInfo, respectively.
Input Arguments
Deep neural network used as the underlying approximator within the actor, specified as one of the
following:
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.
The network must have the environment observation channels as inputs and a single output layer
with as many elements as the number of possible discrete actions. Since the output of the network
must represent the probability of executing each possible action, the software automatically adds a
softmaxLayer as a final output layer if you do not specify it explicitly. When computing the action,
the actor then randomly samples the distribution to return an action.
The learnable parameters of the actor are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Network input layers names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. When you use this argument after
'ObservationInputNames', the function assigns, in sequential order, each environment
observation channel specified in observationInfo to each network input layer specified by the
corresponding name in the string array netObsNames. Therefore, the network input layers, ordered
as the names in netObsNames, must have the same data type and dimensions as the observation
specifications, as ordered in observationInfo.
Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: {"NetInput1_airspeed","NetInput2_altitude"}
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The number
of the action to be taken based on the current observation, which is the output of the actor, is
randomly sampled from a categorical distribution with probabilities p = softmax(W'*B), where W
is a weight matrix containing the learnable parameters and B is the column vector returned by the
custom basis function. Each element of p represents the probability of executing the corresponding
action from the observed state.
When creating the actor, your basis function must have the following signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo.
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]
Initial value of the basis function weights W, specified as a matrix having as many rows as the length
of the vector returned by the basis function and as many columns as the dimension of the action
space.
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the function
does not use the name of the action channel specified in actionInfo.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment
observations
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as consisting of
three values, -10, 0, and 10.
actInfo = rlFiniteSetSpec([-10 0 10]);
To approximate the policy within the actor, use a deep neural network.
The input of the network must accept a four-element vector (the observation vector just defined by
obsInfo), and its output must be a three-element vector. Each element of the output vector must be
between 0 and 1 since it represents the probability of executing each of the three possible actions (as
defined by actInfo). Using softmax as the output layer enforces this requirement (the software
automatically adds a softmaxLayer as a final output layer if you do not specify it explicitly). When
computing the action, the actor then randomly samples the distribution to return an action.
Create the network as an array of layer objects.
net = [ featureInputLayer(4)
fullyConnectedLayer(3)
softmaxLayer ];
Convert the network to a dlnetwork object and display the number of learnable parameters.
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 15
Inputs:
1 'input' 4 features
Create the actor with rlDiscreteCategoricalActor, using the network, the observations and
action specification objects. When the network has multiple input layers, they are automatically
associated with the environment observation channels according to the dimension specifications in
obsInfo.
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);
To check your actor, use getAction to return an action from a random observation vector, given the
current network weights.
act = getAction(actor,{rand(obsInfo.Dimension)});
act
You can now use the actor to create a suitable agent, such as rlACAgent, or rlPGAgent.
Create Discrete Categorical Actor from Deep Neural Network Specifying Input Layer Name
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as consisting of
three values, -10, 0, and 10.
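As in the previous example, the specification objects can be created as follows:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-10 0 10]);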
The input of the network (here called netOin) must accept a four-element vector (the observation
vector just defined by obsInfo), and its output (here called actionProb) must be a three-element
vector. Each element of the output vector must be between 0 and 1 since it represents the probability
of executing each of the three possible actions (as defined by actInfo). Using softmax as the output
layer enforces this requirement (however, the software automatically adds a softmaxLayer as a final
output layer if you do not specify it explicitly). When computing the action, the actor then randomly
samples the distribution to return an action.
Create a network as an array of layer objects. Specify a name for the input layer, so you can later
explicitly associate it with the observation channel.
net = [ featureInputLayer(4,Name="netOin")
fullyConnectedLayer(3)
softmaxLayer(Name="actionProb") ];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 15
Inputs:
1 'netOin' 4 features
Create the actor with rlDiscreteCategoricalActor, using the network, the observation and action specification objects, and the name of the network input layer.
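The creation call itself is not shown above. Following the pattern used for the recurrent actor later on this page, it presumably looks like this:

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo,Observation="netOin");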
To validate your actor, use getAction to return an action from a random observation, given the
current network weights.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
ans = 0
To return the probability distribution of the possible actions as a function of a random observation,
and given the current network weights, use evaluate.
prb = evaluate(actor,{rand(obsInfo.Dimension)})
prb{1}
0.5606
0.2619
0.1776
You can now use the actor to create a suitable agent, such as rlACAgent, or rlPGAgent.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels, the first being a two-dimensional vector in a continuous space, the second
being a two-dimensional vector that can assume only three values, -[1 1], [0 0], and [1 1]. Therefore a
single observation consists of two two-dimensional vectors, one continuous, the other discrete.
Create a discrete action space specification object (or alternatively use getActionInfo to extract
the specification object from an environment with a discrete action space). For this example, define
the action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).
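The specification objects do not appear above. A sketch consistent with the description (the channel order and element values are taken from the text) is:

obsInfo = [rlNumericSpec([2 1]) ...
           rlFiniteSetSpec({-[1 1],[0 0],[1 1]})];
actInfo = rlFiniteSetSpec([7 5 3]);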
Create a custom basis function. Each element is a function of the observation defined by obsInfo.
The output of the actor is the action, among the ones defined in actInfo, corresponding to the
element of softmax(W'*myBasisFcn(obsC,obsD)) which has the highest value. W is a weight
matrix, containing the learnable parameters, which must have as many rows as the length of the
basis function output, and as many columns as the number of possible actions.
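The basis function itself is omitted above. Any function handle that maps the two observation channels to a four-element column vector works with the weight matrix defined below; the one here is only a hypothetical placeholder:

myBasisFcn = @(obsC,obsD) [obsC(1); obsC(2); obsD(1); obsD(2)];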
W0 = rand(4,3);
Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.
actor = rlDiscreteCategoricalActor({myBasisFcn,W0},obsInfo,actInfo);
To check your actor, use the getAction function to return one of the three possible actions, depending on a given random observation and on the current parameter matrix.
getAction(actor,{rand(2,1),[1 1]})
getAction(actor,{rand(2,1),[0.5 -0.7]})
You can now use the actor (along with a critic) to create a suitable discrete action space agent (such as rlACAgent, rlPGAgent, or rlPPOAgent).
This example shows you how to create a stochastic actor with a discrete action space using a
recurrent neural network. You can also use a recurrent neural network for a continuous stochastic
actor.
For this example, use the same environment used in “Train PG Agent to Balance Cart-Pole System”.
Load the environment and obtain the observation and action specifications.
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the policy within the actor, use a recurrent deep neural network.
Create a neural network as an array of layer objects. To create a recurrent network, use a
sequenceInputLayer as the input layer (with size equal to the number of dimensions of the
observation channel) and include at least one lstmLayer.
Specify a name for the input layer, so you can later explicitly associate it with the observation
channel.
net = [
sequenceInputLayer( ...
prod(obsInfo.Dimension), ...
Name="netOin")
fullyConnectedLayer(8)
reluLayer
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer( ...
numel(actInfo.Elements)) ];
Convert the network to a dlnetwork object and display the number of learnable parameters
(weights).
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'netOin' Sequence input with 4 dimensions
Create a discrete categorical actor using the network, the environment specifications, and the name
of the network input layer to be associated with the observation channel.
actor = rlDiscreteCategoricalActor(net, ...
obsInfo,actInfo,...
Observation="netOin");
To check your actor, use getAction to return one of the two possible actions, depending on a given random observation and on the current network weights.
act = getAction(actor,{rand(obsInfo.Dimension)});
act{1}
ans = -10
To return the probability of each of the two possible actions, use evaluate. Note that the type of the returned numbers is single, not double.
prob = evaluate(actor,{rand(obsInfo.Dimension)});
prob{1}
0.4704
0.5296
You can use getState and setState to extract and set the current state of the recurrent neural
network in the actor.
getState(actor)
To evaluate the actor using sequential observations, use the sequence length (time) dimension. For example, obtain actions for 5 independent sequences, each consisting of 9 sequential observations.
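The batched call does not appear above. A sketch of how it might look, using the third array dimension for the batch and the fourth for time:

[action,state] = getAction(actor, ...
    {rand([obsInfo.Dimension 5 9])});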
Display the action corresponding to the seventh element of the observation sequence in the fourth
sequence.
action = action{1};
action(1,1,4,7)
ans = 10
state
For more information on input and output format for recurrent neural networks, see the Algorithms
section of lstmLayer.
You can now use the actor (along with a critic) to create a suitable discrete action space agent (such as rlACAgent, rlPGAgent, or rlPPOAgent).
Version History
Introduced in R2022a
See Also
Functions
rlContinuousDeterministicActor | rlContinuousGaussianActor | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlDQNAgent
Deep Q-network (DQN) reinforcement learning agent
Description
The deep Q-network (DQN) algorithm is a model-free, online, off-policy reinforcement learning
method. A DQN agent is a value-based reinforcement learning agent that trains a critic to estimate
the return or future rewards. DQN is a variant of Q-learning, and it operates only within discrete
action spaces.
For more information, see “Deep Q-Network (DQN) Agents”. For more information on the different types of reinforcement learning agents, see “Reinforcement Learning Agents”.
Creation
Syntax
agent = rlDQNAgent(observationInfo,actionInfo)
agent = rlDQNAgent(observationInfo,actionInfo,initOpts)
agent = rlDQNAgent(critic)
agent = rlDQNAgent(critic,agentOptions)
Description
Create Agent from Observation and Action Specifications
agent = rlDQNAgent(observationInfo,actionInfo) creates a DQN agent for an environment with the given observation and action specifications, using default initialization options. agent = rlDQNAgent(observationInfo,actionInfo,initOpts) additionally uses the specified agent initialization options, initOpts.
Create Agent from Existing Critic
agent = rlDQNAgent(critic) creates a DQN agent with the specified critic network using a default option set for a DQN agent.
Specify Agent Options
agent = rlDQNAgent(critic,agentOptions) creates a DQN agent with the specified critic network and sets the AgentOptions property to the agentOptions input argument.
Input Arguments
critic — Critic
rlQValueFunction object | rlVectorQValueFunction object
Your critic can use a recurrent neural network as its function approximator. However, only
rlVectorQValueFunction supports recurrent neural networks. For an example, see “Create DQN
Agent with Recurrent Neural Network” on page 3-136.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying a critic object, the value of ObservationInfo matches the
value specified in critic.
Since a DQN agent operates in a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.
If you create the agent by specifying a critic object, the value of ActionInfo matches the value
specified in critic.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec.
If you create a DQN agent with a default critic that uses a recurrent neural network, the default value
of AgentOptions.SequenceLength is 32.
Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer. Here:
Option to use exploration policy when selecting actions, specified as one of the following logical values.
• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).
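The environment loading code is not shown above. Assuming the predefined pendulum-with-image environment used in that example, it might look like this:

env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);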
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator.
rng(0)
Create a deep Q-network agent from the environment observation and action specifications.
agent = rlDQNAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a deep Q-network agent from the environment observation and action specifications, using the initialization options.
agent = rlDQNAgent(obsInfo,actInfo,initOpts);
criticNet = getModel(getCritic(agent));
The default DQN agent uses a multi-output Q-value critic approximator. A multi-output approximator
has observations as inputs and state-action values as outputs. Each output element represents the
expected cumulative long-term reward for taking the corresponding discrete action from the state
indicated by the observation inputs.
Display the layers of the critic network, and verify that each hidden fully connected layer has 128 neurons.
criticNet.Layers
ans =
11x1 Layer array with layers:
plot(layerGraph(criticNet))
To check your agent, use getAction to return the action from random observations.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment interface and obtain its observation and action specifications. For this
example load the predefined environment used for the “Train DQN Agent to Balance Cart-Pole
System” example. This environment has a continuous four-dimensional observation space (the
positions and velocities of both cart and pole) and a discrete one-dimensional action space consisting
of the application of two possible forces, -10N or 10N.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the Q-value function within the critic, use a deep neural network. For DQN agents
with a discrete action space, you have the option to create a multi-output Q-value function critic,
which is generally more efficient than a comparable single-output critic.
A network for this critic must take only the observation as input and return a vector of values for
each action. Therefore, it must have an input layer with as many elements as the dimension of the
observation space and an output layer having as many elements as the number of possible discrete
actions. Each output element represents the expected cumulative long-term reward following from
the observation given as input, when the corresponding action is taken.
Define the network as an array of layer objects, and get the dimensions of the observation space (that
is, prod(obsInfo.Dimension)) and the number of possible actions (that is,
numel(actInfo.Elements)) directly from the environment specification objects.
dnn = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
dnn = dlnetwork(dnn);
summary(dnn)
Initialized: true
Inputs:
1 'input' 4 features
Create the critic using rlVectorQValueFunction, the network dnn, and the observation and
action specifications.
critic = rlVectorQValueFunction(dnn,obsInfo,actInfo);
getValue(critic,{rand(obsInfo.Dimension)})
-0.0361
0.0913
agent = rlDQNAgent(critic)
agent =
rlDQNAgent with properties:
agent.AgentOptions.UseDoubleDQN=false;
agent.AgentOptions.TargetUpdateMethod="periodic";
agent.AgentOptions.TargetUpdateFrequency=4;
agent.AgentOptions.ExperienceBufferLength=100000;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.MiniBatchSize=256;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=1e-2;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Create an environment interface and obtain its observation and action specifications. For this
example load the predefined environment used for the “Train DQN Agent to Balance Cart-Pole
System” example. This environment has a continuous four-dimensional observation space (the
positions and velocities of both cart and pole) and a discrete one-dimensional action space consisting
of the application of two possible forces, -10N or 10N.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a deep neural network to be used as an approximation model within the critic. For DQN agents,
you have the option to create a multi-output Q-value function critic, which is generally more efficient
than a comparable single-output critic. However, for this example, create a single-output Q-value
function critic instead.
The network for this critic must have two input layers, one for the observation and the other for the
action, and return a scalar value representing the expected cumulative long-term reward following
from the given observation and action.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel.
% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netOin")
fullyConnectedLayer(24)
reluLayer
fullyConnectedLayer(24,Name="fcObsPath")];
% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="netAin")
fullyConnectedLayer(24,Name="fcActPath")];
% Connect layers
net = connectLayers(net,'fcObsPath','cat/in1');
net = connectLayers(net,'fcActPath','cat/in2');
% Plot network
plot(net)
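Presumably the graph is then converted to a dlnetwork object and summarized, producing the output below:

net = dlnetwork(net);
summary(net)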
Initialized: true
Inputs:
1 'netOin' 4 features
2 'netAin' 1 features
Create the critic using rlQValueFunction. Specify the names of the layers to be associated with the
observation and action channels.
critic = rlQValueFunction(net, ...
obsInfo, ...
actInfo, ...
ObservationInputNames="netOin", ...
ActionInputNames="netAin");
ans = single
-0.0232
agent = rlDQNAgent(critic)
agent =
rlDQNAgent with properties:
agent.AgentOptions.UseDoubleDQN=false;
agent.AgentOptions.TargetUpdateMethod="periodic";
agent.AgentOptions.TargetUpdateFrequency=4;
agent.AgentOptions.ExperienceBufferLength=100000;
agent.AgentOptions.DiscountFactor=0.99;
agent.AgentOptions.MiniBatchSize=256;
agent.AgentOptions.CriticOptimizerOptions.LearnRate=1e-2;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold=1;
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
For this example load the predefined environment used for the “Train DQN Agent to Balance Cart-
Pole System” example. This environment has a continuous four-dimensional observation space (the
positions and velocities of both cart and pole) and a discrete one-dimensional action space consisting
of the application of two possible forces, -10N or 10N.
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the Q-value function within the critic, use a recurrent deep neural network. For DQN
agents, only the vector function approximator, rlVectorQValueFunction, supports recurrent
neural networks models. For vector Q-value function critics, the number of elements of the output
layer has to be equal to the number of possible actions: numel(actInfo.Elements).
Define the network as an array of layer objects. Get the dimensions of the observation space from the
environment specification object (prod(obsInfo.Dimension)). To create a recurrent neural
network, use a sequenceInputLayer as the input layer and include an lstmLayer as one of the
other network layers.
net = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(50)
reluLayer
lstmLayer(20,OutputMode="sequence");
fullyConnectedLayer(20)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
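The conversion step does not appear above; presumably:

net = dlnetwork(net);
summary(net)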
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
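The critic creation and the check that produces the two values below are also not shown. A sketch:

critic = rlVectorQValueFunction(net,obsInfo,actInfo);
getValue(critic,{rand(obsInfo.Dimension)})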
0.0136
0.0067
Specify options for creating the DQN agent. To use a recurrent neural network, you must specify a
SequenceLength greater than 1.
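The options below reference a criticOptions variable that is not defined above. A plausible definition (the specific learning rate and gradient threshold are assumptions) is:

criticOptions = rlOptimizerOptions( ...
    LearnRate=1e-3, ...
    GradientThreshold=1);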
agentOptions = rlDQNAgentOptions(...
UseDoubleDQN=false, ...
TargetSmoothFactor=5e-3, ...
ExperienceBufferLength=1e6, ...
SequenceLength=32, ...
CriticOptimizerOptions=criticOptions);
agentOptions.EpsilonGreedyExploration.EpsilonDecay = 1e-4;
Create the agent. The actor and critic networks are initialized randomly.
agent = rlDQNAgent(critic,agentOptions);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Version History
Introduced in R2019a
See Also
rlAgentInitializationOptions | rlDQNAgentOptions | rlVectorQValueFunction |
rlQValueFunction | Deep Network Designer
Topics
“Deep Q-Network (DQN) Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlDQNAgentOptions
Options for DQN agent
Description
Use an rlDQNAgentOptions object to specify options for deep Q-network (DQN) agents. To create a
DQN agent, use rlDQNAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlDQNAgentOptions
opt = rlDQNAgentOptions(Name,Value)
Description
opt = rlDQNAgentOptions creates an options object for use as an argument when creating a DQN
agent using all default settings. You can modify the object properties using dot notation.
Properties
UseDoubleDQN — Flag for using double DQN
true (default) | false
Flag for using double DQN for value function target updates, specified as a logical value. For most applications, set UseDoubleDQN to true. For more information, see “Deep Q-Network (DQN) Agents”.
At the end of each training time step, if Epsilon is greater than EpsilonMin, then it is updated
using the following formula.
Epsilon = Epsilon*(1-EpsilonDecay)
If your agent converges on local optima too quickly, you can promote agent exploration by increasing
Epsilon.
To specify exploration options, use dot notation after creating the rlDQNAgentOptions object opt.
For example, set the epsilon value to 0.9.
opt.EpsilonGreedyExploration.Epsilon = 0.9;
Smoothing factor for target critic updates, specified as a positive scalar less than or equal to 1. For
more information, see “Target Update Methods”.
Number of steps between target critic updates, specified as a positive integer. For more information,
see “Target Update Methods”.
Option for clearing the experience buffer before training, specified as a logical value.
Maximum batch-training trajectory length when using a recurrent neural network for the critic,
specified as a positive integer. This value must be greater than 1 when using a recurrent neural
network for the critic and 1 otherwise.
Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the critic properties. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.
When using a recurrent neural network for the critic, MiniBatchSize is the number of experience
trajectories in a batch, where each trajectory has length equal to SequenceLength.
NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer
Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see chapter 7 of [1].
N-step Q learning is not supported when using a recurrent neural network for the critic. In this case,
NumStepsToLookAhead must be 1.
Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
Examples
opt = rlDQNAgentOptions('MiniBatchSize',48)
opt =
rlDQNAgentOptions with properties:
UseDoubleDQN: 1
EpsilonGreedyExploration: [1x1 rl.option.EpsilonGreedyExploration]
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
TargetSmoothFactor: 1.0000e-03
TargetUpdateFrequency: 1
ResetExperienceBufferBeforeTraining: 1
SequenceLength: 1
MiniBatchSize: 48
NumStepsToLookAhead: 1
ExperienceBufferLength: 10000
SampleTime: 1
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Version History
Introduced in R2019a
Target update method settings for DQN agents have changed. The following changes require updates
to your code:
• The TargetUpdateMethod option has been removed. Now, DQN agents determine the target
update method based on the TargetUpdateFrequency and TargetSmoothFactor option
values.
• The default value of TargetUpdateFrequency has changed from 4 to 1.
To use one of the following target update methods, set the TargetUpdateFrequency and
TargetSmoothFactor properties as indicated.
The default target update configuration, which is a smoothing update with a TargetSmoothFactor
value of 0.001, remains the same.
Update Code
This table shows some typical uses of rlDQNAgentOptions and how to update your code to use the
new option configuration.
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.
See Also
Topics
“Deep Q-Network (DQN) Agents”
rlEpsilonGreedyPolicy
Policy object to generate discrete epsilon-greedy actions for custom training loops
Description
This object implements an epsilon-greedy policy, which returns either the action that maximizes a
discrete action-space Q-value function, with probability 1-Epsilon, or a random action otherwise,
given an input observation. You can create an rlEpsilonGreedyPolicy object from an
rlQValueFunction or rlVectorQValueFunction object, or extract it from an rlQAgent,
rlDQNAgent or rlSARSAAgent. You can then train the policy object using a custom training loop or
deploy it for your application. If UseEpsilonGreedyAction is set to false, the policy is deterministic and therefore does not explore. This object is not compatible with generatePolicyBlock
and generatePolicyFunction. For more information on policies and value functions, see “Create
Policies and Value Functions”.
Creation
Syntax
policy = rlEpsilonGreedyPolicy(qValueFunction)
Description
policy = rlEpsilonGreedyPolicy(qValueFunction) creates the epsilon-greedy policy object policy from the discrete action-space Q-value function qValueFunction. It also sets the QValueFunction property of policy to the input argument qValueFunction.
Properties
QValueFunction — Discrete action-space Q-value function
rlQValueFunction object | rlVectorQValueFunction object
Option to enable epsilon-greedy actions, specified as a logical value: either true (default, enabling
epsilon-greedy actions, which helps exploration) or false (epsilon-greedy actions not enabled). When epsilon-greedy actions are disabled, the policy is deterministic and therefore does not explore.
Example: false
Option to enable epsilon decay, specified as a logical value: either true (default, enabling epsilon decay) or false (disabling epsilon decay).
Example: false
Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the name of
the action channel specified in actionInfo (if any) is not used.
Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.
Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2
Object Functions
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
Examples
Create observation and action specification objects. For this example, define the observation space as
a continuous four-dimensional space, so that a single observation is a column vector containing four
doubles, and the action space as a finite set consisting of two possible row vectors, [1 0] and [0 1].
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec({[1 0],[0 1]});
Create a vector Q-value function approximator to use as critic. A vector Q-value function must accept
an observation as input and return a single vector with as many elements as the number of possible
discrete actions.
To approximate the vector Q-value function within the critic, use a neural network. Define a single
path from the network input to its output as an array of layer objects.
layers = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(10)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Number of learnables: 72
Inputs:
1 'input' 4 features
Create a vector Q-value function using model, and the observation and action specifications.
qValueFcn = rlVectorQValueFunction(model,obsInfo,actInfo)
qValueFcn =
rlVectorQValueFunction with properties:
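The two values below presumably come from checking the critic with a random observation, for example:

getValue(qValueFcn,{rand(obsInfo.Dimension)})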
0.6486
-0.3103
policy = rlEpsilonGreedyPolicy(qValueFcn)
policy =
rlEpsilonGreedyPolicy with properties:
getAction(policy,{rand(obsInfo.Dimension)})
You can now train the policy with a custom training loop and then deploy it to your application.
Version History
Introduced in R2022a
See Also
Functions
rlMaxQPolicy | rlDeterministicActorPolicy | rlAdditiveNoisePolicy |
rlStochasticActorPolicy | rlQValueFunction | rlVectorQValueFunction | rlSARSAAgent
| rlQAgent | rlDQNAgent
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”
rlFiniteSetSpec
Create discrete action or observation data specifications for reinforcement learning environments
Description
An rlFiniteSetSpec object specifies discrete action or observation data specifications for
reinforcement learning environments.
Creation
Syntax
spec = rlFiniteSetSpec(elements)
Description
spec = rlFiniteSetSpec(elements) creates a data specification with a discrete set of actions or observations, setting the Elements property.
Properties
Elements — Set of valid actions or observations
vector | cell array
Set of valid actions or observations for the environment, specified as one of the following:
• Vector — Specify valid numeric values for a single action or single observation.
• Cell array — Specify valid numeric value combinations when you have more than one action or
observation. Each entry of the cell array must have the same dimensions.
Name of the rlFiniteSetSpec object, specified as a string. Use this property to set a meaningful
name for your finite set.
Description of the rlFiniteSetSpec object, specified as a string. Use this property to specify a
meaningful description of the finite set values.
If you specify Elements as a vector, then Dimension is [1 1]. Otherwise, if you specify a cell array,
then Dimension indicates the size of the entries in Elements.
Information about the type of data, specified as a string, such as "double" or "single".
Object Functions
rlSimulinkEnv Create reinforcement learning environment using dynamic model
implemented in Simulink
rlFunctionEnv Specify custom reinforcement learning environment dynamics
using functions
rlValueFunction Value function approximator object for reinforcement learning
agents
rlQValueFunction Q-Value function approximator object for reinforcement learning
agents
rlVectorQValueFunction Vector Q-value function approximator for reinforcement learning
agents
rlContinuousDeterministicActor Deterministic actor with a continuous action space for
reinforcement learning agents
rlDiscreteCategoricalActor Stochastic categorical actor with a discrete action space for
reinforcement learning agents
rlContinuousGaussianActor Stochastic Gaussian actor with a continuous action space for
reinforcement learning agents
Examples
For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.
mdl = 'rlSimplePendulumModel';
open_system(mdl)
Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.
The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.
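The creation code is not shown above; a sketch consistent with the display that follows (the limits shown are the defaults) is:

obsInfo = rlNumericSpec([3 1])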
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
The action is a scalar expressing the torque and can be one of three possible values, -2 Nm, 0 Nm and
2 Nm.
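A corresponding sketch for the action specification:

actInfo = rlFiniteSetSpec([-2 0 2])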
actInfo =
rlFiniteSetSpec with properties:
You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';
Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
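The creation code does not appear above. Based on the properties displayed below, it presumably looks like this:

agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)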
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
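The assignment itself is not shown; based on the ResetFcn property displayed below, it presumably is:

env.ResetFcn = @(in)setVariable(in,'theta0',randn,'Workspace',mdl)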
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on
If the actor for your reinforcement learning agent has multiple outputs, each with a discrete action
space, you can specify the possible discrete actions combinations using an rlFiniteSetSpec object.
Suppose that the valid values for a two-output system are [1 2] for the first output and [10 20 30]
for the second output. Create a discrete action space specification for all possible input combinations.
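The creation code is not shown above. A sketch enumerating all six combinations as a cell array of row vectors:

actionSpec = rlFiniteSetSpec({[1 10],[1 20],[1 30], ...
                              [2 10],[2 20],[2 30]})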
actionSpec =
rlFiniteSetSpec with properties:
Version History
Introduced in R2019a
See Also
rlNumericSpec | rlSimulinkEnv | getActionInfo | getObservationInfo |
rlValueRepresentation | rlQValueRepresentation |
rlDeterministicActorRepresentation | rlStochasticActorRepresentation |
rlFunctionEnv
rlFunctionEnv
Specify custom reinforcement learning environment dynamics using functions
Description
Use rlFunctionEnv to define a custom reinforcement learning environment. You provide MATLAB
functions that define the step and reset behavior for the environment. This object is useful when you
want to customize your environment beyond the predefined environments available with
rlPredefinedEnv.
Creation
Syntax
env = rlFunctionEnv(obsInfo,actInfo,stepfcn,resetfcn)
Description
env = rlFunctionEnv(obsInfo,actInfo,stepfcn,resetfcn) creates a reinforcement learning environment using the provided observation and action specifications, obsInfo and actInfo, and the custom step and reset functions, stepfcn and resetfcn.
Input Arguments
Properties
StepFcn — Step behavior for the environment
function name | function handle | anonymous function handle
Step behavior for the environment, specified as a function name, function handle, or anonymous
function.
StepFcn is a function that you provide which describes how the environment advances to the next
state from a given action. When using a function name or function handle, this function must have
two inputs and four outputs, as illustrated by the following signature.
[Observation,Reward,IsDone,LoggedSignals] = myStepFunction(Action,LoggedSignals)
To use additional input arguments beyond the required set, specify StepFcn using an anonymous
function handle.
The step function computes the values of the observation and reward for the given action in the
environment. The required input and output arguments are as follows.
• Action — Current action, which must match the dimensions and data type specified in actInfo.
• Observation — Returned observation, which must match the dimensions and data types
specified in obsInfo.
• Reward — Reward for the current step, returned as a scalar value.
• IsDone — Logical value indicating whether to end the simulation episode. The step function that
you define can include logic to decide whether to end the simulation based on the observation,
reward, or any other values.
• LoggedSignals — Any data that you want to pass from one step to the next, specified as a
structure.
For an example showing multiple ways to define a step function, see “Create MATLAB Environment
Using Custom Functions”.
Reset behavior for the environment, specified as a function name, function handle, or anonymous function
handle.
The reset function that you provide must have no inputs and two outputs, as illustrated by the
following signature.
[InitialObservation,LoggedSignals] = myResetFunction
To use input arguments with your reset function, specify ResetFcn using an anonymous function
handle.
The reset function sets the environment to an initial state and computes the initial values of the
observation signals. For example, you can create a reset function that randomizes certain state
values, such that each training episode begins from different initial conditions.
The sim function calls the reset function to reset the environment at the start of each simulation, and
the train function calls it at the start of each training episode.
The InitialObservation output must match the dimensions and data type of obsInfo.
To pass information from the reset condition into the first step, specify that information in the reset
function as the output structure LoggedSignals.
For an example showing multiple ways to define a reset function, see “Create MATLAB Environment
Using Custom Functions”.
Information to pass to the next step, specified as a structure. When you create the environment,
whatever you define as the LoggedSignals output of ResetFcn initializes this property. When a
step occurs, the software populates this property with data to pass to the next step, as defined in
StepFcn.
Object Functions
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified environment
validateEnvironment Validate custom reinforcement learning environment
Examples
For this example, create an environment that represents a system for balancing a cart on a pole. The
observations from the environment are the cart position, cart velocity, pendulum angle, and
pendulum angle derivative. (For additional details about this environment, see “Create MATLAB
Environment Using Custom Functions”.) Create an observation specification for those signals.
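The specification code is omitted here. A sketch (the variable name oinfo matches the rlFunctionEnv call below):

oinfo = rlNumericSpec([4 1]);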
The environment has a discrete action space where the agent can apply one of two possible force
values to the cart, –10 N or 10 N. Create the action specification for those actions.
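Similarly, a sketch for the action specification (the variable name ActionInfo matches the rlFunctionEnv call below):

ActionInfo = rlFiniteSetSpec([-10 10]);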
Next, specify the custom step and reset functions. For this example, use the supplied functions
myResetFunction.m and myStepFunction.m. For details about these functions and how they are
constructed, see “Create MATLAB Environment Using Custom Functions”.
Construct the custom environment using the defined observation specification, action specification,
and function names.
env = rlFunctionEnv(oinfo,ActionInfo,'myStepFunction','myResetFunction');
You can create agents for env and train them within the environment as you would for any other
reinforcement learning environment.
As an alternative to using function names, you can specify the functions as function handles. For
more details and an example, see “Create MATLAB Environment Using Custom Functions”.
Version History
Introduced in R2019a
See Also
rlPredefinedEnv | rlCreateEnvTemplate
Topics
“Create MATLAB Environment Using Custom Functions”
rlIsDoneFunction
Is-done function approximator object for neural network-based environment
Description
When creating a neural network-based environment using rlNeuralNetworkEnvironment, you can
specify the is-done function approximator using an rlIsDoneFunction object. Do so when you do
not know a ground-truth termination signal for your environment.
The is-done function approximator object uses a deep neural network as internal approximation
model to predict the termination signal for the environment given one of the following input
combinations.
Creation
Syntax
isdFcnAppx = rlIsDoneFunction(net,observationInfo,actionInfo,Name=Value)
Description
isdFcnAppx = rlIsDoneFunction(net,observationInfo,actionInfo,Name=Value)
creates the is-done function approximator object isdFcnAppx using the deep neural network net
and sets the ObservationInfo and ActionInfo properties.
When creating an is-done function approximator you must specify the names of the deep neural
network inputs using one of the following combinations of name-value pair arguments.
You can also specify the UseDeterministicPredict and UseDevice properties using optional
name-value pair arguments. For example, to use a GPU for prediction, specify UseDevice="gpu".
Input Arguments
Deep neural network with a scalar output value, specified as a dlnetwork object.
The input layer names for this network must match the input names specified using the
ObservationInputNames, ActionInputNames, and NextObservationInputNames. The
dimensions of the input layers must match the dimensions of the corresponding observation and
action specifications in ObservationInfo and ActionInfo, respectively.
The number of observation input names must match the length of ObservationInfo and the order
of the names must match the order of the specifications in ObservationInfo.
Action input layer names, specified as a string or string array. Specify ActionInputNames when you
expect the termination signal to depend on the current action value.
The number of action input names must match the length of ActionInfo and the order of the names
must match the order of the specifications in ActionInfo.
Next observation input layer names, specified as a string or string array. Specify
NextObservationInputNames when you expect the termination signal to depend on the next
environment observation.
The number of next observation input names must match the length of ObservationInfo and the
order of the names must match the order of the specifications in ObservationInfo.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Option to predict the terminal signal deterministically, specified as one of the following values.
Computation device used to perform operations such as gradient computation, parameter updates,
and prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA-enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating a network on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations using a CPU.
Object Functions
rlNeuralNetworkEnvironment Environment model with deep neural network transition models
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
To approximate the is-done function, use a deep neural network. The network has one input channel
for the next observations. The single output channel is for the predicted termination signal.
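The network definition does not appear above. A sketch consistent with the layerGraph call, the input name in the summary, and the two-class softmax output used with evaluate later (the hidden-layer sizes are assumptions) is:

commonPath = [
    featureInputLayer(obsInfo.Dimension(1),Name="nextState")
    fullyConnectedLayer(32)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer(Name="isdone")];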
net = layerGraph(commonPath);
plot(net)
Convert the network to a dlnetwork object and display the number of weights.
net = dlnetwork(net);
summary(net);
Initialized: true
Inputs:
1 'nextState' 4 features
isDoneFcnAppx = rlIsDoneFunction(...
net,obsInfo,actInfo,...
NextObservationInputNames="nextState");
Using this is-done function approximator object, you can predict the termination signal based on the
next observation. For example, predict the termination signal for a random next observation. Since
for this example the termination signal only depends on the next observation, use empty cell arrays
for the current action and observation inputs.
nxtobs = rand(obsInfo.Dimension);
predIsDone = predict(isDoneFcnAppx,{},{},{nxtobs})
predIsDone = 0
predIsDoneProb = evaluate(isDoneFcnAppx,{nxtobs})
predIsDoneProb{1}
0.5405
0.4595
The first number is the probability of obtaining a 0 (no termination predicted), the second one is the
probability of obtaining a 1 (termination predicted).
Version History
Introduced in R2022a
See Also
Objects
rlNeuralNetworkEnvironment | rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction | evaluate | accelerate | gradient
Topics
“Model-Based Policy Optimization Agents”
rlMaxQPolicy
Policy object to generate discrete max-Q actions for custom training loops and application
deployment
Description
This object implements a max-Q policy, which returns the action that maximizes a discrete action-
space Q-value function, given an input observation. You can create an rlMaxQPolicy object from an
rlQValueFunction or rlVectorQValueFunction object, or extract it from an rlQAgent,
rlDQNAgent or rlSARSAAgent. You can then train the policy object using a custom training loop or
deploy it for your application using generatePolicyBlock or generatePolicyFunction. This
policy is always deterministic and does not perform any exploration. For more information on policies
and value functions, see “Create Policies and Value Functions”.
Creation
Syntax
policy = rlMaxQPolicy(qValueFunction)
Description
policy = rlMaxQPolicy(qValueFunction) creates the max-Q policy object policy from the
discrete action-space Q-value function qValueFunction. It also sets the QValueFunction property
of policy to the input argument qValueFunction.
Properties
QValueFunction — Discrete action-space Q-value function
rlQValueFunction object
Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the name of
the action channel specified in actionInfo (if any) is not used.
Sample time of policy, specified as a positive scalar or as -1 (default). Setting this parameter to -1
allows for event-based simulations.
Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2
Object Functions
generatePolicyBlock Generate Simulink block that evaluates policy of an agent or policy object
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
Examples
Create observation and action specification objects. For this example, define the observation space as
a continuous four-dimensional space, so that a single observation is a column vector containing four
doubles, and the action space as a finite set consisting of two possible values, -1 and 1.
Alternatively, you can use getObservationInfo and getActionInfo to extract the specification
objects from an environment.
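A sketch of the corresponding specification objects:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);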
Create a vector Q-value function approximator to use as critic. A vector Q-value function must accept
an observation as input and return a single vector with as many elements as the number of possible
discrete actions.
To approximate the vector Q-value function within the critic, use a neural network. Define a single
path from the network input to its output as an array of layer objects.
layers = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(10)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Number of learnables: 72
Inputs:
1 'input' 4 features
Create a vector Q-value function using model, and the observation and action specifications.
qValueFcn = rlVectorQValueFunction(model,obsInfo,actInfo)
qValueFcn =
rlVectorQValueFunction with properties:
getValue(qValueFcn,{rand(obsInfo.Dimension)})
0.6486
-0.3103
policy = rlMaxQPolicy(qValueFcn)
policy =
rlMaxQPolicy with properties:
getAction(policy,{rand(obsInfo.Dimension)})
You can now train the policy with a custom training loop and then deploy it to your application.
Version History
Introduced in R2022a
See Also
Functions
rlEpsilonGreedyPolicy | rlDeterministicActorPolicy | rlAdditiveNoisePolicy |
rlStochasticActorPolicy | rlQValueFunction | rlVectorQValueFunction | rlSARSAAgent
| rlQAgent | rlDQNAgent | generatePolicyBlock | generatePolicyFunction
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”
rlMBPOAgent
Model-based policy optimization reinforcement learning agent
Description
The model-based policy optimization (MBPO) algorithm is a model-based, online, off-policy reinforcement learning method. An MBPO agent contains an internal model of the environment, which it uses to
generate additional experiences without interacting with the environment.
During training, the MBPO agent generates real experiences by interacting with the environment.
These experiences are used to train the internal environment model, which is used to generate
additional experiences. The training algorithm then uses both the real and generated experiences to
update the agent policy.
Creation
Syntax
agent = rlMBPOAgent(baseAgent,envModel)
agent = rlMBPOAgent( ___ ,agentOptions)
Description
agent = rlMBPOAgent(baseAgent,envModel) creates a model-based policy optimization agent with default options, using the specified base off-policy agent and neural network environment model. agent = rlMBPOAgent( ___ ,agentOptions) creates the agent and sets the AgentOptions property to the agentOptions input argument.
Properties
BaseAgent — Base reinforcement learning agent
rlDQNAgent | rlDDPGAgent | rlTD3Agent | rlSACAgent
For environments with a discrete action space, specify a DQN agent using an rlDQNAgent object.
For environments with a continuous action space, use one of the following agent objects.
Current roll-out horizon value, specified as a positive integer. For more information on setting the
initial horizon value and the horizon update method, see rlMBPOAgentOptions.
Model experience buffer, specified as an rlReplayMemory object. During training the agent stores
each of its generated experiences (S,A,R,S',D) in a buffer. Here:
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
The initial value of UseExplorationPolicy matches the value specified in BaseAgent. If you
change the value of UseExplorationPolicy in either the base agent or the MBPO agent, the same
value is used for the other agent.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
The initial value of SampleTime matches the value specified in BaseAgent. If you change the value
of SampleTime in either the base agent or the MBPO agent, the same value is used for the other
agent.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified environment
Examples
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a base off-policy agent. For this example, use a SAC agent.
agentOpts = rlSACAgentOptions;
agentOpts.MiniBatchSize = 256;
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);
baseagent = rlSACAgent(obsInfo,actInfo,initOpts,agentOpts);
getAction(baseagent,{rand(obsInfo.Dimension)})
The neural network environment uses a function approximator object to approximate the
environment transition function. The function approximator object uses one or more neural networks
as its approximation model. To account for modeling uncertainty, you can specify multiple transition
models. For this example, create a single transition model.
Create a neural network to use as approximation model within the transition function object. Define
each network path as an array of layer objects. Specify a name for the input and output layers, so you
can later explicitly associate them with the appropriate channel.
% Observation and action paths
obsPath = featureInputLayer(obsInfo.Dimension(1),Name="obsIn");
actionPath = featureInputLayer(actInfo.Dimension(1),Name="actIn");
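The common path and the assembly of the transition network are not shown in this excerpt. A minimal sketch follows; the output layer name "nextObsOut" and the 64-unit hidden layer are assumptions.
% Common path (sketch)
commonPath = [
    concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObsOut") ];
% Assemble the network
transNet = layerGraph(obsPath);
transNet = addLayers(transNet,actionPath);
transNet = addLayers(transNet,commonPath);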
% Connect layers
transNet = connectLayers(transNet,"obsIn","concat/in1");
transNet = connectLayers(transNet,"actIn","concat/in2");
transNet = dlnetwork(transNet);
summary(transNet)
Initialized: true
Inputs:
1 'obsIn' 4 features
2 'actIn' 1 features
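The creation of the transition function approximator object is also not shown. A minimal sketch, assuming the layer names used above (including the assumed output name "nextObsOut"):
transitionFcn = rlContinuousDeterministicTransitionFunction( ...
    transNet,obsInfo,actInfo, ...
    ObservationInputNames="obsIn", ...
    ActionInputNames="actIn", ...
    NextObservationOutputNames="nextObsOut");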
Create a neural network to use as a reward model for the reward function approximator object.
% Observation and action paths
actionPath = featureInputLayer(actInfo.Dimension(1),Name="actIn");
nextObsPath = featureInputLayer(obsInfo.Dimension(1),Name="nextObsIn");
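The common path and the assembly of the reward network are not shown in this excerpt. A minimal sketch, assuming a scalar output layer named "rewardOut":
% Common path (sketch)
commonPath = [
    concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1,Name="rewardOut") ];
% Assemble the network
rewardNet = layerGraph(nextObsPath);
rewardNet = addLayers(rewardNet,actionPath);
rewardNet = addLayers(rewardNet,commonPath);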
% Connect layers
rewardNet = connectLayers(rewardNet,"nextObsIn","concat/in1");
rewardNet = connectLayers(rewardNet,"actIn","concat/in2");
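The reward function approximator object is not created in this excerpt. A minimal sketch, assuming the input layer names defined above:
rewardNet = dlnetwork(rewardNet);
rewardFcn = rlContinuousDeterministicRewardFunction( ...
    rewardNet,obsInfo,actInfo, ...
    ActionInputNames="actIn", ...
    NextObservationInputNames="nextObsIn");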
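The is-done network and its function approximator object are also missing from this excerpt. A minimal sketch; the network architecture and the variable names isDoneNet and isDoneFcn are assumptions:
isDoneNet = [
    featureInputLayer(obsInfo.Dimension(1),Name="nextObsIn")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer(Name="isdone") ];
isDoneNet = dlnetwork(isDoneNet);
isDoneFcn = rlIsDoneFunction(isDoneNet,obsInfo,actInfo, ...
    NextObservationInputNames="nextObsIn");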
Create the neural network environment using the observation and action specifications and the three
function approximator objects.
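The environment creation call is not shown in this excerpt. A minimal sketch, using the approximator objects sketched above (the variable name generativeEnv matches the rlMBPOAgent call below):
generativeEnv = rlNeuralNetworkEnvironment( ...
    obsInfo,actInfo, ...
    transitionFcn,rewardFcn,isDoneFcn);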
Specify options for creating an MBPO agent. Specify the optimizer options for the transition network
and use default values for all other options.
MBPOAgentOpts = rlMBPOAgentOptions;
MBPOAgentOpts.TransitionOptimizerOptions = rlOptimizerOptions(...
LearnRate=1e-4,...
GradientThreshold=1.0);
agent = rlMBPOAgent(baseagent,generativeEnv,MBPOAgentOpts);
getAction(agent,{rand(obsInfo.Dimension)})
Version History
Introduced in R2022a
See Also
Objects
rlMBPOAgentOptions | rlNeuralNetworkEnvironment
Topics
“Model-Based Policy Optimization Agents”
“Train MBPO Agent to Balance Cart-Pole System”
rlMBPOAgentOptions
Options for MBPO agent
Description
Use an rlMBPOAgentOptions object to specify options for model-based policy optimization (MBPO)
agents. To create an MBPO agent, use rlMBPOAgent.
Creation
Syntax
opt = rlMBPOAgentOptions
opt = rlMBPOAgentOptions(Name=Value)
Description
opt = rlMBPOAgentOptions creates an option object for use as an argument when creating an
MBPO agent using all default options. You can modify the object properties using dot notation.
Properties
NumEpochForTrainingModel — Number of epochs
5 (default) | positive integer
Number of epochs for training the environment model, specified as a positive integer.
Number of mini-batches used in each environment model training epoch, specified as a positive
scalar or "all". When you set NumMiniBatches to "all", the agent selects the number of mini-
batches such that all samples in the base agent's experience buffer are used to train the model.
Size of random experience mini-batch for training environment model, specified as a positive integer.
During each model training episode, the agent randomly samples experiences from the experience
buffer when computing gradients for updating the environment model properties. Large mini-batches
reduce the variance when computing gradients but increase the computational effort.
Generated experience buffer size, specified as a positive integer. When the agent generates
experiences, they are added to the model experience buffer.
Ratio of real experiences in a mini-batch for agent training, specified as a nonnegative scalar less
than or equal to 1.
• rlOptimizerOptions object — When your neural network environment has a single transition
function or if you want to use the same options for multiple transition functions, specify a single
options object.
• Array of rlOptimizerOptions objects — When your neural network environment agent has
multiple transition functions and you want to use different optimizer options for the transition
functions, specify an array of options objects with length equal to the number of transition
functions.
Using these objects, you can specify training parameters for the transition deep neural network
approximators as well as the optimizer algorithms and parameters.
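For example, a minimal sketch of specifying different optimizer options for two transition models (the learning-rate values are arbitrary):
opt = rlMBPOAgentOptions;
opt.TransitionOptimizerOptions = [
    rlOptimizerOptions(LearnRate=1e-3)
    rlOptimizerOptions(LearnRate=5e-4)];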
If you have previously trained transition models and do not want the MBPO agent to modify these
models during training, set TransitionOptimizerOptions.LearnRate to 0.
Reward function optimizer options, specified as an rlOptimizerOptions object. Using this object,
you can specify training parameters for the reward deep neural network approximator as well as the
optimizer algorithm and its parameters.
If you specify a ground-truth reward function using a custom function, the MBPO agent ignores these
options.
If you have a previously trained reward model and do not want the MBPO agent to modify the model
during training, set RewardOptimizerOptions.LearnRate to 0.
Is-done function optimizer options, specified as an rlOptimizerOptions object. Using this object,
you can specify training parameters for the is-done deep neural network approximator as well as the
optimizer algorithm and its parameters.
If you specify a ground-truth is-done function using a custom function, the MBPO agent ignores these
options.
If you have a previously trained is-done model and do not want the MBPO agent to modify the model
during training, set IsDoneOptimizerOptions.LearnRate to 0.
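Taken together, a minimal sketch of freezing previously trained transition, reward, and is-done models:
opt = rlMBPOAgentOptions;
opt.TransitionOptimizerOptions.LearnRate = 0; % keep the pre-trained transition model fixed
opt.RewardOptimizerOptions.LearnRate = 0;     % keep the pre-trained reward model fixed
opt.IsDoneOptimizerOptions.LearnRate = 0;     % keep the pre-trained is-done model fixed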
Model roll-out options for controlling the number and length of generated experience trajectories,
specified as an rlModelRolloutOptions object with the following fields. At the start of each epoch,
the agent generates the roll-out trajectories and adds them to the model experience buffer. To modify
the roll-out options, use dot notation.
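For example, a minimal sketch that sets the initial roll-out horizon (the value is arbitrary) and keeps it fixed:
opt = rlMBPOAgentOptions;
opt.ModelRolloutOptions.RolloutHorizon = 2;
opt.ModelRolloutOptions.RolloutHorizonSchedule = "none";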
Option for increasing the horizon length, specified as one of the following values.
Number of epochs after which the horizon increases, specified as a positive integer. When
RolloutHorizonSchedule is "none" this option is ignored.
Maximum horizon length, specified as a positive integer greater than or equal to RolloutHorizon.
When RolloutHorizonSchedule is "none" this option is ignored.
Exploration model options for generating experiences using the internal environment model,
specified as one of the following:
• [] — Use the exploration policy of the base agent. You must use this option when training a SAC
base agent.
• EpsilonGreedyExploration object — You can use this option when training a DQN base agent.
• GaussianActionNoise object — You can use this option when training a DDPG or TD3 base
agent.
The exploration model uses only the initial noise option values and does not update the values during
training.
To specify NoiseOptions, create a default model object. Then, specify any nondefault model
properties using dot notation.
For more information on noise models, see “Noise Models” on page 3-176.
Object Functions
rlMBPOAgent Model-based policy optimization reinforcement learning agent
Examples
Create an MBPO agent options object, specifying the ratio of real experiences to use for training the
agent as 30%.
opt = rlMBPOAgentOptions(RealSampleRatio=0.3)
opt =
rlMBPOAgentOptions with properties:
NumEpochForTrainingModel: 1
NumMiniBatches: 10
MiniBatchSize: 128
TransitionOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
RewardOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
IsDoneOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
ModelExperienceBufferLength: 100000
ModelRolloutOptions: [1x1 rl.option.rlModelRolloutOptions]
RealSampleRatio: 0.3000
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the mini-batch size to 64.
opt.MiniBatchSize = 64;
Algorithms
Noise Models
Gaussian Action Noise
A GaussianActionNoise object has the following numeric value properties. When generating
experiences, MBPO agents do not update their exploration model parameters.
At each time step k, the Gaussian noise v is sampled as shown in the following code.
w = Mean + randn(ActionSize).*StandardDeviation(k);
v(k+1) = min(max(w,LowerLimit),UpperLimit);
Version History
Introduced in R2022a
See Also
Objects
rlMBPOAgent | rlNeuralNetworkEnvironment
Topics
“Model-Based Policy Optimization Agents”
rlMDPEnv
Create Markov decision process environment for reinforcement learning
Description
A Markov decision process (MDP) is a discrete time stochastic control process. It provides a
mathematical framework for modeling decision making in situations where outcomes are partly
random and partly under the control of the decision maker. MDPs are useful for studying optimization
problems solved using reinforcement learning. Use rlMDPEnv to create a Markov decision process
environment for reinforcement learning in MATLAB.
Creation
Syntax
env = rlMDPEnv(MDP)
Description
env = rlMDPEnv(MDP) creates a reinforcement learning environment env with the specified MDP
model.
Input Arguments
Properties
Model — Markov decision process model
GridWorld object | GenericMDP object
Object Functions
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
sim Simulate trained reinforcement learning agents within specified environment
train Train reinforcement learning agents within a specified environment
validateEnvironment Validate custom reinforcement learning environment
Examples
For this example, consider a 5-by-5 grid world with the following rules:
1 The grid world is 5-by-5 and bounded by borders, with four possible actions (North = 1, South = 2, East = 3,
West = 4).
2 The agent begins from cell [2,1] (second row, first column).
3 The agent receives a reward of +10 if it reaches the terminal state at cell [5,5] (blue).
4 The environment contains a special jump from cell [2,4] to cell [4,4] with a reward of +5.
5 The agent is blocked by obstacles in cells [3,3], [3,4], [3,5], and [4,3] (black cells).
6 All other actions result in a reward of -1.
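The call that creates the grid world object is not shown in this excerpt; presumably it uses createGridWorld, for example:
GW = createGridWorld(5,5)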
GW =
GridWorld with properties:
GridSize: [5 5]
CurrentState: "[1,1]"
States: [25x1 string]
Actions: [4x1 string]
T: [25x25x4 double]
R: [25x25x4 double]
ObstacleStates: [0x1 string]
TerminalStates: [0x1 string]
ProbabilityTolerance: 8.8818e-16
GW.CurrentState = '[2,1]';
GW.TerminalStates = '[5,5]';
GW.ObstacleStates = ["[3,3]";"[3,4]";"[3,5]";"[4,3]"];
Update the state transition matrix for the obstacle states and set the jump rule over the obstacle
states.
updateStateTranstionForObstacles(GW)
GW.T(state2idx(GW,"[2,4]"),:,:) = 0;
GW.T(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 1;
nS = numel(GW.States);
nA = numel(GW.Actions);
GW.R = -1*ones(nS,nS,nA);
GW.R(state2idx(GW,"[2,4]"),state2idx(GW,"[4,4]"),:) = 5;
GW.R(:,state2idx(GW,GW.TerminalStates),:) = 10;
Now, use rlMDPEnv to create a grid world environment using the GridWorld object GW.
env = rlMDPEnv(GW)
env =
rlMDPEnv with properties:
You can visualize the grid world environment using the plot function.
plot(env)
Version History
Introduced in R2019a
See Also
createGridWorld | rlPredefinedEnv
Topics
“Train Reinforcement Learning Agent in Basic Grid World”
“Create Custom Grid World Environments”
“Train Reinforcement Learning Agent in MDP Environment”
rlMultiAgentTrainingOptions
Options for training multiple reinforcement learning agents
Description
Use an rlMultiAgentTrainingOptions object to specify training options for multiple agents. To
train the agents, use train.
For more information on training agents, see “Train Reinforcement Learning Agents”.
Creation
Syntax
trainOpts = rlMultiAgentTrainingOptions
trainOpts = rlMultiAgentTrainingOptions(Name,Value)
Description
Properties
AgentGroups — Agent grouping indices
"auto" (default) | cell array of positive integers | cell array of integer arrays
Agent grouping indices, specified as a cell array of positive integers or a cell array of integer arrays.
For instance, consider a training scenario with 4 agents. You can group the agents in the following
ways:
trainOpts = rlMultiAgentTrainingOptions("AgentGroups","auto")
• Specify four agent groups with one agent in each group:
trainOpts = rlMultiAgentTrainingOptions("AgentGroups",{1,2,3,4})
• Specify two agent groups with two agents each:
trainOpts = rlMultiAgentTrainingOptions("AgentGroups",{[1,2],[3,4]})
• Specify three agent groups:
trainOpts = rlMultiAgentTrainingOptions("AgentGroups",{[1,4],2,3})
AgentGroups and LearningStrategy must be used together to specify whether agent groups
learn in a centralized manner or decentralized manner.
Example: AgentGroups={1,2,[3,4]}
Learning strategy for each agent group, specified as either "decentralized" or "centralized".
In decentralized training, agents collect their own set of experiences during the episodes and learn
independently from those experiences. In centralized training, the agents share the collected
experiences and learn from them together.
AgentGroups and LearningStrategy must be used together to specify whether agent groups
learn in a centralized manner or decentralized manner. For example, you can use the following
command to configure training for three agent groups with different learning strategies. The agents
with indices [1,2] and [3,5] learn in a centralized manner, while agent 4 learns in a decentralized
manner.
trainOpts = rlMultiAgentTrainingOptions(...
AgentGroups={[1,2],4,[3,5]},...
LearningStrategy=["centralized","decentralized","centralized"] )
Example: LearningStrategy="centralized"
Maximum number of episodes to train the agents, specified as a positive integer. Regardless of other
criteria for termination, training terminates after MaxEpisodes.
Example: MaxEpisodes=1000
Maximum number of steps to run per episode, specified as a positive integer. In general, you define
episode termination conditions in the environment. This value is the maximum number of steps to run
in the episode if other termination conditions are not met.
Example: MaxStepsPerEpisode=1000
Window length for averaging the scores, rewards, and number of steps for each agent, specified as a
scalar or vector.
Specify a scalar to apply the same window length to all agents. To use a different window length for
each agent, specify ScoreAveragingWindowLength as a vector. In this case, the order of the
elements in the vector corresponds to the order of the agents used during environment creation.
Example: ScoreAveragingWindowLength=10
• "AverageSteps" — Stop training when the running average number of steps per episode equals
or exceeds the critical value specified by the option StopTrainingValue. The average is
computed using the window specified by ScoreAveragingWindowLength.
• "AverageReward" — Stop training when the running average reward equals or exceeds the
critical value.
• "EpisodeReward" — Stop training when the reward in the current episode equals or exceeds the
critical value.
• "GlobalStepCount" — Stop training when the total number of steps in all episodes (the total
number of times the agent is invoked) equals or exceeds the critical value.
• "EpisodeCount" — Stop training when the number of training episodes equals or exceeds the
critical value.
Example: StopTrainingCriteria="AverageReward"
Specify a scalar to apply the same termination criterion to all agents. To use a different termination
criterion for each agent, specify StopTrainingValue as a vector. In this case, the order of the
elements in the vector corresponds to the order of the agents used during environment creation.
For a given agent, training ends when the termination condition specified by the
StopTrainingCriteria option equals or exceeds this value. For the other agents, the training
continues until:
Condition for saving agents during training, specified as one of the following strings:
Set this option to store candidate agents that perform well according to the criteria you specify. When
you set this option to a value other than "none", the software sets the SaveAgentValue option to
500. You can change that value to specify the condition for saving the agent.
For instance, suppose you want to store for further testing any agent that yields an episode reward
that equals or exceeds 100. To do so, set SaveAgentCriteria to "EpisodeReward" and set the
SaveAgentValue option to 100. When an episode reward equals or exceeds 100, train saves the
corresponding agent in a MAT file in the folder specified by the SaveAgentDirectory option. The
MAT file is called AgentK.mat, where K is the number of the corresponding episode. The agent is
stored within that MAT file as saved_agent.
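For example, a hypothetical snippet that loads such a saved agent after training (the episode number 250 is made up; the default folder name savedAgents is used):
data = load(fullfile("savedAgents","Agent250.mat"));
candidateAgent = data.saved_agent;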
Example: SaveAgentCriteria="EpisodeReward"
Critical value of the condition for saving agents, specified as a scalar or a vector.
Specify a scalar to apply the same saving criterion to each agent. To save the agents when one meets
a particular criterion, specify SaveAgentValue as a vector. In this case, the order of the elements in
the vector corresponds to the order of the agents used when creating the environment. When a
criterion for saving an agent is met, all agents are saved in the same MAT file.
When you specify a condition for saving candidate agents using SaveAgentCriteria, the software
sets this value to 500. Change the value to specify the condition for saving the agent. See the
SaveAgentCriteria option for more details.
Example: SaveAgentValue=100
Folder for saved agents, specified as a string or character vector. The folder name can contain a full
or relative path. When an episode occurs that satisfies the condition specified by the
SaveAgentCriteria and SaveAgentValue options, the software saves the agents in a MAT file in
this folder. If the folder does not exist, train creates it. When SaveAgentCriteria is "none", this
option is ignored and train does not create a folder.
Example: SaveAgentDirectory = pwd + "\run1\Agents"
Option to stop training when an error occurs during an episode, specified as "on" or "off". When
this option is "off", errors are captured and returned in the SimulationInfo output of train, and
training continues to the next episode.
Example: StopOnError = "off"
Option to display training progress on the command line, specified as the logical value false (0) or
true (1). Set to true to write information from each training episode to the MATLAB command line
during training.
Example: Verbose = true
Object Functions
train Train reinforcement learning agents within a specified environment
Examples
Create an options set for training 5 reinforcement learning agents. Set the maximum number of
episodes and the maximum number of steps per episode to 1000. Configure the options to stop
training when the average reward equals or exceeds 480, and turn on both the command-line display
and Reinforcement Learning Episode Manager for displaying training results. You can set the options
using name-value pair arguments when you create the options set. Any options that you do not
explicitly set have their default values.
trainOpts = rlMultiAgentTrainingOptions(...
AgentGroups={[1,2],3,[4,5]},...
LearningStrategy=["centralized","decentralized","centralized"],...
MaxEpisodes=1000,...
MaxStepsPerEpisode=1000,...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=480,...
Verbose=true,...
Plots="training-progress")
trainOpts =
rlMultiAgentTrainingOptions with properties:
Alternatively, create a default options set and use dot notation to change some of the values.
trainOpts = rlMultiAgentTrainingOptions;
trainOpts.AgentGroups = {[1,2],3,[4,5]};
trainOpts.LearningStrategy = ["centralized","decentralized","centralized"];
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 1000;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 480;
trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";
trainOpts
trainOpts =
rlMultiAgentTrainingOptions with properties:
You can now use trainOpts as an input argument to the train command.
Create an options object for concurrently training three agents in the same environment.
Set the maximum number of episodes and the maximum steps per episode to 1000. Configure the
options to stop training the first agent when its average reward over 5 episodes equals or exceeds
400, the second agent when its average reward over 10 episodes equals or exceeds 500, and the
third when its average reward over 15 episodes equals or exceeds 600. The order of agents is the one
used during environment creation.
Save the agents when the episode reward for the first agent equals or exceeds 100, the reward for the
second agent equals or exceeds 120, or the reward for the third agent equals or exceeds 140.
Turn on both the command-line display and Reinforcement Learning Episode Manager for displaying
training results. You can set the options using name-value pair arguments when you create the
options set. Any options that you do not explicitly set have their default values.
trainOpts = rlMultiAgentTrainingOptions(...
MaxEpisodes=1000,...
MaxStepsPerEpisode=1000,...
ScoreAveragingWindowLength=[5 10 15],...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=[400 500 600],...
SaveAgentCriteria="EpisodeReward",...
SaveAgentValue=[100 120 140],...
Verbose=true,...
Plots="training-progress")
trainOpts =
rlMultiAgentTrainingOptions with properties:
AgentGroups: "auto"
LearningStrategy: "decentralized"
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: [5 10 15]
StopTrainingCriteria: "AverageReward"
StopTrainingValue: [400 500 600]
SaveAgentCriteria: "EpisodeReward"
SaveAgentValue: [100 120 140]
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
Alternatively, create a default options set and use dot notation to change some of the values.
trainOpts = rlMultiAgentTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 1000;
trainOpts.ScoreAveragingWindowLength = [5 10 15];
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = [400 500 600];
trainOpts.SaveAgentCriteria = "EpisodeReward";
trainOpts.SaveAgentValue = [100 120 140];
trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";
trainOpts
trainOpts =
rlMultiAgentTrainingOptions with properties:
AgentGroups: "auto"
LearningStrategy: "decentralized"
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: [5 10 15]
StopTrainingCriteria: "AverageReward"
StopTrainingValue: [400 500 600]
SaveAgentCriteria: "EpisodeReward"
SaveAgentValue: [100 120 140]
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
You can specify a scalar to apply the same criterion to all agents. For example, use a window length of
10 for all three agents.
trainOpts.ScoreAveragingWindowLength = 10
trainOpts =
rlMultiAgentTrainingOptions with properties:
AgentGroups: "auto"
LearningStrategy: "decentralized"
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 10
StopTrainingCriteria: "AverageReward"
StopTrainingValue: [400 500 600]
SaveAgentCriteria: "EpisodeReward"
SaveAgentValue: [100 120 140]
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
You can now use trainOpts as an input argument to the train command.
Version History
Introduced in R2022a
See Also
train | rlTrainingOptions
Topics
“Reinforcement Learning Agents”
rlNeuralNetworkEnvironment
Environment model with deep neural network transition models
Description
Use an rlNeuralNetworkEnvironment object to create a reinforcement learning environment that
computes state transitions using deep neural networks. Using this environment object, you can:
• Create an internal environment model for a model-based policy optimization (MBPO) agent.
• Create an environment for training other types of reinforcement learning agents. You can identify
the state-transition network using experimental or simulated data.
Such environments can compute environment rewards and termination conditions using deep neural
networks or custom functions.
Creation
Syntax
env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,rewardFcn,
isDoneFcn)
Description
env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,rewardFcn,
isDoneFcn) creates a model for an environment with the observation and action specifications
specified in obsInfo and actInfo, respectively. This syntax sets the TransitionFcn, RewardFcn,
and IsDoneFcn properties.
Input Arguments
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Properties
TransitionFcn — Environment transition function
rlContinuousDeterministicTransitionFunction object |
rlContinuousGaussianTransitionFunction object | array of transition objects
• rlIsDoneFunction object — Use this option when you do not know a ground-truth termination
signal for your environment.
• Function handle — Use this option when you know a ground-truth termination signal for your
environment. When you use an rlNeuralNetworkEnvironment object to create an
rlMBPOAgent object, the custom is-done function must return a batch of termination signals
given a batch of inputs.
Observation values, specified as a cell array with length equal to the number of specification objects
in obsInfo. The order of the observations in Observation must match the order in obsInfo. Also,
the dimensions of each element of the cell array must match the dimensions of the corresponding
observation specification in obsInfo.
To evaluate whether the transition models are well-trained, you can manually evaluate the
environment for a given observation value using the step function. Specify the observation values
before calling step.
When you use this neural network environment object within an MBPO agent, this property is
ignored.
To evaluate whether the transition models are well-trained, you can manually evaluate the
environment for a given observation value using the step function. To select which transition model
in TransitionFcn to evaluate, specify the transition model index before calling step.
When you use this neural network environment object within an MBPO agent, this property is
ignored.
Object Functions
rlMBPOAgent Model-based policy optimization reinforcement learning agent
Examples
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numActions = actInfo.Dimension(1);
Create a deterministic transition function based on a deep neural network with two input channels
(current observations and actions) and one output channel (predicted next observation).
% Create network layers.
statePath = featureInputLayer(numObservations, ...
Normalization="none",Name="state");
actionPath = featureInputLayer(numActions, ...
Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64, Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(numObservations,Name="nextObservation")];
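The assembly of this network and the creation of the transition function approximator object are not shown in this excerpt. A minimal sketch, using the layer names defined above:
% Assemble the network and convert it to a dlnetwork object (sketch).
transNetwork = layerGraph(statePath);
transNetwork = addLayers(transNetwork,actionPath);
transNetwork = addLayers(transNetwork,commonPath);
transNetwork = connectLayers(transNetwork,"state","concat/in1");
transNetwork = connectLayers(transNetwork,"action","concat/in2");
transNetwork = dlnetwork(transNetwork);
% Create the transition function approximator object (sketch).
transitionFcn = rlContinuousDeterministicTransitionFunction( ...
    transNetwork,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");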
Create a deterministic reward function with two input channels (current action and next
observations) and one output channel (predicted reward value).
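The reward network itself is not shown in this excerpt. A minimal sketch follows; the layer architecture and the variable names are assumptions, but the resulting rewardFcn object is required by the environment creation call below.
% Reward network paths (sketch)
actionPathRwd = featureInputLayer(numActions,Normalization="none",Name="action");
nextStatePathRwd = featureInputLayer(numObservations,Normalization="none",Name="nextState");
commonPathRwd = [concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64,Name="rwdFC1")
    reluLayer
    fullyConnectedLayer(1,Name="reward")];
% Assemble the network and convert it to a dlnetwork object.
rewardNetwork = layerGraph(nextStatePathRwd);
rewardNetwork = addLayers(rewardNetwork,actionPathRwd);
rewardNetwork = addLayers(rewardNetwork,commonPathRwd);
rewardNetwork = connectLayers(rewardNetwork,"nextState","concat/in1");
rewardNetwork = connectLayers(rewardNetwork,"action","concat/in2");
rewardNetwork = dlnetwork(rewardNetwork);
% Create the reward function approximator object (sketch).
rewardFcn = rlContinuousDeterministicRewardFunction( ...
    rewardNetwork,obsInfo,actInfo, ...
    ActionInputNames="action", ...
    NextObservationInputNames="nextState");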
Create an is-done function with one input channel (next observations) and one output channel
(predicted termination signal).
% Create network layers.
commonPath = [featureInputLayer(numObservations, ...
Normalization="none",Name="nextState");
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64,Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(2,Name="isdone0")
softmaxLayer(Name="isdone")];
isDoneNetwork = layerGraph(commonPath);
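To create the is-done function approximator object from this network, a minimal sketch (the rlIsDoneFunction call assumes the "nextState" input layer name defined above):
isDoneNetwork = dlnetwork(isDoneNetwork);
isDoneFcn = rlIsDoneFunction(isDoneNetwork,obsInfo,actInfo, ...
    NextObservationInputNames="nextState");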
Create a neural network environment using the transition, reward, and is-done functions.
env = rlNeuralNetworkEnvironment( ...
obsInfo,actInfo, ...
transitionFcn,rewardFcn,isDoneFcn);
Create an environment interface and extract observation and action specifications. Alternatively, you
can create specifications using rlNumericSpec and rlFiniteSetSpec.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numActions = actInfo.Dimension(1);
Create a deterministic transition function based on a deep neural network with two input channels
(current observations and actions) and one output channel (predicted next observation).
% Create network layers.
statePath = featureInputLayer(numObservations,...
Normalization="none",Name="state");
actionPath = featureInputLayer(numActions,...
Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
fullyConnectedLayer(64,Name="FC1")
reluLayer(Name="CriticRelu1")
fullyConnectedLayer(64, Name="FC3")
reluLayer(Name="CriticCommonRelu2")
fullyConnectedLayer(numObservations,Name="nextObservation")];
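As in the previous example, the network assembly and the creation of the transition function approximator object are not shown. A minimal sketch, using the layer names defined above:
transNetwork = layerGraph(statePath);
transNetwork = addLayers(transNetwork,actionPath);
transNetwork = addLayers(transNetwork,commonPath);
transNetwork = connectLayers(transNetwork,"state","concat/in1");
transNetwork = connectLayers(transNetwork,"action","concat/in2");
transNetwork = dlnetwork(transNetwork);
transitionFcn = rlContinuousDeterministicTransitionFunction( ...
    transNetwork,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");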
You can define a known reward function for your environment using a custom function. Your custom
reward function must take the observations, actions, and next observations as cell-array inputs and
return a scalar reward value. For this example, use the following custom reward function, which
computes the reward based on the next observation.
type cartPoleRewardFunction.m
if iscell(nextObs)
nextObs = nextObs{1};
end
x = nextObs(1,:);
distReward = 1 - abs(x)/xThreshold;
isDone = cartPoleIsDoneFunction(obs,action,nextObs);
reward = zeros(size(isDone));
reward(logical(isDone)) = penaltyForFalling;
reward(~logical(isDone)) = ...
You can define a known is-done function for your environment using a custom function. Your custom
is-done function must take the observations, actions, and next observations as cell-array inputs and
return a logical termination signal. For this example, use the following custom is-done function, which
computes the termination signal based on the next observation.
type cartPoleIsDoneFunction.m
if iscell(nextObs)
nextObs = nextObs{1};
end
x = nextObs(1,:);
theta = nextObs(3,:);
Create a neural network environment using the transition function object and the custom reward and
is-done functions.
env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,...
@cartPoleRewardFunction,@cartPoleIsDoneFunction);
Version History
Introduced in R2022a
See Also
Objects
rlMBPOAgent | rlMBPOAgentOptions | rlContinuousDeterministicTransitionFunction |
rlContinuousGaussianTransitionFunction |
rlContinuousDeterministicRewardFunction | rlContinuousGaussianRewardFunction |
rlIsDoneFunction
Topics
“Model-Based Policy Optimization Agents”
rlNumericSpec
Create continuous action or observation data specifications for reinforcement learning environments
Description
An rlNumericSpec object specifies continuous action or observation data specifications for
reinforcement learning environments.
Creation
Syntax
spec = rlNumericSpec(dimension)
spec = rlNumericSpec(dimension,Name,Value)
Description
Properties
LowerLimit — Lower limit of the data space
-Inf (default) | scalar | matrix
Lower limit of the data space, specified as a scalar or matrix of the same size as the data space. When
LowerLimit is specified as a scalar, rlNumericSpec applies it to all entries in the data space.
Upper limit of the data space, specified as a scalar or matrix of the same size as the data space. When
UpperLimit is specified as a scalar, rlNumericSpec applies it to all entries in the data space.
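For example, a minimal sketch of a three-element continuous channel with elementwise limits (the values are arbitrary):
spec = rlNumericSpec([3 1], ...
    LowerLimit=[-1;-Inf;0], ...
    UpperLimit=[1;Inf;10]);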
Information about the type of data, specified as a string, such as "double" or "single".
Object Functions
rlSimulinkEnv Create reinforcement learning environment using dynamic model
implemented in Simulink
rlFunctionEnv Specify custom reinforcement learning environment dynamics
using functions
rlValueFunction Value function approximator object for reinforcement learning
agents
rlQValueFunction Q-Value function approximator object for reinforcement learning
agents
rlVectorQValueFunction Vector Q-value function approximator for reinforcement learning
agents
rlContinuousDeterministicActor Deterministic actor with a continuous action space for
reinforcement learning agents
rlDiscreteCategoricalActor Stochastic categorical actor with a discrete action space for
reinforcement learning agents
rlContinuousGaussianActor Stochastic Gaussian actor with a continuous action space for
reinforcement learning agents
Examples
For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.
Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.
The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.
obsInfo = rlNumericSpec([3 1])
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: [0×0 string]
Description: [0×0 string]
Dimension: [3 1]
DataType: "double"
The action is a scalar expressing the torque, and it can take one of three possible values: -2, 0, or
2 Nm.
actInfo = rlFiniteSetSpec([-2 0 2])
actInfo =
rlFiniteSetSpec with properties:
You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';
Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
mdl = 'rlSimplePendulumModel';
agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
env.ResetFcn = @(in) setVariable(in,'theta0',randn,'Workspace',mdl)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on
Version History
Introduced in R2019a
See Also
rlFiniteSetSpec | rlSimulinkEnv | getActionInfo | getObservationInfo |
rlValueRepresentation | rlQValueRepresentation |
rlDeterministicActorRepresentation | rlStochasticActorRepresentation |
rlFunctionEnv
Topics
“Train DDPG Agent for Adaptive Cruise Control”
rlOptimizerOptions
Optimization options for actors and critics
Description
Use an rlOptimizerOptions object to specify an optimization options set for actors and critics.
Creation
Syntax
optOpts = rlOptimizerOptions
optOpts = rlOptimizerOptions(Name=Value)
Description
Properties
LearnRate — Learning rate used in training the actor or critic function approximator
0.01 (default) | positive scalar
Learning rate used in training the actor or critic function approximator, specified as a positive scalar.
If the learning rate is too low, then training takes a long time. If the learning rate is too high, then
training might reach a suboptimal result or diverge.
Example: LearnRate=0.025
GradientThreshold — Gradient threshold value for the training of the actor or critic
function approximator
Inf (default) | positive scalar
Gradient threshold value used in training the actor or critic function approximator, specified as Inf
or a positive scalar. If the gradient exceeds this value, the gradient is clipped as specified by the
GradientThresholdMethod option. Clipping the gradient limits how much the network parameters
can change in a training iteration.
Example: GradientThreshold=1
Gradient threshold method used in training the actor or critic function approximator. This is the
specific method used to clip gradient values that exceed the gradient threshold, and it is specified as
one of the following values.
For more information, see “Gradient Clipping” in the Algorithms section of trainingOptions in
Deep Learning Toolbox.
Example: GradientThresholdMethod="absolute-value"
Factor for L2 regularization (weight decay) used in training the actor or critic function approximator,
specified as a nonnegative scalar. For more information, see “L2 Regularization” in the Algorithms
section of trainingOptions in Deep Learning Toolbox.
To avoid overfitting when using a representation with many parameters, consider increasing the
L2RegularizationFactor option.
Example: L2RegularizationFactor=0.0005
Algorithm used for training the actor or critic function approximator, specified as one of the following
values.
• "adam" — Use the Adam (adaptive moment estimation) algorithm. You can specify the decay
rates of the gradient and squared gradient moving averages using the GradientDecayFactor
and SquaredGradientDecayFactor fields of the OptimizerParameters option.
• "sgdm" — Use the stochastic gradient descent with momentum (SGDM) algorithm. You can
specify the momentum value using the Momentum field of the OptimizerParameters option.
• "rmsprop" — Use the RMSProp algorithm. You can specify the decay rate of the squared gradient
moving average using the SquaredGradientDecayFactor field of the OptimizerParameters
option.
For more information about these algorithms, see “Stochastic Gradient Descent” in the Algorithms
section of trainingOptions in Deep Learning Toolbox.
Example: Optimizer="sgdm"
OptimizerParameters — Parameters for the training algorithm used for training the actor
or critic function approximator
OptimizerParameters object
Parameters for the training algorithm used for training the actor or critic function approximator,
specified as an OptimizerParameters object with the following parameters.
Momentum — Contribution of the previous step, specified as a scalar from 0 to 1. A value of 0 means
no contribution from the previous step. A value of 1 means maximal contribution.
To change property values, create an rlOptimizerOptions object and use dot notation to access
and change the properties of OptimizerParameters.
optOpts = rlOptimizerOptions;
optOpts.OptimizerParameters.GradientDecayFactor = 0.95;
Object Functions
rlQAgentOptions Options for Q-learning agent
rlSARSAAgentOptions Options for SARSA agent
rlDQNAgentOptions Options for DQN agent
rlPGAgentOptions Options for PG agent
rlDDPGAgentOptions Options for DDPG agent
rlTD3AgentOptions Options for TD3 agent
rlACAgentOptions Options for AC agent
Examples
Use rlOptimizerOptions to create a default optimizer option object to use for the training of a
critic function approximator.
myCriticOpts = rlOptimizerOptions
myCriticOpts =
rlOptimizerOptions with properties:
LearnRate: 0.0100
GradientThreshold: Inf
GradientThresholdMethod: "l2norm"
L2RegularizationFactor: 1.0000e-04
Algorithm: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]
Using dot notation, change the training algorithm to stochastic gradient descent with momentum and
set the value of the momentum parameter to 0.6.
myCriticOpts.Algorithm = "sgdm";
myCriticOpts.OptimizerParameters.Momentum = 0.6;
Create an AC agent option object, and set its CriticOptimizerOptions property to myCriticOpts.
myAgentOpt = rlACAgentOptions;
myAgentOpt.CriticOptimizerOptions = myCriticOpts;
You can now use myAgentOpt as the last input argument to rlACAgent when creating your AC agent.
Use rlOptimizerOptions to create an optimizer option object to use for the training of an actor
function approximator. Specify a learning rate of 0.2 and set the GradientThresholdMethod to
"absolute-value".
myActorOpts=rlOptimizerOptions(LearnRate=0.2, ...
GradientThresholdMethod="absolute-value")
myActorOpts =
rlOptimizerOptions with properties:
LearnRate: 0.2000
GradientThreshold: Inf
GradientThresholdMethod: "absolute-value"
L2RegularizationFactor: 1.0000e-04
Algorithm: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]
Using dot notation, change the gradient threshold to 10.
myActorOpts.GradientThreshold = 10;
Create an AC agent option object, and set its ActorOptimizerOptions property to myActorOpts.
myAgentOpt = rlACAgentOptions;
myAgentOpt.ActorOptimizerOptions = myActorOpts;
You can now use myAgentOpt as the last input argument to rlACAgent when creating your AC agent.
Version History
Introduced in R2022a
See Also
Functions
rlOptimizer
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlPGAgent
Policy gradient reinforcement learning agent
Description
The policy gradient (PG) algorithm is a model-free, online, on-policy reinforcement learning method.
A PG agent is a policy-based reinforcement learning agent that uses the REINFORCE algorithm to
directly compute an optimal policy which maximizes the long-term reward. The action space can be
either discrete or continuous.
For more information on PG agents and the REINFORCE algorithm, see “Policy Gradient Agents”. For
more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
agent = rlPGAgent(observationInfo,actionInfo)
agent = rlPGAgent(observationInfo,actionInfo,initOpts)
agent = rlPGAgent(actor)
agent = rlPGAgent(actor,critic)
Description
Create Agent from Observation and Action Specifications
agent = rlPGAgent(actor) creates a PG agent with the specified actor network. By default, the
UseBaseline property of the agent is false in this case.
agent = rlPGAgent(actor,critic) creates a PG agent with the specified actor and critic
networks. By default, the UseBaseline option is true in this case.
Specify Agent Options
agent = rlPGAgent( ___ ,agentOptions) creates a PG agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
Input Arguments
actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object
Baseline critic that estimates the discounted long-term reward, specified as an rlValueFunction
object. For more information on creating critic approximators, see “Create Policies and Value
Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.
For a continuous action space, you must specify actionInfo as an rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.
Note This option affects only simulation and deployment; it does not affect training.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified to execute every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
Examples
Create Discrete Policy Gradient Agent from Observation and Action Specifications
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to the pole).
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a policy gradient agent from the environment observation and action specifications.
agent = rlPGAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a policy gradient agent from the environment observation and action specifications.
agent = rlPGAgent(obsInfo,actInfo,initOpts);
Extract the deep neural networks from both the agent actor and critic.
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons.
criticNet.Layers
ans =
11x1 Layer array with layers:
Plot actor and critic networks, and display their number of weights.
plot(layerGraph(actorNet))
summary(actorNet)
Initialized: true
Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features
plot(layerGraph(criticNet))
summary(criticNet)
Initialized: true
Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Train PG Agent with
Baseline to Control Double Integrator System”. The observation from the environment is a vector
containing the position and velocity of a mass. The action is a scalar representing a force, applied to
the mass, having three possible values (-2, 0, or 2 Newton).
env = rlPredefinedEnv("DoubleIntegrator-Discrete");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
Elements: [-2 0 2]
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
For policy gradient agents, the baseline critic estimates a value function; therefore, it must take the
observation signal as input and return a scalar value.
Define the network as an array of layer objects, and get the dimension of the observation space from
the environment specification object.
baselineNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(1)];
baselineNet = dlnetwork(baselineNet);
summary(baselineNet)
Initialized: true
Inputs:
1 'input' 2 features
Create a critic to use as a baseline. Policy gradient agents use an rlValueFunction object to
implement the critic.
baseline = rlValueFunction(baselineNet,obsInfo);
getValue(baseline,{rand(obsInfo.Dimension)})
ans = single
-0.1204
To approximate the policy within the actor, use a deep neural network. For policy gradient agents, the
actor executes a stochastic policy, which for discrete action spaces is implemented by a discrete
categorical actor. In this case the network must take the observation signal as input and return a
probability for each action. Therefore the output layer must have as many elements as the number of
possible actions.
Define the network as an array of layer objects, and get the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'input' 2 features
Create the actor using rlDiscreteCategoricalActor, as well as the observation and action
specifications.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
Create the PG agent using the actor and the baseline critic.
agent = rlPGAgent(actor,baseline)
agent =
rlPGAgent with properties:
Specify options for the agent, including training options for the actor and critic.
agent.AgentOptions.UseBaseline = true;
agent.AgentOptions.DiscountFactor = 0.99;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Store these limits so you can easily access them later.
actInfo.LowerLimit = -2;
actInfo.UpperLimit = 2;
For policy gradient agents, the baseline critic estimates a value function; therefore, it must take the
observation signal as input and return a scalar value. To approximate the value function within the
baseline, use a neural network.
Define the network as an array of layer objects, and get the dimensions of the observation space from
the environment specification object.
baselineNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(1)];
baselineNet = dlnetwork(baselineNet);
summary(baselineNet)
Initialized: true
Inputs:
1 'input' 2 features
Create a critic to use as a baseline. Policy gradient agents use an rlValueFunction object to
implement the critic.
baseline = rlValueFunction(baselineNet,obsInfo);
getValue(baseline,{rand(obsInfo.Dimension)})
ans = single
-0.1204
To approximate the policy within the actor, use a deep neural network as approximation model. For
policy gradient agents, the actor executes a stochastic policy, which for continuous action spaces is
implemented by a continuous Gaussian actor. In this case the network must take the observation
signal as input and return both a mean value and a standard deviation value for each action.
Therefore, it must have two output layers (one for the mean values, the other for the standard
deviation values), each having as many elements as the dimension of the action space.
Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input layers, so
you can later explicitly associate them with the appropriate environment channel.
% Input path
inPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="obs_in")
fullyConnectedLayer(32)
reluLayer(Name="ip_out") ];
% Mean path
meanPath = [
fullyConnectedLayer(16,Name="mp_fc1")
reluLayer
fullyConnectedLayer(1)
tanhLayer(Name="tanh"); % range: -1,1
scalingLayer(Name="mp_out", ...
Scale=actInfo.UpperLimit) ]; % range: -2,2
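The standard deviation path and the network assembly are not shown in this excerpt. A minimal sketch follows; the layer names "sp_fc1" and "sp_out" are assumptions chosen to match the output name used when creating the actor below.
% Standard deviation path (sketch)
sdevPath = [
    fullyConnectedLayer(16,Name="sp_fc1")
    reluLayer
    fullyConnectedLayer(1)
    softplusLayer(Name="sp_out") ]; % enforce nonnegativity
% Assemble the network and connect the input path to both output paths
actorNet = layerGraph(inPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,sdevPath);
actorNet = connectLayers(actorNet,"ip_out","mp_fc1");
actorNet = connectLayers(actorNet,"ip_out","sp_fc1");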
% plot network
plot(actorNet)
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'obs_in' 2 features
Create the actor using rlContinuousGaussianActor, together with actorNet, the observation
and action specifications, as well as the names of the network input and output layers.
actor = rlContinuousGaussianActor(actorNet, ...
obsInfo,actInfo, ...
ObservationInputNames="obs_in", ...
ActionMeanOutputNames="mp_out", ...
ActionStandardDeviationOutputNames="sp_out");
Create the PG agent using the actor and the baseline critic.
agent = rlPGAgent(actor,baseline)
agent =
rlPGAgent with properties:
Specify options for the agent, including training options for the actor and critic.
agent.AgentOptions.UseBaseline = true;
agent.AgentOptions.DiscountFactor = 0.99;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
For this example, load the environment used in the example “Train PG Agent with Baseline to Control
Double Integrator System”. The observation from the environment is a vector containing the position
and velocity of a mass. The action is a scalar representing a force, applied to the mass, having three
possible values (-2, 0, or 2 Newton).
env = rlPredefinedEnv("DoubleIntegrator-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a critic to use as a baseline. For policy gradient agents, the baseline critic estimates a value
function, therefore it must take the observation signal as input and return a scalar value. To
approximate the value function within the critic, use a recurrent neural network.
Define the network as an array of layer objects, and get the dimension of the observation space from
the environment specification object. To create a recurrent neural network for the critic, use
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.
baselineNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(32)
reluLayer
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object and display a summary.
baselineNet = dlnetwork(baselineNet);
summary(baselineNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 2 dimensions
Create the critic based on the network approximator model. Policy gradient agents use an
rlValueFunction object to implement the critic.
baseline = rlValueFunction(baselineNet,obsInfo);
To check the baseline critic, return the value of a random observation.
getValue(baseline,{rand(obsInfo.Dimension)})
ans = single
-0.0065
Since the critic has a recurrent network, the actor must have a recurrent network too. Define a
recurrent neural network for the actor. For policy gradient agents, the actor executes a stochastic
policy, which for discrete action spaces is implemented by a discrete categorical actor. In this case
the network must take the observation signal as input and return a probability for each action.
Therefore the output layer must have as many elements as the number of possible actions.
Define the network as an array of layer objects, and get the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
lstmLayer(32)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))];
Convert the network to a dlnetwork object and display a summary.
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 2 dimensions
Create the actor. Policy gradient agents use stochastic actors, which for discrete action spaces are
implemented by rlDiscreteCategoricalActor objects.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options, including training options for the actor and the critic.
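The optimizer option objects referenced below are rlOptimizerOptions objects; the learning rate and
gradient threshold values shown here are illustrative.
baselineOpts = rlOptimizerOptions(LearnRate=5e-3,GradientThreshold=1);
actorOpts = rlOptimizerOptions(LearnRate=5e-3,GradientThreshold=1);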
agentOpts = rlPGAgentOptions(...
'UseBaseline',true, ...
'DiscountFactor', 0.99, ...
'CriticOptimizerOptions',baselineOpts, ...
'ActorOptimizerOptions', actorOpts);
Create a PG agent using the actor, the baseline critic, and the agent options object.
agent = rlPGAgent(actor,baseline,agentOpts);
For PG agents with recurrent neural networks, the training sequence length is the whole episode.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Tips
• For continuous action spaces, the rlPGAgent agent does not enforce the constraints set by the
action specification, so you must enforce action space constraints within the environment.
Version History
Introduced in R2019a
See Also
rlAgentInitializationOptions | rlPGAgentOptions | rlQValueFunction |
rlDiscreteCategoricalActor | rlContinuousGaussianActor | Deep Network Designer
Topics
“Policy Gradient Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlPGAgentOptions
Options for PG agent
Description
Use an rlPGAgentOptions object to specify options for policy gradient (PG) agents. To create a PG
agent, use rlPGAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlPGAgentOptions
opt = rlPGAgentOptions(Name,Value)
Description
Properties
UseBaseline — Use baseline for learning
true (default) | false
Option to use baseline for learning, specified as a logical value. When UseBaseline is true, you
must specify a critic network as the baseline function approximator.
In general, for simpler problems with smaller actor networks, PG agents work better without a
baseline.
Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.
When gradients are computed during training, an additional gradient component is computed for
minimizing this loss function.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlPGAgent Policy gradient reinforcement learning agent
Examples
This example shows how to create and modify a PG agent options object.
opt = rlPGAgentOptions('DiscountFactor',0.9)
opt =
rlPGAgentOptions with properties:
UseBaseline: 1
EntropyLossWeight: 0
ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 1
DiscountFactor: 0.9000
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Version History
Introduced in R2019a
The UseDeterministicExploitation option is no longer recommended. Use the UseExplorationPolicy
property of the agent instead.
• To force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment, previously you would set:
agent.AgentOptions.UseDeterministicExploitation = true;
Now, set:
agent.UseExplorationPolicy = false;
• To allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space, previously
you would set:
agent.AgentOptions.UseDeterministicExploitation = false;
Now, set:
agent.UseExplorationPolicy = true;
See Also
Topics
“Policy Gradient Agents”
rlPPOAgent
Proximal policy optimization reinforcement learning agent
Description
Proximal policy optimization (PPO) is a model-free, online, on-policy, policy gradient reinforcement
learning method. This algorithm alternates between sampling data through environmental interaction
and optimizing a clipped surrogate objective function using stochastic gradient descent. The action
space can be either discrete or continuous.
For more information on PPO agents, see “Proximal Policy Optimization Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.
Creation
Syntax
agent = rlPPOAgent(observationInfo,actionInfo)
agent = rlPPOAgent(observationInfo,actionInfo,initOpts)
agent = rlPPOAgent(actor,critic)
Description
agent = rlPPOAgent(actor,critic) creates a PPO agent with the specified actor and critic,
using the default options for the agent.
agent = rlPPOAgent( ___ ,agentOptions) creates a PPO agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
Input Arguments
actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object
critic — Critic
rlValueFunction object
Critic that estimates the discounted long-term reward, specified as an rlValueFunction object. For
more information on creating critic approximators, see “Create Policies and Value Functions”.
Your critic can use a recurrent neural network as its function approximator. In this case, your actor
must also use a recurrent neural network. For an example, see “Create PPO Agent with Recurrent
Neural Networks” on page 3-240.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.
For a continuous action space, you must specify actionInfo as an rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.
Note This option affects only simulation and deployment; it does not affect training.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
Examples
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a PPO agent from the environment observation and action specifications.
agent = rlPPOAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a PPO actor-critic agent from the environment observation and action specifications.
agent = rlPPOAgent(obsInfo,actInfo,initOpts);
Extract the deep neural networks from both the agent actor and critic.
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
Display the layers of the critic network, and verify that each hidden fully connected layer has 128
neurons
criticNet.Layers
ans =
11x1 Layer array with layers:
plot(layerGraph(actorNet))
plot(layerGraph(criticNet))
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment interface, and obtain its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For PPO agents, the critic estimates a value function, therefore it must take the observation signal as
input and return a scalar value. Create a deep neural network to be used as approximation model
within the critic. Define the network as an array of layer objects.
criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)
];
Convert the network to a dlnetwork object and display a summary.
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'input' 4 features
Create the critic using criticNet. PPO agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(criticNet,obsInfo);
To check the critic, return the value of a random observation.
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.2479
To approximate the policy within the actor, use a neural network. For PPO agents, the actor executes
a stochastic policy, which for discrete action spaces is implemented by a discrete categorical actor. In
this case the approximator must take the observation signal as input and return a probability for each
action. Therefore the output layer must have as many elements as the number of possible actions.
Define the network as an array of layer objects, getting the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(200)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];
Convert the network to a dlnetwork object and display a summary.
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor using actorNet. PPO agents use an rlDiscreteCategoricalActor object to
implement the actor for discrete action spaces.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
agent = rlPPOAgent(actor,critic)
agent =
rlPPOAgent with properties:
Specify agent options, including training options for the actor and the critic.
agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent against the environment.
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”. The observation from the
environment is a vector containing the position and velocity of a mass. The action is a scalar
representing a force, applied to the mass, ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Set these limits in the action specification so you can easily access them later.
actInfo.LowerLimit=-2;
actInfo.UpperLimit=2;
The actor and critic networks are initialized randomly. Ensure reproducibility by fixing the seed of the
random generator.
rng(0)
For PPO agents, the critic estimates a value function, therefore it must take the observation signal as
input and return a scalar value. To approximate the value function within the critic, use a neural
network. Define the network as an array of layer objects.
criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object and display a summary.
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'input' 2 features
Create the critic using criticNet. PPO agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(criticNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.0899
To approximate the policy within the actor, use a neural network. For PPO agents, the actor executes
a stochastic policy, which for continuous action spaces is implemented by a continuous Gaussian
actor. In this case the network must take the observation signal as input and return both a mean
value and a standard deviation value for each action. Therefore, it must have two output layers (one
for the mean values, the other for the standard deviation values), each having as many elements as
the dimension of the action space.
Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces, and the action range limits from the environment specification objects. Specify a name
for the input and output layers, so you can later explicitly associate them with the appropriate
environment channel.
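A minimal sketch of the three paths and their assembly into a layer graph follows. The layer sizes are
illustrative; the names "comPathIn", "comPathOut", "meanPathIn", and "stdPathIn" match the
connection and summary steps below, while the output names "meanPathOut" and "stdPathOut" are
assumed names reused when creating the actor.
% Common input path
commonPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="comPathIn")
fullyConnectedLayer(100)
reluLayer(Name="comPathOut") ];
% Mean value path (tanh plus scaling keeps the mean within the action range)
meanPath = [
fullyConnectedLayer(32,Name="meanPathIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
tanhLayer
scalingLayer(Name="meanPathOut",Scale=actInfo.UpperLimit) ];
% Standard deviation path (softplus output keeps the values nonnegative)
stdPath = [
fullyConnectedLayer(32,Name="stdPathIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
softplusLayer(Name="stdPathOut") ];
% Assemble the paths into a layer graph
actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);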
% Connect paths
actorNet = connectLayers(actorNet,"comPathOut","meanPathIn/in");
actorNet = connectLayers(actorNet,"comPathOut","stdPathIn/in");
% Plot network
plot(actorNet)
Convert the network to a dlnetwork object and display a summary.
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'comPathIn' 2 features
Create the actor using actorNet. PPO agents use an rlContinuousGaussianActor object to
implement the actor for continuous action spaces. Specify the names of the network input and output
layers (the output layer names here follow the sketch above).
actor = rlContinuousGaussianActor(actorNet,obsInfo,actInfo, ...
ObservationInputNames="comPathIn", ...
ActionMeanOutputNames="meanPathOut", ...
ActionStandardDeviationOutputNames="stdPathOut");
getAction(actor,{rand(obsInfo.Dimension)})
agent = rlPPOAgent(actor,critic)
agent =
rlPPOAgent with properties:
Specify agent options, including training options for the actor and the critic.
agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
For this example, load the predefined environment used for the “Train DQN Agent to Balance Cart-
Pole System” example.
env = rlPredefinedEnv("CartPole-Discrete");
Get observation and action information. This environment has a continuous four-dimensional
observation space (the positions and velocities of both the cart and the pole) and a discrete one-dimensional
action space consisting of the application of two possible forces, -10 N or 10 N.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For PPO agents, the critic estimates a value function, therefore it must take the observation signal as
input and return a scalar value. To approximate the value function within the critic, use a neural
network.
Define the network as an array of layer objects, and get the dimensions of the observation space from
the environment specification object. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.
criticNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(8)
reluLayer
lstmLayer(8)
fullyConnectedLayer(1)];
Convert the network to a dlnetwork object and display a summary.
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
Create the critic using criticNet. PPO agents use an rlValueFunction object to implement
the critic.
critic = rlValueFunction(criticNet,obsInfo);
To check the critic, return the value of a random observation.
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
0.0017
Since the critic has a recurrent network, the actor must have a recurrent network too. For PPO
agents, the actor executes a stochastic policy, which for discrete action spaces is implemented by a
discrete categorical actor. In this case the network must take the observation signal as input and
return a probability for each action. Therefore the output layer must have as many elements as the
number of possible actions.
Define the network as an array of layer objects, and get the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
lstmLayer(8)
fullyConnectedLayer(numel(actInfo.Elements))
softmaxLayer
];
Convert the network to a dlnetwork object and display the number of learnable parameters.
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Number of learnables: 4k
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
Create the actor using actorNet. PPO agents use an rlDiscreteCategoricalActor object
to implement the actor for discrete action spaces.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
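Specify training options for the actor and the critic. The optimizer option objects referenced below
are rlOptimizerOptions objects; the learning rate and gradient threshold values shown here are
illustrative.
actorOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
criticOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);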
agentOptions = rlPPOAgentOptions(...
AdvantageEstimateMethod="finite-horizon", ...
ClipFactor=0.1, ...
CriticOptimizerOptions=criticOptions, ...
ActorOptimizerOptions=actorOptions);
When recurrent neural networks are used, the MiniBatchSize property is the length of the learning
trajectory.
agentOptions.MiniBatchSize
ans = 128
Create the agent using the actor and critic, as well as the agent options object.
agent = rlPPOAgent(actor,critic,agentOptions);
getAction(agent,{rand(obsInfo.Dimension)})
Tips
• For continuous action spaces, this agent does not enforce the constraints set by the action
specification. In this case, you must enforce action space constraints within the environment.
• While tuning the learning rate of the actor network is necessary for PPO agents, it is not
necessary for TRPO agents.
Version History
Introduced in R2019b
See Also
rlAgentInitializationOptions | rlPPOAgentOptions | rlValueFunction |
rlDiscreteCategoricalActor | rlContinuousGaussianActor | Deep Network Designer
Topics
“Proximal Policy Optimization Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlPPOAgentOptions
Options for PPO agent
Description
Use an rlPPOAgentOptions object to specify options for proximal policy optimization (PPO) agents.
To create a PPO agent, use rlPPOAgent.
For more information on PPO agents, see “Proximal Policy Optimization Agents”.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlPPOAgentOptions
opt = rlPPOAgentOptions(Name,Value)
Description
Properties
ExperienceHorizon — Number of steps the agent interacts with the environment before
learning
512 (default) | positive integer
Number of steps the agent interacts with the environment before learning from its experience,
specified as a positive integer.
The ExperienceHorizon value must be greater than or equal to the MiniBatchSize value.
Mini-batch size used for each learning epoch, specified as a positive integer. When the agent uses a
recurrent neural network, MiniBatchSize is treated as the training trajectory length.
The MiniBatchSize value must be less than or equal to the ExperienceHorizon value.
Clip factor for limiting the change in each policy update step, specified as a positive scalar less than
1.
Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.
When gradients are computed during training, an additional gradient component is computed for
minimizing this loss function. For more information, see “Entropy Loss”.
Number of epochs for which the actor and critic networks learn from the current experience set,
specified as a positive integer.
For more information on these methods, see the training algorithm information in “Proximal Policy
Optimization Agents”.
Smoothing factor for generalized advantage estimator, specified as a scalar value between 0 and 1,
inclusive. This option applies only when the AdvantageEstimateMethod option is "gae".
Method for normalizing advantage function values, specified as one of the following:
In some environments, you can improve agent performance by normalizing the advantage function
during training. The agent normalizes the advantage function by subtracting the mean advantage
value and scaling by the standard deviation.
Window size for normalizing advantage function values, specified as a positive integer. Use this
option when the NormalizedAdvantageMethod option is "moving".
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlPPOAgent Proximal policy optimization reinforcement learning agent
Examples
This example shows how to create and modify a PPO agent options object.
opt = rlPPOAgentOptions('ExperienceHorizon',256)
opt =
rlPPOAgentOptions with properties:
ExperienceHorizon: 256
MiniBatchSize: 128
ClipFactor: 0.2000
EntropyLossWeight: 0.0100
NumEpoch: 3
AdvantageEstimateMethod: "gae"
GAEFactor: 0.9500
NormalizedAdvantageMethod: "none"
AdvantageNormalizingWindow: 1000000
ActorOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 1
DiscountFactor: 0.9900
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Version History
Introduced in R2019b
The UseDeterministicExploitation option is no longer recommended. Use the UseExplorationPolicy
property of the agent instead.
• To force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment, previously you would set:
agent.AgentOptions.UseDeterministicExploitation = true;
Now, set:
agent.UseExplorationPolicy = false;
• To allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space, previously
you would set:
agent.AgentOptions.UseDeterministicExploitation = false;
Now, set:
agent.UseExplorationPolicy = true;
See Also
Topics
“Proximal Policy Optimization Agents”
rlPrioritizedReplayMemory
Replay memory experience buffer with prioritized sampling
Description
An off-policy reinforcement learning agent stores experiences in a circular experience buffer. During
training, the agent samples mini-batches of experiences from the buffer and uses these mini-batches
to update its actor and critic function approximators.
By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object
as their experience buffer. Agents uniformly sample data from this buffer. To perform nonuniform
prioritized sampling [1], which can improve sample efficiency when training your agent, use an
rlPrioritizedReplayMemory object. For more information on prioritized sampling, see
“Algorithms” on page 3-251.
Creation
Syntax
buffer = rlPrioritizedReplayMemory(obsInfo,actInfo)
buffer = rlPrioritizedReplayMemory(obsInfo,actInfo,maxLength)
Description
Input Arguments
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Properties
MaxLength — Maximum buffer length
10000 (default) | positive integer
Priority exponent to control the impact of prioritization during probability computation, specified as a
nonnegative scalar less than or equal to 1.
Initial value of the importance sampling exponent, specified as a nonnegative scalar less than or
equal to 1.
Number of annealing steps for updating the importance sampling exponent, specified as a positive
integer.
Current value of the importance sampling exponent, specified as a nonnegative scalar less than or
equal to 1.
Object Functions
append Append experiences to replay memory buffer
Examples
Create an environment for training the agent. For this example, load a predefined environment.
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
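The example needs an agent with an experience buffer. A minimal sketch, assuming a default DQN
agent created from the environment specifications (any built-in off-policy agent would work):
% Extract specifications and create a default DQN agent
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
agent = rlDQNAgent(obsInfo,actInfo);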
By default, the agent uses a replay memory experience buffer with uniform sampling.
Replace the default experience buffer with a prioritized replay memory buffer.
agent.ExperienceBuffer = rlPrioritizedReplayMemory(obsInfo,actInfo);
Configure the prioritized replay memory options. For example, set the priority exponent to 0.5, the
initial importance sampling exponent to 0.5, and the number of annealing steps for updating the
exponent during training to 1e4.
agent.ExperienceBuffer.NumAnnealingSteps = 1e4;
agent.ExperienceBuffer.PriorityExponent = 0.5;
agent.ExperienceBuffer.InitialImportanceSamplingExponent = 0.5;
Limitations
• Prioritized experience replay does not support agents that use recurrent neural networks.
Algorithms
Prioritized replay memory samples experiences according to experience priorities. For a given
experience, the priority is defined as the absolute value of the associated temporal difference (TD)
error. A larger TD error indicates that the critic network is not well-trained for the corresponding
experience. Therefore, sampling such experiences during critic updates can help efficiently improve
the critic performance, which often improves the sample efficiency of agent training.
When using prioritized replay memory, agents use the following process when sampling a mini-batch
of experiences and updating a critic.
1 Compute the sampling probability P(j) of each experience in the buffer from its priority p_j:
P(j) = p_j^α / ∑_{i=1}^{N} p_i^α
Here, N is the number of experiences in the buffer and α is the priority exponent (PriorityExponent).
2 Sample a mini-batch of experiences according to these probabilities and compute the normalized
importance sampling weights, where β is the importance sampling exponent:
w′(j) = (N ⋅ P(j))^(−β)
w(j) = w′(j) / max_{i ∈ mini-batch} w′(i)
3 Weight the critic loss of each sampled experience by w(j), update the critic, and set the priority of
each sampled experience to the absolute value of its TD error:
p_j = |δ_j|
4 Update the importance sampling exponent β by linearly annealing the exponent value until it
reaches 1:
β ← β + (1 − β_0)/N_S
Here, β_0 is the initial importance sampling exponent (InitialImportanceSamplingExponent) and N_S
is the number of annealing steps (NumAnnealingSteps).
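As a concrete sketch of the probability and weight computation (all values illustrative), in MATLAB:
% Hypothetical priorities (absolute TD errors) for five buffered experiences
p = [0.5 2.0 0.1 1.2 0.7];
priorityExponent = 0.6;                              % alpha
P = p.^priorityExponent / sum(p.^priorityExponent);  % sampling probabilities, sum(P) == 1
isExponent = 0.4;                                    % beta
N = numel(p);
w = (N*P).^(-isExponent);                            % raw importance sampling weights
w = w / max(w);                                      % normalized importance sampling weights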
Version History
Introduced in R2022b
References
[1] Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. "Prioritized Experience Replay."
arXiv:1511.05952 [cs], 25 February 2016. https://arxiv.org/abs/1511.05952.
See Also
rlReplayMemory
rlQAgent
Q-learning reinforcement learning agent
Description
The Q-learning algorithm is a model-free, online, off-policy reinforcement learning method. A Q-
learning agent is a value-based reinforcement learning agent which trains a critic to estimate the
return or future rewards.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
agent = rlQAgent(critic,agentOptions)
Description
Input Arguments
critic — Critic
rlQValueFunction object
Critic, specified as an rlQValueFunction object. For more information on creating critics, see
“Create Policies and Value Functions”.
Properties
AgentOptions — Agent options
rlQAgentOptions object
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create an environment interface. For this example, use the same environment as in the example
“Train Reinforcement Learning Agent in Basic Grid World”.
env = rlPredefinedEnv("BasicGridWorld");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a table approximation model derived from the environment observation and action
specifications.
qTable = rlTable(obsInfo,actInfo);
Create the critic using qTable. Q-learning agents use an rlQValueFunction object to implement the critic.
critic = rlQValueFunction(qTable,obsInfo,actInfo);
Create a Q-learning agent using the specified critic and an epsilon value of 0.05.
opt = rlQAgentOptions;
opt.EpsilonGreedyExploration.Epsilon = 0.05;
agent = rlQAgent(critic,opt)
agent =
rlQAgent with properties:
To check your agent, use getAction to return the action from a random observation.
act = getAction(agent,{randi(numel(obsInfo.Elements))});
act{1}
ans = 1
You can now test and train the agent against the environment.
Version History
Introduced in R2019a
See Also
Functions
rlQAgentOptions | rlQValueFunction
Topics
“Q-Learning Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlQAgentOptions
Options for Q-learning agent
Description
Use an rlQAgentOptions object to specify options for creating Q-learning agents. To create a Q-
learning agent, use rlQAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlQAgentOptions
opt = rlQAgentOptions(Name,Value)
Description
Properties
EpsilonGreedyExploration — Options for epsilon-greedy exploration
EpsilonGreedyExploration object
At the end of each training time step, if Epsilon is greater than EpsilonMin, then it is updated
using the following formula.
Epsilon = Epsilon*(1-EpsilonDecay)
If your agent converges on local optima too quickly, you can promote agent exploration by increasing
Epsilon.
To specify exploration options, use dot notation after creating the rlQAgentOptions object opt. For
example, set the epsilon value to 0.9.
opt.EpsilonGreedyExploration.Epsilon = 0.9;
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlQAgent Q-learning reinforcement learning agent
Examples
This example shows how to create an options object for a Q-Learning agent.
opt = rlQAgentOptions('SampleTime',0.5)
opt =
rlQAgentOptions with properties:
You can modify options using dot notation. For example, set the agent discount factor to 0.95.
opt.DiscountFactor = 0.95;
Version History
Introduced in R2019a
See Also
Topics
“Q-Learning Agents”
rlQValueFunction
Q-Value function approximator object for reinforcement learning agents
Description
This object implements a Q-value function approximator that you can use as a critic for a
reinforcement learning agent. A Q-value function maps an environment state-action pair to a scalar
value representing the predicted discounted cumulative long-term reward when the agent starts from
the given state and executes the given action. A Q-value function critic therefore needs both the
environment state and an action as inputs. After you create an rlQValueFunction critic, use it to
create an agent such as rlQAgent, rlDQNAgent, rlSARSAAgent, rlDDPGAgent, or rlTD3Agent.
For more information on creating representations, see “Create Policies and Value Functions”.
Creation
Syntax
critic = rlQValueFunction(net,observationInfo,actionInfo)
critic = rlQValueFunction(tab,observationInfo,actionInfo)
critic = rlQValueFunction({basisFcn,W0},observationInfo,actionInfo)
Description
Input Arguments
Deep neural network used as the underlying approximator within the critic, specified as one of the
following:
The network must have both the environment observation and action as inputs and a single scalar as
output.
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.
The learnable parameters of the critic are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Q-value table, specified as an rlTable object containing an array with as many rows as the possible
observations and as many columns as the possible actions. The element (s,a) is the expected
cumulative long-term reward for taking action a from observed state s. The elements of this array are
the learnable parameters of the critic.
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the critic is the scalar c = W'*B, where W is a weight vector containing the learnable parameters,
and B is the column vector returned by the custom basis function.
Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo and act is an input with the same
data type and dimension as the environment action channel defined in actionInfo.
For an example on how to use a basis function to create a Q-value function critic with a mixed
continuous and discrete observation space, see “Create Mixed Observation Space Q-Value Function
Critic from Custom Basis Function” on page 3-273.
Example: @(obs1,obs2,act) [act(2)*obs1(1)^2; abs(obs2(5)+act(1))]
Initial value of the basis function weights W, specified as a column vector having the same length as
the vector returned by the basis function.
Name-Value Pair Arguments
Network input layer name corresponding to the environment action channel, specified as a string
array or a cell array of character vectors. The function assigns the environment action channel
specified in actionInfo to the specified network input layer. Therefore, the specified network input
layer must have the same data type and dimensions as defined in actionInfo.
Note The function does not use the name or the description (if any) of the action channel specified in
actionInfo.
This name-value argument is supported only when the approximation model is a deep neural network.
Example: ActionInputNames="myNetOutput_Force"
Network input layer names corresponding to the environment observation channels, specified as a
string array or a cell array of character vectors. The function assigns, in sequential order, each
environment observation channel defined in observationInfo to each specified network input layer.
Therefore, each specified network input layer must have the same data type and dimensions as the
corresponding channel defined in observationInfo.
Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
This name-value argument is supported only when the approximation model is a deep neural network.
Example: ObservationInputNames={"NetInput1_airspeed","NetInput2_altitude"}
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors.
These errors can produce different results compared to performing the same operations on a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning
agent
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlQAgent Q-learning reinforcement learning agent
rlSARSAAgent SARSA reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and
actions
getMaxQValue Obtain maximum estimated value over all possible actions from a Q-value
function critic with discrete action space, given environment observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
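A minimal way to create these two specifications, assuming column-vector channels:
% Continuous 4-D observation space and continuous 2-D action space
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);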
To approximate the Q-value function within the critic, use a deep neural network.
The network must have two inputs, one for the observation and one for the action. The observation
input must accept a four-element vector (the observation vector defined by obsInfo). The action
input must accept a two-element vector (the action vector defined by actInfo). The output of the
network must be a scalar, representing the expected cumulative long-term reward when the agent
starts from the given observation and takes the given action.
You can also obtain the number of observations from the obsInfo specification (regardless of
whether the observation space is a column vector, row vector, or matrix,
prod(obsInfo.Dimension) is its total number of dimensions, in this case four). Similarly,
prod(actInfo.Dimension) is the number of dimensions of the action space, in this case two.
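A minimal sketch of such a network follows, assuming an observation path, an action path, and a
common path joined by a concatenation layer named "cct". The layer sizes are illustrative; the
unnamed input layers receive the default names 'input' and 'input_1' shown in the summary below.
% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(16)
reluLayer(Name="obsout") ];
% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension))
fullyConnectedLayer(16)
reluLayer(Name="actout") ];
% Common path: concatenate the two paths and output a single value
comPath = [
concatenationLayer(1,2,Name="cct")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(1) ];
% Assemble the paths into a layer graph
net = layerGraph(obsPath);
net = addLayers(net,actPath);
net = addLayers(net,comPath);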
% Connect layers
net = connectLayers(net,"obsout","cct/in1");
net = connectLayers(net,"actout","cct/in2");
% Plot network
plot(net)
% Convert to dlnetwork and summarize properties
net = dlnetwork(net);
summary(net)
Initialized: true
Inputs:
1 'input' 4 features
2 'input_1' 2 features
Create the critic with rlQValueFunction, using the network as well as the observations and action
specification objects. When using this syntax, the network input layers are automatically associated
with the components of the observation and action signals according to the dimension specifications
in obsInfo and actInfo.
critic = rlQValueFunction(net,obsInfo,actInfo)
critic =
rlQValueFunction with properties:
To check your critic, use the getValue function to return the value of a random observation and
action, given the current network weights.
v = getValue(critic, ...
{rand(obsInfo.Dimension)}, ...
{rand(actInfo.Dimension)})
v = single
-1.1006
You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent).
Create Q-Value Function Critic from Deep Neural Network Specifying Layer Names
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
To approximate the Q-value function within the critic, use a deep neural network.
The network must have two inputs, one for the observation and one for the action. The observation
input (here called netObsInput) must accept a four-element vector (the observation vector defined
by obsInfo). The action input (here called netActInput) must accept a two-element vector (the
action vector defined by actInfo). The output of the network must be a scalar, representing the
expected cumulative long-term reward when the agent starts from the given observation and takes
the given action.
You can also obtain the number of observations from the obsInfo specification object (regardless of
whether the observation space is a column vector, row vector, or matrix,
prod(obsInfo.Dimension) is its number of dimensions, in this case four). Similarly,
prod(actInfo.Dimension) is the number of dimensions of the action space, in this case two.
To create the neural network paths, use vectors of layer objects. Name the network input layers for
the observation and action netObsInput and netActInput, respectively.
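A minimal sketch of these paths follows; the layer sizes are illustrative, and the paths are joined by a
concatenation layer named "cct" as in the connection step below.
% Observation path with a named input layer
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netObsInput")
fullyConnectedLayer(16)
reluLayer(Name="obsout") ];
% Action path with a named input layer
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="netActInput")
fullyConnectedLayer(16)
reluLayer(Name="actout") ];
% Common path: concatenate the two paths and output a single value
comPath = [
concatenationLayer(1,2,Name="cct")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(1) ];
% Assemble the paths into a layer graph
net = layerGraph(obsPath);
net = addLayers(net,actPath);
net = addLayers(net,comPath);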
% Connect layers
net = connectLayers(net,"obsout","cct/in1");
net = connectLayers(net,"actout","cct/in2");
% Plot network
plot(net)
net = dlnetwork(net);
% Summarize properties
summary(net);
Initialized: true
Inputs:
1 'netObsInput' 4 features
2 'netActInput' 2 features
Create the critic with rlQValueFunction, using the network, the observations and action
specification objects, and the names of the network input layers to be associated with the observation
and action from the environment.
critic = rlQValueFunction(net,...
obsInfo,actInfo, ...
ObservationInputNames="netObsInput",...
ActionInputNames="netActInput")
critic =
rlQValueFunction with properties:
To check your critic, use the getValue function to return the value of a random observation and
action, given the current network weights.
v = getValue(critic, ...
{rand(obsInfo.Dimension)}, ...
{rand(actInfo.Dimension)})
v = single
-1.1006
You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent).
Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set with four possible values.
obsInfo = rlFiniteSetSpec([7 5 3 1]);
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set with two possible values.
actInfo = rlFiniteSetSpec([4 8]);
Create a table to approximate the value function within the critic. rlTable creates a value table
object from the observation and action specifications objects.
qTable = rlTable(obsInfo,actInfo);
The table stores a value (representing the expected cumulative long term reward) for each possible
observation-action pair. Each row corresponds to an observation and each column corresponds to an
action. You can access the table using the Table property of the qTable object. The initial value of
each element is zero.
qTable.Table
ans = 4×2
0 0
0 0
0 0
0 0
You can initialize the table to any value, in this case an array containing the integers from 1 through 8.
qTable.Table=reshape(1:8,4,2)
qTable =
rlTable with properties:
Create the critic using the table as well as the observations and action specification objects.
critic = rlQValueFunction(qTable,obsInfo,actInfo)
critic =
rlQValueFunction with properties:
To check your critic, use the getValue function to return the value of a given observation and action,
using the current table entries.
v = getValue(critic,{5},{8})
v = 6
You can now use the critic (along with an actor) to create a discrete action space agent
relying on a Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations and
actions respectively defined by obsInfo and actInfo.
The output of the critic is the scalar W'*myBasisFcn(myobs,myact), where W is a weight column
vector which must have the same size of the custom basis function output. This output is the expected
cumulative long-term reward when the agent starts from the given observation and takes the given
action. The elements of W are the learnable parameters.
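For instance, a hypothetical four-element basis function consistent with this description (the specific
basis used to produce the value shown later is not reproduced here):
% Hypothetical basis function for illustration (returns a four-element column vector)
myBasisFcn = @(myobs,myact) [myobs(1)^2; myobs(2)+myobs(3); myact(1)*myobs(1); abs(myact(2))];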
W0 = [1;4;4;2];
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second and third arguments are, respectively, the
observation and action specification objects.
critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo)
critic =
rlQValueFunction with properties:
To check your critic, use getValue to return the value of a given observation-action pair, using the
current parameter vector.
v = 252.3926
You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent).
Create Mixed Observation Space Q-Value Function Critic from Custom Basis Function
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels: the first carries a vector over a continuous two-dimensional space, and
the second carries a two-element vector that can assume only four possible values.
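For example, you could define such a mixed observation space as follows; the four admissible values
of the discrete channel are placeholders chosen for illustration.
% First channel: continuous two-dimensional vector
% Second channel: two-element vector restricted to four values
obsInfo = [rlNumericSpec([2 1]) ...
           rlFiniteSetSpec({[1;0],[0;1],[-1;0],[0;-1]})];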
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a discrete set
consisting of three possible actions, 1, 2, and 3.
actInfo = rlFiniteSetSpec({1,2,3});
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations and
the action respectively defined by obsInfo and actInfo. Note that the selected action, as defined,
has only one element, while each observation channel has two elements.
The output of the critic is the scalar W'*myBasisFcn(obsA,obsB,act), where W is a weight column
vector that must have the same size as the custom basis function output. This output is the expected
cumulative long-term reward when the agent starts from the given observation and takes the action
specified as the last input. The elements of W are the learnable parameters.
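Purely as an illustration (the original function is not shown), a basis function with a four-element
output could be defined as follows.
% Hypothetical basis function: obsA and obsB are the two observation
% channels (two elements each), act is the scalar action
myBasisFcn = @(obsA,obsB,act) [ ...
    obsA(1) + act; ...
    obsB(2)^2; ...
    obsA(2)*obsB(1); ...
    abs(obsB(2) - act)];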
W0 = ones(4,1);
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second and third arguments are, respectively, the
observation and action specification objects.
critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo)
critic =
rlQValueFunction with properties:
To check your critic, use the getValue function to return the value of a given observation-action pair,
using the current parameter vector.
v = -0.9000
Note that the critic does not enforce the set constraint for the discrete set elements.
v = -21.0000
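In general, for a critic with two observation channels, getValue takes one cell element per
observation channel followed by the action, for example (placeholder values):
v = getValue(critic,{[0.5;0.3],[1;0]},{2});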
You can now use the critic (along with an actor) to create an agent with a discrete action
space relying on a Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).
Version History
Introduced in R2022a
See Also
Functions
rlValueFunction | rlVectorQValueFunction | rlTable | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlQValueRepresentation
(Not recommended) Q-Value function critic representation for reinforcement learning agents
Description
This object implements a Q-value function approximator to be used as a critic within a reinforcement
learning agent. A Q-value function maps an observation-action pair to a scalar value representing
the total long-term reward that the agent is expected to accumulate when it starts from the given
observation and executes the given action. Q-value function critics therefore
need both observations and actions as inputs. After you create an rlQValueRepresentation critic,
use it to create an agent relying on a Q-value function critic, such as an rlQAgent, rlDQNAgent,
rlSARSAAgent, rlDDPGAgent, or rlTD3Agent. For more information on creating representations,
see “Create Policies and Value Functions”.
Creation
Syntax
critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName,'Action',actName)
critic = rlQValueRepresentation(tab,observationInfo,actionInfo)
critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)
critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName)
critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)
Description
Scalar Output Q-Value Critic
critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName,'Action',actName) creates the Q-value function critic. net is the deep neural
network used as an approximator, and must have both observations and action as inputs, and a single
scalar output. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively
to the inputs observationInfo and actionInfo, containing the observations and action
specifications. obsName must contain the names of the input layers of net that are associated with
the observation specifications. The action name actName must be the name of the input layer of net
that is associated with the action specifications.
critic = rlQValueRepresentation(tab,observationInfo,actionInfo) creates the Q-value
function critic with discrete observation and action spaces from the Q-value table tab. tab is an
rlTable object containing a table with as many rows as the possible observations and as
many columns as the possible actions. This syntax sets the ObservationInfo and ActionInfo properties
of critic respectively to the inputs observationInfo and actionInfo, which must be
rlFiniteSetSpec objects containing the specifications for the discrete observations and action
spaces, respectively.
critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)
creates a Q-value function based critic using a custom basis function as underlying approximator.
The first input argument is a two-element cell array in which the first element contains the handle
basisFcn to a custom basis function, and the second element contains the initial weight vector W0.
Here the basis function must have both observations and action as inputs and W0 must be a column
vector. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively to the
inputs observationInfo and actionInfo.
Multi-Output Discrete Action Space Q-Value Critic
critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',
obsName) creates the multi-output Q-value function critic for a discrete action space. net is the
deep neural network used as an approximator, and must have only the observations as input and a
single output layer having as many elements as the number of possible discrete actions. This syntax
sets the ObservationInfo and ActionInfo properties of critic respectively to the inputs
observationInfo and actionInfo, containing the observations and action specifications. Here,
actionInfo must be an rlFiniteSetSpec object containing the specifications for the discrete
action space. The observation names obsName must be the names of the input layers of net.
critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo)
creates the multi-output Q-value function critic for a discrete action space using a custom basis
function as underlying approximator. The first input argument is a two-element cell array in which the first
element contains the handle basisFcn to a custom basis function, and the second element contains
the initial weight matrix W0. Here the basis function must have only the observations as inputs, and
W0 must have as many columns as the number of possible actions. This syntax sets the
ObservationInfo and ActionInfo properties of critic respectively to the inputs observationInfo
and actionInfo.
Options
critic = rlQValueRepresentation( ___ ,options) creates the Q-value function critic
using the additional option set options, which is an rlRepresentationOptions object. This
syntax sets the Options property of critic to the options input argument. You can use this syntax
with any of the previous input-argument combinations.
Input Arguments
Deep neural network used as the underlying approximator within the critic, specified as one of the
following:
• SeriesNetwork object
• dlnetwork object
For single output critics, net must have both observations and actions as inputs, and a scalar output,
representing the expected cumulative long-term reward when the agent starts from the given
observation and takes the given action. For multi-output discrete action space critics, net must have
only the observations as input and a single output layer having as many elements as the number of
possible discrete actions. Each output element represents the expected cumulative long-term reward
when the agent starts from the given observation and takes the corresponding action. The learnable
parameters of the critic are the weights of the deep neural network.
The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names listed in obsName.
The network output layer must have the same data type and dimension as the signal defined in
ActionInfo. Its name must be the action name specified in actName.
For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the observation input layers in net.
Example: {'my_obs'}
Action name, specified as a single-element cell array that contains a string or character vector. It
must be the name of the action input layer of net.
Example: {'my_act'}
Q-value table, specified as an rlTable object containing an array with as many rows as the possible
observations and as many columns as the possible actions. The element (s,a) is the expected
cumulative long-term reward for taking action a from observed state s. The elements of this array are
the learnable parameters of the critic.
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the critic is c = W'*B, where W is a weight vector or matrix containing the learnable parameters,
and B is the column vector returned by the custom basis function.
For a single-output Q-value critic, c is a scalar representing the expected cumulative long term
reward when the agent starts from the given observation and takes the given action. In this case,
your basis function must have the following signature.
B = myBasisFunction(obs1,obs2,...,obsN,act)
For a multiple-output Q-value critic with a discrete action space, c is a vector in which each element
is the expected cumulative long term reward when the agent starts from the given observation and
takes the action corresponding to the position of the considered element. In this case, your basis
function must have the following signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are observations in the same order and with the same data type and dimensions
as the signals defined in observationInfo and act has the same data type and dimensions as the
action specifications in actionInfo.
Example: @(obs1,obs2,act) [act(2)*obs1(1)^2; abs(obs2(5)+act(1))]
Initial value of the basis function weights, W. For a single-output Q-value critic, W is a column vector
having the same length as the vector returned by the basis function. For a multiple-output Q-value
critic with a discrete action space, W is a matrix which must have as many rows as the length of the
basis function output, and as many columns as the number of possible actions.
Properties
Options — Representation options
rlRepresentationOptions object
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.
Object Functions
rlDDPGAgent Deep deterministic policy gradient (DDPG) reinforcement learning agent
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlQAgent Q-learning reinforcement learning agent
rlSARSAAgent SARSA reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and actions
getMaxQValue Obtain maximum estimated value over all possible actions from a Q-value function
critic with discrete action space, given environment observations
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing two doubles.
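For instance, based on the dimensions above:
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([2 1]);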
Create a deep neural network to approximate the Q-value function. The network must have two
inputs, one for the observation and one for the action. The observation input (here called myobs)
must accept a four-element vector (the observation vector defined by obsInfo). The action input
(here called myact) must accept a two-element vector (the action vector defined by actInfo). The
output of the network must be a scalar, representing the expected cumulative long-term reward when
the agent starts from the given observation and takes the given action.
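A sketch of the three paths referenced by the connectLayers calls below might look as follows; the
input layer names ('myobs' and 'myact') come from the description above, while the intermediate
layer sizes are illustrative.
% Observation path (input: four-element observation)
obsPath = [featureInputLayer(4,'Normalization','none','Name','myobs')
           fullyConnectedLayer(16,'Name','obsout')];
% Action path (input: two-element action)
actPath = [featureInputLayer(2,'Normalization','none','Name','myact')
           fullyConnectedLayer(16,'Name','actout')];
% Common path: add the two paths and output a scalar value
comPath = [additionLayer(2,'Name','add')
           fullyConnectedLayer(1,'Name','output')];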
net = addLayers(layerGraph(obsPath),actPath);
net = addLayers(net,comPath);
% connect layers
net = connectLayers(net,'obsout','add/in1');
net = connectLayers(net,'actout','add/in2');
Create the critic with rlQValueRepresentation, using the network, the observations and action
specification objects, as well as the names of the network input layers.
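With the input layer names used above ('myobs' and 'myact'), the call takes this form:
critic = rlQValueRepresentation(net,obsInfo,actInfo,...
    'Observation',{'myobs'},'Action',{'myact'})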
critic =
rlQValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a random observation and
action, using the current network weights.
v = getValue(critic,{rand(4,1)},{rand(2,1)})
v = single
0.1102
You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).
This example shows how to create a multi-output Q-value function critic for a discrete action space
using a deep neural network approximator.
This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).
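For instance:
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);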
Create a deep neural network approximator to approximate the Q-value function within the critic.
The input of the network (here called myobs) must accept a four-element vector, as defined by
obsInfo. The output must be a single output layer having as many elements as the number of
possible discrete actions (three in this case, as defined by actInfo).
net = [featureInputLayer(4,...
'Normalization','none','Name','myobs')
fullyConnectedLayer(3,'Name','value')];
Create the critic using the network, the observations specification object, and the name of the
network input layer.
critic = rlQValueRepresentation(net,obsInfo,actInfo,...
'Observation',{'myobs'})
critic =
rlQValueRepresentation with properties:
To check your critic, use the getValue function to return the values of a random observation, using
the current network weights. There is one value for each of the three possible actions.
v = getValue(critic,{rand(4,1)})
0.7232
0.8177
-0.2212
You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).
Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set with 4 possible values.
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example define the
action space as a finite set with 2 possible values.
Create a table to approximate the value function within the critic. rlTable creates a value table
object from the observation and action specifications objects.
qTable = rlTable(obsInfo,actInfo);
The table stores a value (representing the expected cumulative long term reward) for each possible
observation-action pair. Each row corresponds to an observation and each column corresponds to an
action. You can access the table using the Table property of the qTable object. The initial value of
each element is zero.
qTable.Table
ans = 4×2
0 0
0 0
0 0
0 0
You can initialize the table to any value, in this case, an array containing the integers from 1 through
8.
qTable.Table=reshape(1:8,4,2)
qTable =
rlTable with properties:
Create the critic using the table as well as the observations and action specification objects.
critic = rlQValueRepresentation(qTable,obsInfo,actInfo)
critic =
rlQValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a given observation and action,
using the current table entries.
v = getValue(critic,{5},{8})
v = 6
You can now use the critic (along with an actor) to create a discrete action space agent
relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous three-dimensional space, so that a single observation is a column vector containing three
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing 2 doubles.
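For instance, matching the dimensions above:
obsInfo = rlNumericSpec([3 1]);
actInfo = rlNumericSpec([2 1]);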
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations and
actions respectively defined by obsInfo and actInfo.
The output of the critic is the scalar W'*myBasisFcn(myobs,myact), where W is a weight column
vector that must have the same size as the custom basis function output. This output is the expected
cumulative long-term reward when the agent starts from the given observation and takes the given
action. The elements of W are the learnable parameters.
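As an illustration only, a basis function with a four-element output consistent with the weight vector
W0 below could be:
% Hypothetical basis function of the observation and the action
myBasisFcn = @(myobs,myact) [ ...
    myobs(1)^2 + myact(1); ...
    myobs(2) + myact(2); ...
    myobs(3)*myact(1); ...
    abs(myobs(2)*myact(2))];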
W0 = [1;4;4;2];
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second and third arguments are, respectively, the
observation and action specification objects.
critic = rlQValueRepresentation({myBasisFcn,W0},...
obsInfo,actInfo)
critic =
rlQValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a given observation-action pair,
using the current parameter vector.
v =
1×1 dlarray
252.3926
You can now use the critic (along with an actor) to create an agent relying on a Q-value
function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).
This example shows how to create a multi-output Q-value function critic for a discrete action space
using a custom basis function approximator.
This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous two-dimensional space, so that a single observation is a column vector containing two
doubles.
obsInfo = rlNumericSpec([2 1]);
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of 3 possible values (named 7, 5, and 3 in this case).
actInfo = rlFiniteSetSpec([7 5 3]);
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.
myBasisFcn = @(myobs) [myobs(2)^2; ...
myobs(1); ...
exp(myobs(2)); ...
abs(myobs(1))]
The output of the critic is the vector c = W'*myBasisFcn(myobs), where W is a weight matrix
which must have as many rows as the length of the basis function output, and as many columns as the
number of possible actions.
Each element of c is the expected cumulative long term reward when the agent starts from the given
observation and takes the action corresponding to the position of the considered element. The
elements of W are the learnable parameters.
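Define an initial parameter matrix of the required size (four rows, one column per action); random
values are used here as a placeholder.
W0 = rand(4,3);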
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.
critic = rlQValueRepresentation({myBasisFcn,W0},...
obsInfo,actInfo)
critic =
rlQValueRepresentation with properties:
To check your critic, use the getValue function to return the values of a random observation, using
the current parameter matrix. Note that there is one value for each of the three possible actions.
v = getValue(critic,{rand(2,1)})
v =
3x1 dlarray
2.1395
1.2183
2.3342
You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);
Create a recurrent deep neural network for your critic. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.
criticNetwork = [
sequenceInputLayer(numObs,...
'Normalization','none','Name','state')
fullyConnectedLayer(50, 'Name', 'CriticStateFC1')
reluLayer('Name','CriticRelu1')
lstmLayer(20,'OutputMode','sequence',...
'Name','CriticLSTM');
fullyConnectedLayer(20,'Name','CriticStateFC2')
reluLayer('Name','CriticRelu2')
fullyConnectedLayer(numDiscreteAct,...
'Name','output')];
Create a representation for your critic using the recurrent neural network.
criticOptions = rlRepresentationOptions(...
'LearnRate',1e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,...
obsInfo,actInfo,...
'Observation','state',criticOptions);
Version History
Introduced in R2020a
The following table shows some typical uses of rlQValueRepresentation to create neural
network-based critics, and how to update your code with one of the new Q-value approximator
objects instead.
The following table shows some typical uses of rlQValueRepresentation to create table-based
critics with discrete observation and action spaces, and how to update your code with one of the new
Q-value approximator objects instead.
The following table shows some typical uses of rlQValueRepresentation to create critics which
use a (linear in the learnable parameters) custom basis function, and how to update your code with
one of the new Q-value approximator objects instead. In these function calls, the first input argument
is a two-element cell array containing both the handle to the custom basis function and the initial
weight vector or matrix.
See Also
Functions
rlQValueFunction | rlRepresentationOptions | getActionInfo | getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlReplayMemory
Replay memory experience buffer
Description
An off-policy reinforcement learning agent stores experiences in a circular experience buffer. During
training, the agent samples mini-batches of experiences from the buffer and uses these mini-batches
to update its actor and critic function approximators.
By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object
as their experience buffer. Agents uniformly sample data from this buffer. To perform nonuniform
prioritized sampling, use an rlPrioritizedReplayMemory object.
When you create a custom off-policy reinforcement learning agent, you can create an experience
buffer by using an rlReplayMemory object.
Creation
Syntax
buffer = rlReplayMemory(obsInfo,actInfo)
buffer = rlReplayMemory(obsInfo,actInfo,maxLength)
Description
Input Arguments
You can extract the observation specifications from an existing environment or agent using
getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec
or rlNumericSpec.
You can extract the action specifications from an existing environment or agent using
getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or
rlNumericSpec.
Properties
MaxLength — Maximum buffer length
10000 (default) | positive integer
Object Functions
append Append experiences to replay memory buffer
sample Sample experiences from replay memory buffer
resize Resize replay memory experience buffer
allExperiences Return all experiences in replay memory buffer
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
Examples
Define observation specifications for the environment. For this example, assume that the environment
has a single observation channel with three continuous signals in specified ranges.
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with two continuous signals in specified ranges.
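For example (the specific limits here are assumptions consistent with the experiences created below):
obsInfo = rlNumericSpec([3 1],...
    LowerLimit=0,...
    UpperLimit=[1;5;10]);
actInfo = rlNumericSpec([2 1],...
    LowerLimit=0,...
    UpperLimit=[5;10]);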
buffer = rlReplayMemory(obsInfo,actInfo,20000);
Append a single experience to the buffer using a structure. Each experience contains the following
elements: current observation, action, next observation, reward, and is-done.
For this example, create an experience with random observation, action, and reward values. Indicate
that this experience is not a terminal condition by setting the IsDone value to 0.
exp.Observation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
exp.Reward = 10*rand(1);
exp.IsDone = 0;
append(buffer,exp);
You can also append a batch of experiences to the experience buffer using a structure array. For this
example, append a sequence of 100 random experiences, with the final experience representing a
terminal condition.
for i = 1:100
expBatch(i).Observation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Action = {actInfo.UpperLimit.*rand(2,1)};
expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(3,1)};
expBatch(i).Reward = 10*rand(1);
expBatch(i).IsDone = 0;
end
expBatch(100).IsDone = 1;
append(buffer,expBatch);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 50 experiences from the buffer.
miniBatch = sample(buffer,50);
You can sample a horizon of data from the buffer. For example, sample a horizon of 10 consecutive
experiences with a discount factor of 0.95.
horizonSample = sample(buffer,1,...
NStepHorizon=10,...
DiscountFactor=0.95);
In the returned horizon sample:
• Observation and Action are the observation and action from the first experience in the
horizon.
• NextObservation and IsDone are the next observation and termination signal from the final
experience in the horizon.
• Reward is the cumulative reward across the horizon using the specified discount factor.
You can also sample a sequence of consecutive experiences. In this case, the structure fields contain
arrays with values for all sampled experiences.
sequenceSample = sample(buffer,1,...
SequenceLength=20);
Define observation specifications for the environment. For this example, assume that the environment
has two observation channels: one channel with two continuous observations and one channel with a
three-valued discrete observation.
obsContinuous = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[1;5]);
obsDiscrete = rlFiniteSetSpec([1 2 3]);
obsInfo = [obsContinuous obsDiscrete];
Define action specifications for the environment. For this example, assume that the environment has
a single action channel with one continuous action in a specified range.
actInfo = rlNumericSpec([2 1],...
LowerLimit=0,...
UpperLimit=[5;10]);
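Create an experience buffer and a sample experience to append; the experience values here are
placeholders consistent with the specifications above.
% Create a replay memory buffer with the default maximum length
buffer = rlReplayMemory(obsInfo,actInfo);
% Experience with one element per observation channel
exp.Observation = {rand(2,1), 2};
exp.Action = {actInfo.UpperLimit.*rand(2,1)};
exp.NextObservation = {rand(2,1), 3};
exp.Reward = 10*rand(1);
exp.IsDone = 0;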
append(buffer,exp);
After appending experiences to the buffer, you can sample mini-batches of experiences for training of
your RL agent. For example, randomly sample a batch of 10 experiences from the buffer.
miniBatch = sample(buffer,10);
Create an environment for training the agent. For this example, load a predefined environment.
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
agent = rlDQNAgent(obsInfo,actInfo);
By default, the agent uses an experience buffer with a maximum size of 10,000.
agent.ExperienceBuffer
ans =
rlReplayMemory with properties:
MaxLength: 10000
Length: 0
Resize the experience buffer to increase its maximum length to 20,000 experiences.
resize(agent.ExperienceBuffer,20000)
agent.ExperienceBuffer
ans =
rlReplayMemory with properties:
MaxLength: 20000
Length: 0
Version History
Introduced in R2022a
See Also
rlPrioritizedReplayMemory
rlRepresentationOptions
(Not recommended) Options set for reinforcement learning agent representations (critics and actors)
Description
Use an rlRepresentationOptions object to specify an options set for critics
(rlValueRepresentation, rlQValueRepresentation) and actors
(rlDeterministicActorRepresentation, rlStochasticActorRepresentation).
Creation
Syntax
repOpts = rlRepresentationOptions
repOpts = rlRepresentationOptions(Name,Value)
Description
Properties
LearnRate — Learning rate for the representation
0.01 (default) | positive scalar
Learning rate for the representation, specified as a positive scalar. If the learning rate is too low, then
training takes a long time. If the learning rate is too high, then training might reach a suboptimal
result or diverge.
Example: 'LearnRate',0.025
Optimizer for training the network of the representation, specified as one of the following values.
• "adam" — Use the Adam optimizer. You can specify the decay rates of the gradient and squared
gradient moving averages using the GradientDecayFactor and
SquaredGradientDecayFactor fields of the OptimizerParameters option.
• "sgdm" — Use the stochastic gradient descent with momentum (SGDM) optimizer. You can specify
the momentum value using the Momentum field of the OptimizerParameters option.
• "rmsprop" — Use the RMSProp optimizer. You can specify the decay rate of the squared gradient
moving average using the SquaredGradientDecayFactor fields of the OptimizerParameters
option.
For more information about these optimizers, see “Stochastic Gradient Descent” in the Algorithms
section of trainingOptions in Deep Learning Toolbox.
Example: 'Optimizer',"sgdm"
Applicable parameters for the optimizer, specified as an OptimizerParameters object with the
following parameters.
Parameter Description
Momentum Contribution of previous step, specified as a
scalar from 0 to 1. A value of 0 means no
contribution from the previous step. A value of 1
means maximal contribution.
To change the default values, create an rlRepresentationOptions set and use dot notation to
access and change the properties of OptimizerParameters.
repOpts = rlRepresentationOptions;
repOpts.OptimizerParameters.GradientDecayFactor = 0.95;
Threshold value for the representation gradient, specified as Inf or a positive scalar. If the gradient
exceeds this value, the gradient is clipped as specified by the GradientThresholdMethod option.
Clipping the gradient limits how much the network parameters change in a training iteration.
Example: 'GradientThreshold',1
Gradient threshold method used to clip gradient values that exceed the gradient threshold, specified
as one of the following values.
For more information, see “Gradient Clipping” in the Algorithms section of trainingOptions in
Deep Learning Toolbox.
Example: 'GradientThresholdMethod',"absolute-value"
Factor for L2 regularization (weight decay), specified as a nonnegative scalar. For more information,
see “L2 Regularization” in the Algorithms section of trainingOptions in Deep Learning Toolbox.
To avoid overfitting when using a representation with many parameters, consider increasing the
L2RegularizationFactor option.
Example: 'L2RegularizationFactor',0.0005
Computation device used to perform deep neural network operations such as gradient computation,
parameter update and prediction during training. It is specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round off errors.
These errors can produce different results compared to performing the same operations on a CPU.
Note that if you want to use parallel processing to speed up training, you do not need to set
UseDevice. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs, see “Train Agents Using Parallel Computing and GPUs”.
Example: 'UseDevice',"gpu"
Object Functions
rlValueRepresentation (Not recommended) Value function critic representation for
reinforcement learning agents
rlQValueRepresentation (Not recommended) Q-Value function critic representation for
reinforcement learning agents
rlDeterministicActorRepresentation (Not recommended) Deterministic actor representation for
reinforcement learning agents
rlStochasticActorRepresentation (Not recommended) Stochastic actor representation for
reinforcement learning agents
Examples
Create an options set for creating a critic or actor representation for a reinforcement learning agent.
Set the learning rate for the representation to 0.05, and set the gradient threshold to 1. You can set
the options using Name,Value pairs when you create the options set. Any options that you do not
explicitly set have their default values.
repOpts = rlRepresentationOptions('LearnRate',5e-2,...
'GradientThreshold',1)
repOpts =
rlRepresentationOptions with properties:
LearnRate: 0.0500
GradientThreshold: 1
GradientThresholdMethod: "l2norm"
L2RegularizationFactor: 1.0000e-04
UseDevice: "cpu"
Optimizer: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]
Alternatively, create a default options set and use dot notation to change some of the values.
repOpts = rlRepresentationOptions;
repOpts.LearnRate = 5e-2;
repOpts.GradientThreshold = 1
repOpts =
rlRepresentationOptions with properties:
LearnRate: 0.0500
GradientThreshold: 1
GradientThresholdMethod: "l2norm"
L2RegularizationFactor: 1.0000e-04
UseDevice: "cpu"
Optimizer: "adam"
OptimizerParameters: [1x1 rl.option.OptimizerParameters]
If you want to change the properties of the OptimizerParameters option, use dot notation to
access them.
repOpts.OptimizerParameters.Epsilon = 1e-7;
repOpts.OptimizerParameters
ans =
OptimizerParameters with properties:
Version History
Introduced in R2019a
Specifically, you can create an agent options object and set its CriticOptimizerOptions and
ActorOptimizerOptions properties to suitable rlOptimizerOptions objects. Then you pass the
agent options object to the function that creates the agent. This workflow is shown in the following
table.
agent = rlACAgent(actor,critic,agentOpts)
Alternatively, you can create the agent and then use dot notation to access the optimization options
for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;.
See Also
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlSACAgent
Soft actor-critic reinforcement learning agent
Description
The soft actor-critic (SAC) algorithm is a model-free, online, off-policy, actor-critic reinforcement
learning method. The SAC algorithm computes an optimal policy that maximizes both the long-term
expected reward and the entropy of the policy. The policy entropy is a measure of policy uncertainty
given the state. A higher entropy value promotes more exploration. Maximizing both the reward and
the entropy balances exploration and exploitation of the environment. The action space can only be
continuous.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
agent = rlSACAgent(observationInfo,actionInfo)
agent = rlSACAgent(observationInfo,actionInfo,initOptions)
agent = rlSACAgent(actor,critics)
Description
agent = rlSACAgent(actor,critics) creates a SAC agent with the specified actor and critic
networks and default agent options.
agent = rlSACAgent( ___ ,agentOptions) sets the AgentOptions property for any of the
previous syntaxes.
Input Arguments
actor — Actor
rlContinuousGaussianActor object
critics — Critic
rlQValueFunction object | two-element row vector of rlQValueFunction objects
For a SAC agent, each critic must be a single-output rlQValueFunction object that takes both the
action and observations as inputs.
For more information on creating critics, see “Create Policies and Value Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
Action specification for a continuous action space, specified as an rlNumericSpec object defining
properties such as dimensions, data type and name of the action signals.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
If you create a SAC agent with default actor and critic that use recurrent neural networks, the default
value of AgentOptions.SequenceLength is 32.
Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer, where S is the current observation, A is the action taken, R is
the corresponding reward, S' is the next observation, and D is the is-done signal.
Option to use exploration policy when selecting actions, specified as one of the following logical
values.
• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.
Note This option affects only simulation and deployment; it does not affect training.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create environment and obtain observation and action specifications. For this example, load the
environment used in the example “Train DDPG Agent to Control Double Integrator System”. The
observation from the environment is a vector containing the position and velocity of a mass. The
action is a scalar representing a force, applied to the mass, ranging continuously from -2 to 2
Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a SAC agent from the environment observation and action specifications.
agent = rlSACAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension)})
You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force, applied to the mass,
ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a SAC agent from the environment observation and action specifications using the
initialization options.
agent = rlSACAgent(obsInfo,actInfo,initOpts);
Extract the deep neural network from the actor.
actorNet = getModel(getActor(agent));
Extract the deep neural networks from the two critics. Note that getModel(critics) only returns
the first critic network.
critics = getCritic(agent);
criticNet1 = getModel(critics(1));
criticNet2 = getModel(critics(2));
Display the layers of the first critic network, and verify that each hidden fully connected layer has 128
neurons.
criticNet1.Layers
ans =
9x1 Layer array with layers:
Plot the networks of the actor and of the second critic, and display the number of weights.
plot(layerGraph(actorNet))
summary(actorNet)
Initialized: true
Inputs:
1 'input_1' 2 features
plot(layerGraph(criticNet2))
summary(criticNet2)
Initialized: true
Inputs:
1 'input_1' 2 features
2 'input_2' 1 features
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension)})
You can now test and train the agent within the environment.
Create an environment and obtain observation and action specifications. For this example, load the
environment used in the example “Train DDPG Agent to Control Double Integrator System”. The
observation from the environment is a vector containing the position and velocity of a mass. The
action is a scalar representing a force, applied to the mass, ranging continuously from -2 to 2
Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
A SAC agent uses two Q-value function critics. To approximate each Q-value function, use a neural
network. The network for a single-output Q-value function critic must have two input layers, one for
the observation and the other for the action, and return a scalar value representing the expected
cumulative long-term reward following from the given observation and action.
Define each network path as an array of layer objects, and the dimensions of the observation and
action spaces from the environment specification objects.
% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="obsPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="obsPathOut")
];
% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="actPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="actPathOut")
];
% Common path
commonPath = [
concatenationLayer(1,2,Name="concat")
reluLayer
fullyConnectedLayer(1)
];
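Assemble the three paths into a layer graph before connecting them; this follows the same pattern as
the other network-construction examples in this section.
% Assemble the layer graph from the three paths
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);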
% Connect layers
criticNet = connectLayers(criticNet,"obsPathOut","concat/in1");
criticNet = connectLayers(criticNet,"actPathOut","concat/in2");
To initialize the network weights differently for the two critics, create two different dlnetwork
objects. You must do this because the agent constructor function does not accept two identical
critics.
criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);
Initialized: true
Inputs:
1 'obsPathIn' 2 features
2 'actPathIn' 1 features
Create the two critics using rlQValueFunction, using the two networks with different weights.
Alternatively, if you use exactly the same network with the same weights, you must explicitly initialize
the network each time (to make sure weights are initialized differently) before passing it to
rlQValueFunction. To do so, use initialize.
critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo, ...
ActionInputNames="actPathIn",ObservationInputNames="obsPathIn");
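Create the second critic from the second network in the same way, then check the first critic with a
random observation and action (the call mirrors the one shown below for critic2).
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo, ...
    ActionInputNames="actPathIn",ObservationInputNames="obsPathIn");
getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})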
ans = single
-0.1330
getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.1526
To approximate the policy within the actor, use a deep neural network. Since SAC agents use a
continuous Gaussian actor, the network must take the observation signal as input and return both a
mean value and a standard deviation value for each action. Therefore it must have two output layers
(one for the mean values, the other for the standard deviation values), each having as many elements
as the dimension of the action space.
Do not add a tanhLayer or scalingLayer in the mean output path. The SAC agent internally
transforms the unbounded Gaussian distribution to the bounded distribution to compute the
probability density function and entropy properly.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces from the environment specification objects, and specify a name for the input and output
layers, so you can later explicitly associate them with the appropriate channel.
% Define common input path
commonPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="netObsIn")
fullyConnectedLayer(400)
reluLayer(Name="CommonRelu")];
% Connect layers
actorNet = connectLayers(actorNet,"CommonRelu","meanIn/in");
actorNet = connectLayers(actorNet,"CommonRelu","stdIn/in");
Initialized: true
Inputs:
1 'netObsIn' 2 features
Create the actor using actorNet, the observation and action specification objects, and the names of
the input and output layers.
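Assuming the output layer names chosen in the sketch above ('mean' and 'std'), a possible call is:
actor = rlContinuousGaussianActor(actorNet,obsInfo,actInfo, ...
    ObservationInputNames="netObsIn", ...
    ActionMeanOutputNames="mean", ...
    ActionStandardDeviationOutputNames="std");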
getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options, including training options for actor and critics.
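The criticOptions and actorOptions objects referenced below are rlOptimizerOptions objects; for
example (the values are illustrative):
criticOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
actorOptions  = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);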
agentOptions = rlSACAgentOptions;
agentOptions.SampleTime = env.Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.MiniBatchSize = 32;
agentOptions.CriticOptimizerOptions = criticOptions;
agentOptions.ActorOptimizerOptions = actorOptions;
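Create the SAC agent from the actor, the two critics, and the agent options.
agent = rlSACAgent(actor,[critic1 critic2],agentOptions);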
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension)})
You can now test and train the agent within the environment.
For this example, load the environment used in the example “Train DDPG Agent to Control Double
Integrator System”. The observation from the environment is a vector containing the position and
velocity of a mass. The action is a scalar representing a force, applied to the mass, ranging
continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
A SAC agent uses two Q-value function critics. To approximate each Q-value function, use a neural
network. The network for a single-output Q-value function critic must have two input layers, one for
the observation and the other for the action, and return a scalar value representing the expected
cumulative long-term reward following from the given observation and action.
Define each network path as an array of layer objects, and the dimensions of the observation and
action spaces from the environment specification objects. To create a recurrent neural network, use
sequenceInputLayer as the input layer and include an lstmLayer as one of the other network
layers.
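One possible set of paths, using the layer names referenced by the connectLayers calls below
('obsOut', 'actOut', 'cat') and the input names shown in the network summary ('obsIn', 'actIn'), is
sketched here; the layer sizes are illustrative.
% Observation path (recurrent input)
obsPath = [
    sequenceInputLayer(prod(obsInfo.Dimension),Name="obsIn")
    fullyConnectedLayer(40)
    reluLayer
    fullyConnectedLayer(30,Name="obsOut")];
% Action path
actPath = [
    sequenceInputLayer(prod(actInfo.Dimension),Name="actIn")
    fullyConnectedLayer(30,Name="actOut")];
% Common path with an LSTM layer, ending in a scalar output
commonPath = [
    concatenationLayer(1,2,Name="cat")
    lstmLayer(16)
    reluLayer
    fullyConnectedLayer(1)];
% Assemble the layer graph from the three paths
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);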
% Connect paths
criticNet = connectLayers(criticNet,"obsOut","cat/in1");
criticNet = connectLayers(criticNet,"actOut","cat/in2");
To initialize the network weights differently for the two critics, create two different dlnetwork
objects. You must do this because the agent constructor function does not accept two identical
critics.
criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);
summary(criticNet1)
Initialized: true
Inputs:
1 'obsIn' Sequence input with 2 dimensions
2 'actIn' Sequence input with 1 dimensions
Create the critic using rlQValueFunction. Use the same network structure for both critics. The
SAC agent initializes the two networks using different default parameters.
critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo);
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo);
getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.0020
getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
0.0510
To approximate the policy within the actor, use a deep neural network. Since the critic has a
recurrent network, the actor must have a recurrent network too. The network must have two output
layers (one for the mean values, the other for the standard deviation values), each having as many
elements as the dimension of the action space.
Do not add a tanhLayer or scalingLayer in the mean output path. The SAC agent internally
transforms the unbounded Gaussian distribution to the bounded distribution to compute the
probability density function and entropy properly.
Define each network path as an array of layer objects and specify a name for the input and output
layers, so you can later explicitly associate them with the appropriate channel.
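A common input path consistent with the connections below (input layer named 'obsIn', output layer
named 'CommonOut') could be sketched as follows; the layer sizes are illustrative.
% Common recurrent input path
commonPath = [
    sequenceInputLayer(prod(obsInfo.Dimension),Name="obsIn")
    fullyConnectedLayer(400)
    lstmLayer(8)
    reluLayer(Name="CommonOut")];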
meanPath = [
fullyConnectedLayer(300,Name="MeanIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension),Name="Mean")
];
stdPath = [
fullyConnectedLayer(300,Name="StdIn")
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
softplusLayer(Name="StandardDeviation")];
actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);
actorNet = connectLayers(actorNet,"CommonOut","MeanIn/in");
actorNet = connectLayers(actorNet,"CommonOut","StdIn/in");
Initialized: true
Inputs:
1 'obsIn' Sequence input with 2 dimensions
getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options. To use a recurrent neural network, you must specify a SequenceLength
greater than 1.
agentOptions = rlSACAgentOptions;
agentOptions.SampleTime = env.Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.SequenceLength = 32;
agentOptions.MiniBatchSize = 32;
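Create the agent from the actor, the two critics, and the options, following the same pattern as in the
previous example.
agent = rlSACAgent(actor,[critic1 critic2],agentOptions);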
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Version History
Introduced in R2020b
See Also
rlAgentInitializationOptions | rlSACAgentOptions | rlQValueFunction |
rlContinuousGaussianActor | initialize | Deep Network Designer
Topics
“Soft Actor-Critic Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlSACAgentOptions
Options for SAC agent
Description
Use an rlSACAgentOptions object to specify options for soft actor-critic (SAC) agents. To create a
SAC agent, use rlSACAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlSACAgentOptions
opt = rlSACAgentOptions(Name,Value)
Description
opt = rlSACAgentOptions creates an options object for use as an argument when creating a SAC
agent using all default options. You can modify the object properties using dot notation.
Properties
EntropyWeightOptions — Entropy tuning options
EntropyWeightOptions object
Optimizer learning rate, specified as a nonnegative scalar. If LearnRate is zero, the EntropyWeight
value is fixed during training and the TargetEntropy value is ignored.
Target entropy value for tuning entropy weight, specified as a scalar. A higher target entropy value
encourages more exploration.
If you do not specify TargetEntropy, the agent uses –A as the target value, where A is the number
of actions.
• "adam" — Use the Adam optimizer. You can specify the decay rates of the gradient and squared
gradient moving averages using the GradientDecayFactor and
SquaredGradientDecayFactor fields of the OptimizerParameters option.
• "sgdm" — Use the stochastic gradient descent with momentum (SGDM) optimizer. You can specify
the momentum value using the Momentum field of the OptimizerParameters option.
• "rmsprop" — Use the RMSProp optimizer. You can specify the decay rate of the squared gradient
moving average using the SquaredGradientDecayFactor fields of the OptimizerParameters
option.
For more information about these optimizers, see “Stochastic Gradient Descent” in Deep Learning
Toolbox.
Threshold value for the entropy gradient, specified as Inf or a positive scalar. If the gradient exceeds
this value, the gradient is clipped.
Applicable parameters for the optimizer, specified as an OptimizerParameters object with the
following parameters. The default parameter values work well for most problems.
To change the default values, access the properties of OptimizerParameters using dot notation.
opt = rlSACAgentOptions;
opt.EntropyWeightOptions.OptimizerParameters.GradientDecayFactor = 0.95;
Number of steps between actor policy updates, specified as a positive integer. For more information,
see “Training Algorithm”.
Number of steps between critic updates, specified as a positive integer. For more information, see
“Training Algorithm”.
Number of actions to take before updating actor and critics, specified as a positive integer. By
default, the NumWarmStartSteps value is equal to the MiniBatchSize value.
Number of gradient steps to take when updating actor and critics, specified as a positive integer.
Smoothing factor for target critic updates, specified as a positive scalar less than or equal to 1. For
more information, see “Target Update Methods”.
Number of steps between target critic updates, specified as a positive integer. For more information,
see “Target Update Methods”.
Option for clearing the experience buffer before training, specified as a logical value.
Maximum batch-training trajectory length when using a recurrent neural network, specified as a
positive integer. This value must be greater than 1 when using a recurrent neural network and 1
otherwise.
Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the actor and critics. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.
NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer
Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see [1], Chapter 7.
Note that if parallel training is enabled (that is, if an rlTrainingOptions object with the UseParallel property set to true is passed to train), then NumStepsToLookAhead must be set to 1; otherwise, an error is generated. This guarantees that experiences are stored contiguously.
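As a minimal sketch of a configuration consistent with this constraint (all other option values are left at their defaults):

% Sketch: parallel training requires single-step lookahead
agentOptions = rlSACAgentOptions(NumStepsToLookAhead=1);
trainOpts = rlTrainingOptions(UseParallel=true);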
Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlSACAgent Soft actor-critic reinforcement learning agent
Examples
Create an rlSACAgentOptions object that specifies the discount factor.

opt = rlSACAgentOptions('DiscountFactor',0.95)
opt =
rlSACAgentOptions with properties:
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
For SAC agents, configure the entropy weight optimizer using the options in
EntropyWeightOptions. For example, set the target entropy value to –5.
opt.EntropyWeightOptions.TargetEntropy = -5;
Version History
Introduced in R2020b
The UseDeterministicExploitation agent option is being replaced by the UseExplorationPolicy property of the agent.

Previously, you set UseDeterministicExploitation as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy deterministic policy for simulation and deployment.

agent.AgentOptions.UseDeterministicExploitation = true;

• Allow the agent to select its action by sampling its probability distribution for simulation and policy deployment, thereby using a stochastic policy that explores the observation space.

agent.AgentOptions.UseDeterministicExploitation = false;

Now, set UseExplorationPolicy as follows.

• Force the agent to always select the action with maximum likelihood, thereby using a greedy deterministic policy for simulation and deployment.

agent.UseExplorationPolicy = false;

• Allow the agent to select its action by sampling its probability distribution for simulation and policy deployment, thereby using a stochastic policy that explores the observation space.

agent.UseExplorationPolicy = true;
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.
See Also
rlSACAgent
Topics
“Soft Actor-Critic Agents”
rlSARSAAgent
SARSA reinforcement learning agent
Description
The SARSA algorithm is a model-free, online, on-policy reinforcement learning method. A SARSA
agent is a value-based reinforcement learning agent which trains a critic to estimate the return or
future rewards.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
agent = rlSARSAAgent(critic,agentOptions)
Description
Input Arguments
critic — Critic
rlQValueFunction object
Critic, specified as an rlQValueFunction object. For more information on creating critics, see
“Create Policies and Value Functions”.
Properties
AgentOptions — Agent options
rlSARSAAgentOptions object
Option to use exploration policy when selecting actions, specified as one of the following logical values.
• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The initial value of SampleTime matches the value specified in
AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create or load an environment interface. For this example load the Basic Grid World environment
interface also used in the example “Train Reinforcement Learning Agent in Basic Grid World”.
env = rlPredefinedEnv("BasicGridWorld");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a table approximation model derived from the environment observation and action
specifications.
qTable = rlTable(obsInfo,actInfo);
Create the critic using qTable. SARSA agents use an rlQValueFunction object to implement the critic.
critic = rlQValueFunction(qTable,obsInfo,actInfo);
Create a SARSA agent using the specified critic and an epsilon value of 0.05.
opt = rlSARSAAgentOptions;
opt.EpsilonGreedyExploration.Epsilon = 0.05;
agent = rlSARSAAgent(critic,opt)
agent =
rlSARSAAgent with properties:
To check your agent, use getAction to return the action from a random observation.
act = getAction(agent,{randi(numel(obsInfo.Elements))});
act{1}
ans = 1
You can now test and train the agent against the environment.
Version History
Introduced in R2019a
See Also
rlSARSAAgentOptions
Topics
“SARSA Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlSARSAAgentOptions
Options for SARSA agent
Description
Use an rlSARSAAgentOptions object to specify options for creating SARSA agents. To create a
SARSA agent, use rlSARSAAgent.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlSARSAAgentOptions
opt = rlSARSAAgentOptions(Name,Value)
Description
Properties
EpsilonGreedyExploration — Options for epsilon-greedy exploration
EpsilonGreedyExploration object
At the end of each training time step, if Epsilon is greater than EpsilonMin, then it is updated
using the following formula.
Epsilon = Epsilon*(1-EpsilonDecay)
If your agent converges on local optima too quickly, you can promote agent exploration by increasing
Epsilon.
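As an illustration of the decay formula above, the following sketch (with illustrative values for Epsilon, EpsilonMin, and EpsilonDecay) shows how the exploration probability shrinks over training steps:

% Illustrative values; actual values come from the EpsilonGreedyExploration options
Epsilon = 1; EpsilonMin = 0.01; EpsilonDecay = 0.005;
for k = 1:1000
    if Epsilon > EpsilonMin
        Epsilon = Epsilon*(1-EpsilonDecay);
    end
end
Epsilon   % decays each step until it falls below EpsilonMin, then stays constant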
To specify exploration options, use dot notation after creating the rlSARSAAgentOptions object
opt. For example, set the epsilon value to 0.9.
opt.EpsilonGreedyExploration.Epsilon = 0.9;
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlSARSAAgent SARSA reinforcement learning agent
Examples
Create an rlSARSAAgentOptions object that specifies the agent sample time.

opt = rlSARSAAgentOptions('SampleTime',0.5)
opt =
rlSARSAAgentOptions with properties:
You can modify options using dot notation. For example, set the agent discount factor to 0.95.
opt.DiscountFactor = 0.95;
Version History
Introduced in R2019a
See Also
Topics
“SARSA Agents”
rlSimulationOptions
Options for simulating a reinforcement learning agent within an environment
Description
Use an rlSimulationOptions object to specify simulation options for simulating a reinforcement
learning agent within an environment. To perform the simulation, use sim.
For more information on agent training and simulation, see “Train Reinforcement Learning Agents”.
Creation
Syntax
simOpts = rlSimulationOptions
opt = rlSimulationOptions(Name,Value)
Description
Properties
MaxSteps — Number of steps to run the simulation
500 (default) | positive integer
Number of steps to run the simulation, specified as the comma-separated pair consisting of
'MaxSteps' and a positive integer. In general, you define episode termination conditions in the
environment. This value is the maximum number of steps to run in the simulation if those termination
conditions are not met.
Example: 'MaxSteps',1000
Example: 'NumSimulations',10
Stop simulation when an error occurs, specified as "off" or "on". When this option is "off", errors
are captured and returned in the SimulationInfo output of sim, and simulation continues.
Flag for using parallel simulation, specified as a logical. Setting this option to true configures the
simulation to use parallel processing to simulate the environment, thereby enabling usage of multiple
cores, processors, computer clusters or cloud resources to speed up simulation. To specify options for
parallel simulation, use the ParallelizationOptions property.
Note that if you want to speed up deep neural network calculations (such as gradient computation, parameter update, and prediction) using a local GPU, you do not need to set UseParallel to true.
Instead, when creating your actor or critic representation, use an rlRepresentationOptions
object in which the UseDevice option is set to "gpu".
Using parallel computing or the GPU requires Parallel Computing Toolbox software. Using computer
clusters or cloud resources additionally requires MATLAB Parallel Server™.
For more information about training using multicore processors and GPUs, see “Train Agents Using
Parallel Computing and GPUs”.
Example: 'UseParallel',true
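For example, a minimal sketch of running several simulations in parallel (the environment env and the agent agent are assumed to exist already):

% Sketch: run four simulations in parallel
simOpts = rlSimulationOptions('UseParallel',true,'NumSimulations',4);
experiences = sim(env,agent,simOpts);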
The ParallelizationOptions property contains a ParallelSimulation object with the following properties, which you can modify using dot notation after creating the rlSimulationOptions object.
• –1 — Assign a unique random seed to each worker. The value of the seed is the worker ID.
• –2 — Do not assign a random seed to the workers.
• Vector — Manually specify the random seed for each worker. The number of elements in the vector must match the number of workers.
Send model and workspace variables to parallel workers, specified as "on" or "off". When the
option is "on", the host sends variables used in models and defined in the base MATLAB workspace
to the workers.
Additional files to attach to the parallel pool, specified as a string or string array.
Function to run before simulation starts, specified as a handle to a function having no input
arguments. This function is run once per worker before simulation begins. Write this function to
perform any processing that you need prior to simulation.
Function to run after simulation ends, specified as a handle to a function having no input arguments.
You can write this function to clean up the workspace or perform other processing after simulation
terminates.
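As a sketch of how these hooks might be used, assuming the corresponding fields of the ParallelizationOptions object are named SetupFcn and CleanupFcn (the field names are not shown in the text above, so treat them as an assumption):

% Sketch: per-worker setup and cleanup hooks for parallel simulation
simOpts = rlSimulationOptions('UseParallel',true);
simOpts.ParallelizationOptions.SetupFcn = @() rng(0);          % seed each worker before simulation
simOpts.ParallelizationOptions.CleanupFcn = @() disp("done");  % report when a worker finishes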
Object Functions
sim Simulate trained reinforcement learning agents within specified environment
Examples
Create an options set for simulating a reinforcement learning environment. Set the number of steps
to simulate to 1000, and configure the options to run three simulations.
You can set the options using Name,Value pairs when you create the options set. Any options that you
do not explicitly set have their default values.
simOpts = rlSimulationOptions(...
'MaxSteps',1000,...
'NumSimulations',3)
simOpts =
rlSimulationOptions with properties:
MaxSteps: 1000
NumSimulations: 3
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelSimulation]
Alternatively, create a default options set and use dot notation to change some of the values.
simOpts = rlSimulationOptions;
simOpts.MaxSteps = 1000;
simOpts.NumSimulations = 3;
simOpts
simOpts =
rlSimulationOptions with properties:
MaxSteps: 1000
NumSimulations: 3
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelSimulation]
Version History
Introduced in R2019a
See Also
Topics
“Reinforcement Learning Agents”
rlStochasticActorPolicy
Policy object to generate stochastic actions for custom training loops and application deployment
Description
This object implements a stochastic policy, which returns stochastic actions given an input
observation, according to a probability distribution. You can create an rlStochasticActorPolicy
object from an rlDiscreteCategoricalActor or rlContinuousGaussianActor, or extract it
from an rlPGAgent, rlACAgent, rlPPOAgent, rlTRPOAgent, or rlSACAgent. You can then train
the policy object using a custom training loop or deploy it for your application using
generatePolicyBlock or generatePolicyFunction. If UseMaxLikelihoodAction is set to true, the policy is deterministic and therefore does not explore. For more information on policies and value functions, see “Create Policies and Value Functions”.
Creation
Syntax
policy = rlStochasticActorPolicy(actor)
Description
Properties
Actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object
Option to always use the maximum likelihood action, specified as a logical value: either false (default; the action is sampled from the probability distribution, which helps exploration) or true (the maximum likelihood action is always used). When the maximum likelihood action is always used, the policy is deterministic and therefore does not explore.
Example: false
Sample time of the policy, specified as a positive scalar or as -1 (default). Setting this parameter to
-1 allows for event-based simulations.
Within a Simulink environment, the RL Agent block in which the policy is specified executes every
SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time
from its parent subsystem.
Within a MATLAB environment, the policy is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience. If
SampleTime is -1, the sample time is treated as being equal to 1.
Example: 0.2
Object Functions
generatePolicyBlock Generate Simulink block that evaluates policy of an agent or policy object
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
getAction Obtain action from agent, actor, or policy object given environment
observations
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
reset Reset environment, agent, experience buffer, or policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
Examples
Create observation and action specification objects. For this example, define a continuous four-
dimensional observation space and a discrete action space having two possible actions.
Create a discrete categorical actor. This actor must accept an observation as input and return an
output vector in which each element represents the probability of taking the corresponding action.
To approximate the policy function within the actor, use a deep neural network model. Define the
network as an array of layer objects, and get the dimension of the observation space and the number
of possible actions from the environment specification objects.
layers = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(16)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];
Convert the network to a dlnetwork object and display the number of weights.
model = dlnetwork(layers);
summary(model)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor using model, and the observation and action specifications.
actor = rlDiscreteCategoricalActor(model,obsInfo,actInfo)
actor =
rlDiscreteCategoricalActor with properties:
To return the probability distribution of the possible actions as a function of a random observation,
and given the current network weights, use evaluate.
prb = evaluate(actor,{rand(obsInfo.Dimension)});
prb{1}
0.5850
0.4150
Create the policy object from the actor.

policy = rlStochasticActorPolicy(actor)

policy = 
  rlStochasticActorPolicy with properties:
You can access the policy options using dot notation. For example, set the option to always use the
maximum likelihood action, thereby making the policy deterministic.
policy.UseMaxLikelihoodAction = true
policy =
rlStochasticActorPolicy with properties:
act = getAction(policy,{rand(obsInfo.Dimension)});
act{1}
ans = -1
You can now train the policy with a custom training loop and then deploy it to your application.
Version History
Introduced in R2022a
See Also
Functions
rlMaxQPolicy | rlEpsilonGreedyPolicy | rlDeterministicActorPolicy |
rlAdditiveNoisePolicy | rlStochasticActorPolicy | rlDiscreteCategoricalActor |
rlContinuousGaussianActor | rlPGAgent | rlACAgent | rlSACAgent | rlPPOAgent |
rlTRPOAgent | generatePolicyBlock | generatePolicyFunction
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Model-Based Reinforcement Learning Using Custom Training Loop”
“Train Reinforcement Learning Policy Using Custom Training Loop”
rlStochasticActorRepresentation
(Not recommended) Stochastic actor representation for reinforcement learning agents
Description
This object implements a function approximator to be used as a stochastic actor within a
reinforcement learning agent. A stochastic actor takes the observations as inputs and returns a
random action, thereby implementing a stochastic policy with a specific probability distribution. After
you create an rlStochasticActorRepresentation object, use it to create a suitable agent, such
as an rlACAgent or rlPGAgent agent. For more information on creating representations, see
“Create Policies and Value Functions”.
Creation
Syntax
discActor = rlStochasticActorRepresentation(net,observationInfo,
discActionInfo,'Observation',obsName)
discActor = rlStochasticActorRepresentation({basisFcn,W0},observationInfo,
actionInfo)
discActor = rlStochasticActorRepresentation( ___ ,options)
contActor = rlStochasticActorRepresentation(net,observationInfo,
contActionInfo,'Observation',obsName)
contActor = rlStochasticActorRepresentation( ___ ,options)
Description
Discrete Action Space Stochastic Actor
discActor = rlStochasticActorRepresentation(net,observationInfo,
discActionInfo,'Observation',obsName) creates a stochastic actor with a discrete action
space, using the deep neural network net as function approximator. Here, the output layer of net
must have as many elements as the number of possible discrete actions. This syntax sets the
ObservationInfo and ActionInfo properties of discActor to the inputs observationInfo and
discActionInfo, respectively. obsName must contain the names of the input layers of net.
discActor = rlStochasticActorRepresentation({basisFcn,W0},observationInfo,
actionInfo) creates a discrete space stochastic actor using a custom basis function as underlying
approximator. The first input argument is a two-element cell array in which the first element contains the
handle basisFcn to a custom basis function, and the second element contains the initial weight
matrix W0. This syntax sets the ObservationInfo and ActionInfo properties of discActor to the inputs
observationInfo and actionInfo, respectively.
contActor = rlStochasticActorRepresentation(net,observationInfo,
contActionInfo,'Observation',obsName) creates a Gaussian stochastic actor with a
continuous action space using the deep neural network net as function approximator. Here, the
output layer of net must have twice as many elements as the number of dimensions of the continuous
action space. This syntax sets the ObservationInfo and ActionInfo properties of contActor to the
inputs observationInfo and contActionInfo respectively. obsName must contain the names of
the input layers of net.
Note contActor does not enforce constraints set by the action specification. Therefore, when using this actor, you must enforce action space constraints within the environment.
Input Arguments
Deep neural network used as the underlying approximator within the actor, specified as one of the
following:
For a discrete action space stochastic actor, net must have the observations as input and a single
output layer having as many elements as the number of possible discrete actions. Each element
represents the probability (which must be nonnegative) of executing the corresponding action.
For a continuous action space stochastic actor, net must have the observations as input and a single
output layer having twice as many elements as the number of dimensions of the continuous action
space. The elements of the output vector represent all the mean values followed by all the standard
deviations (which must be nonnegative) of the Gaussian distributions for the dimensions of the action
space.
Note The fact that standard deviations must be nonnegative while mean values must fall within the
output range means that the network must have two separate paths. The first path must produce an
estimation for the mean values, so any output nonlinearity must be scaled so that its output falls in
the desired range. The second path must produce an estimation for the standard deviations, so you
must use a softplus or ReLU layer to enforce nonnegativity.
The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names specified in obsName. The network output layer must have the same data type and
dimension as the signal defined in ActionInfo.
For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the input layers in net.
Example: {'my_obs'}
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the actor is the vector a = softmax(W'*B), where W is a weight matrix and B is the column
vector returned by the custom basis function. Each element of a represents the probability of taking
the corresponding action. The learnable parameters of the actor are the elements of W.
When creating a stochastic actor representation, your basis function must have the following
signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here obs1 to obsN are observations in the same order and with the same data type and dimensions as the signals defined in observationInfo.
Example: @(obs1,obs2,obs3) [obs3(2)*obs1(1)^2; abs(obs2(5)+obs3(1))]
Initial value of the basis function weights, W, specified as a matrix. It must have as many rows as the
length of the basis function output, and as many columns as the number of possible actions.
Properties
Options — Representation options
rlRepresentationOptions object
Note TRPO agents use only the Options.UseDevice representation options and ignore the other
training and learning rate options.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually.
For custom basis function representations, the action signal must be a scalar, a column vector, or a
discrete action.
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
rlSACAgent Soft actor-critic reinforcement learning agent
getAction Obtain action from agent, actor, or policy object given environment observations
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as consisting of
three values, -10, 0, and 10.
Create a deep neural network approximator for the actor. The input of the network (here called
state) must accept a four-element vector (the observation vector just defined by obsInfo), and its
output (here called actionProb) must be a three-element vector. Each element of the output vector
must be between 0 and 1 since it represents the probability of executing each of the three possible
actions (as defined by actInfo). Using softmax as the output layer enforces this requirement.
net = [ featureInputLayer(4,'Normalization','none',...
'Name','state')
fullyConnectedLayer(3,'Name','fc')
softmaxLayer('Name','actionProb') ];
Create the actor with rlStochasticActorRepresentation, using the network, the observations
and action specification objects, as well as the names of the network input layer.
actor = rlStochasticActorRepresentation(net,obsInfo,actInfo,...
'Observation','state')
actor =
rlStochasticActorRepresentation with properties:
To validate your actor, use getAction to return a random action from the observation vector [1 1 1 1], using the current network weights.

act = getAction(actor,{[1 1 1 1]'});
act{1}

ans = 10
You can now use the actor to create a suitable agent, such as an rlACAgent, or rlPGAgent agent.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous six-dimensional space, so that a single observation is a column vector containing 6
doubles.
Create an action specification object (or alternatively use getActionInfo to extract the
specification object from an environment). For this example, define the action space as a continuous
two-dimensional space, so that a single action is a column vector containing 2 doubles both between
-10 and 10.
Create a deep neural network approximator for the actor. The observation input (here called myobs)
must accept a six-dimensional vector (the observation vector just defined by obsInfo). The output
(here called myact) must be a four-dimensional vector (twice the number of dimensions defined by
actInfo). The elements of the output vector represent, in sequence, all the means and all the
standard deviations of every action.
The fact that standard deviations must be non-negative while mean values must fall within the output
range means that the network must have two separate paths. The first path is for the mean values,
and any output nonlinearity must be scaled so that it can produce values in the output range. The
second path is for the standard deviations, and you can use a softplus or relu layer to enforce non-
negativity.
actorOpts = rlRepresentationOptions('LearnRate',8e-3,'GradientThreshold',1);
Create the actor with rlStochasticActorRepresentation, using the network, the observations
and action specification objects, the names of the network input layer and the options object.
actor =
rlStochasticActorRepresentation with properties:
To check your actor, use getAction to return a random action from the observation vector
ones(6,1), using the current network weights.
act = getAction(actor,{ones(6,1)});
act{1}
-0.0763
9.6860
You can now use the actor to create a suitable agent (such as an rlACAgent, rlPGAgent, or
rlPPOAgent agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a continuous two-dimensional space, so that a single observation is a column vector containing 2 doubles.
The stochastic actor based on a custom basis function does not support continuous action spaces.
Therefore, create a discrete action space specification object (or alternatively use getActionInfo to
extract the specification object from an environment with a discrete action space). For this example,
define the action space as a finite set consisting of 3 possible values (named 7, 5, and 3 in this case).
Create a custom basis function. Each element is a function of the observations defined by obsInfo.
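The definition of the basis function itself is not shown here. A minimal sketch that maps the two-element observation to a four-element feature vector (the specific features chosen are illustrative assumptions, not the ones used in the original example) is:

% Sketch: map a 2-by-1 observation to a 4-by-1 feature vector
myBasisFcn = @(myobs) [myobs(1);
                       myobs(2);
                       myobs(1)*myobs(2);
                       myobs(1)^2];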
The output of the actor is the action, among the ones defined in actInfo, corresponding to the
element of softmax(W'*myBasisFcn(myobs)) which has the highest value. W is a weight matrix,
containing the learnable parameters, which must have as many rows as the length of the basis
function output, and as many columns as the number of possible actions.
W0 = rand(4,3);
Create the actor. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.
actor = rlStochasticActorRepresentation({myBasisFcn,W0},obsInfo,actInfo)
actor =
rlStochasticActorRepresentation with properties:
To check your actor use the getAction function to return one of the three possible actions,
depending on a given random observation and on the current parameter matrix.
v = getAction(actor,{rand(2,1)})
You can now use the actor (along with a critic) to create a suitable discrete action space agent.
For this example, you create a stochastic actor with a discrete action space using a recurrent neural
network. You can also use a recurrent neural network for a continuous stochastic actor using the
same method.
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);
Create a recurrent deep neural network for the actor. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.
actorNetwork = [
sequenceInputLayer(numObs,'Normalization','none','Name','state')
fullyConnectedLayer(8,'Name','fc')
reluLayer('Name','relu')
lstmLayer(8,'OutputMode','sequence','Name','lstm')
fullyConnectedLayer(numDiscreteAct,'Name','output')
softmaxLayer('Name','actionProb')];
Version History
Introduced in R2020a
See Also
Functions
rlDiscreteCategoricalActor | rlContinuousGaussianActor | rlRepresentationOptions
| getActionInfo | getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlTable
Value table or Q table
Description
Value tables and Q tables are one way to represent critic networks for reinforcement learning. Value
tables store rewards for a finite set of observations. Q tables store rewards for corresponding finite
observation-action pairs.
Creation
Syntax
T = rlTable(obsinfo)
T = rlTable(obsinfo,actinfo)
Description
Input Arguments
Properties
Table — Reward table
array
• For a value table, the array contains NO rows, where NO is the number of finite observation values.
• For a Q table, the array contains NO rows and NA columns, where NA is the number of possible finite actions.
Object Functions
rlValueFunction Value function approximator object for reinforcement learning agents
rlQValueFunction Q-Value function approximator object for reinforcement learning agents
rlVectorQValueFunction Vector Q-value function approximator for reinforcement learning agents
Examples
This example shows how to use rlTable to create a value table. You can use such a table to
represent the critic of an actor-critic agent with a finite observation space.
env = rlPredefinedEnv("BasicGridWorld");
obsInfo = getObservationInfo(env)
obsInfo =
rlFiniteSetSpec with properties:
vTable = rlTable(obsInfo)
vTable =
rlTable with properties:
Create a Q Table
This example shows how to use rlTable to create a Q table. Such a table could be used to represent
the actor or critic of an agent with finite observation and action spaces.
Create an environment interface, and obtain its observation and action specifications.
env = rlMDPEnv(createMDP(8,["up";"down"]));
obsInfo = getObservationInfo(env)
obsInfo =
rlFiniteSetSpec with properties:
Dimension: [1 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlFiniteSetSpec with properties:
qTable = rlTable(obsInfo,actInfo)
qTable =
rlTable with properties:
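As a sketch of how you might inspect or seed the table entries using the Table property described above (the values assigned here are illustrative, not part of the original example):

% The Q table has 8 rows (one per MDP state) and 2 columns (one per action)
size(qTable.Table)
qTable.Table(1,:) = [0.5 -0.5];   % illustrative initial values for state 1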
Version History
Introduced in R2019a
See Also
Topics
“Create Policies and Value Functions”
rlTD3Agent
Twin-delayed deep deterministic policy gradient reinforcement learning agent
Description
The twin-delayed deep deterministic policy gradient (TD3) algorithm is an actor-critic, model-free, online, off-policy reinforcement learning method which computes an optimal policy that maximizes the long-term reward. The action space can only be continuous.
• Twin-delayed deep deterministic policy gradient (TD3) agent with two Q-value functions. This
agent prevents overestimation of the value function by learning two Q value functions and using
the minimum values for policy updates.
• Delayed deep deterministic policy gradient (delayed DDPG) agent with a single Q value function.
This agent is a DDPG agent with target policy smoothing and delayed policy and target updates.
For more information, see “Twin-Delayed Deep Deterministic Policy Gradient Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.
Creation
Syntax
agent = rlTD3Agent(observationInfo,actionInfo)
agent = rlTD3Agent(observationInfo,actionInfo,initOpts)
agent = rlTD3Agent(actor,critics,agentOptions)
Description
Create Agent from Observation and Action Specifications
agent = rlTD3Agent( ___ ,agentOptions) creates a TD3 agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
Input Arguments
actor — Actor
rlContinuousDeterministicActor object
• rlQValueFunction object — Create a delayed DDPG agent with a single Q value function. This
agent is a DDPG agent with target policy smoothing and delayed policy and target updates.
• Two-element row vector of rlQValueFunction objects — Create a TD3 agent with two critic
value functions. The two critic networks must be unique rlQValueFunction objects with the
same observation and action specifications. The critics can either have different structures or the
same structure but with different initial parameters.
For more information on creating critics, see “Create Policies and Value Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
Since a TD3 agent operates in a continuous action space, you must specify actionInfo as an
rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlNumericSpec.
If you create a TD3 agent with default actor and critic that use recurrent neural networks, the default
value of AgentOptions.SequenceLength is 32.
Option to use exploration policy when selecting actions, specified as one of the following logical values.
• true — Use the base agent exploration policy when selecting actions.
• false — Use the base agent greedy policy when selecting actions.
Experience buffer, specified as an rlReplayMemory object. During training the agent stores each of
its experiences (S,A,R,S',D) in a buffer. Here:
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
generatePolicyFunction Generate function that evaluates policy of an agent or policy object
Examples
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force, applied to the mass,
ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. Ensure reproducibility
by fixing the seed of the random generator.
rng(0)
Create a TD3 agent from the environment observation and action specifications.
agent = rlTD3Agent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment. You can also use getActor and
getCritic to extract the actor and critic, respectively, and getModel to extract the approximator
model (by default a deep neural network) from the actor or critic.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
Create an agent initialization option object, specifying that each hidden fully connected layer in the
network must have 128 neurons (instead of the default number, 256).
initOpts = rlAgentInitializationOptions(NumHiddenUnit=128);
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.
% rng(0)
Create a TD3 agent from the environment observation and action specifications.

agent = rlTD3Agent(obsInfo,actInfo,initOpts);

Extract the deep neural network from the actor.

actorNet = getModel(getActor(agent));
Extract the deep neural networks from the two critics. Note that getModel(critics) only returns
the first critic network.
critics = getCritic(agent);
criticNet1 = getModel(critics(1));
criticNet2 = getModel(critics(2));
Display the layers of the first critic network, and verify that each hidden fully connected layer has 128
neurons.
criticNet1.Layers
ans =
13x1 Layer array with layers:
Plot the networks of the actor and of the second critic, and display the number of weights.
plot(layerGraph(actorNet))
summary(actorNet)
Initialized: true
Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features
plot(layerGraph(criticNet2))
summary(criticNet2)
Initialized: true
Inputs:
1 'input_1' 50x50x1 images
2 'input_2' 1 features
3 'input_3' 1 features
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Control Double Integrator System”. The observation from the environment is a vector containing the
position and velocity of a mass. The action is a scalar representing a force ranging continuously from
-2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
A TD3 agent uses two Q-value function critics. To approximate each Q-value function, use a neural
network. The network for a single-output Q-value function critic must have two input layers, one for
the observation and the other for the action, and return a scalar value representing the expected
cumulative long-term reward following from the given observation and action.
Define each network path as an array of layer objects, and get the dimensions of the observation and action spaces from the environment specification objects.
% Observation path
obsPath = [
featureInputLayer(prod(obsInfo.Dimension),Name="obsPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="obsPathOut")
];
% Action path
actPath = [
featureInputLayer(prod(actInfo.Dimension),Name="actPathIn")
fullyConnectedLayer(32)
reluLayer
fullyConnectedLayer(16,Name="actPathOut")
];
% Common path
commonPath = [
concatenationLayer(1,2,Name="concat")
reluLayer
fullyConnectedLayer(1)
];
% Assemble the layerGraph network and add the paths
criticNet = layerGraph;
criticNet = addLayers(criticNet,obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);

% Connect layers
criticNet = connectLayers(criticNet,"obsPathOut","concat/in1");
criticNet = connectLayers(criticNet,"actPathOut","concat/in2");
To initialize the network weights differently for the two critics, create two different dlnetwork objects. You must do this because the agent constructor function does not accept two identical critics.
criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);
summary(criticNet1)
Initialized: true
Inputs:
1 'obsPathIn' 2 features
2 'actPathIn' 1 features
Create the two critics using rlQValueFunction, using the two networks with different weights.
Alternatively, if you use exactly the same network with the same weights, you must explicitly initialize
the network each time (to make sure weights are initialized differently) before passing it to
rlQValueFunction. To do so, use initialize.
critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo);
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo);
Check the critics using a random observation and a random action.

getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})

ans = single
    -0.1330
getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.1526
Create a neural network to be used as approximation model within the actor. For TD3 agents, the actor executes a deterministic policy, which is implemented by a continuous deterministic actor. In this case the network must take the observation signal as input and return an action. Therefore the output layer must have as many elements as the number of dimensions of the action space.

Define the network as an array of layer objects, and get the dimension of the observation space and the number of dimensions of the action space from the environment specification objects.
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(400)
reluLayer
fullyConnectedLayer(300)
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
tanhLayer
];
Convert the network to a dlnetwork object and display a summary.

actorNet = dlnetwork(actorNet);
summary(actorNet)

   Initialized: true
   Inputs:
      1   'input'   2 features
Create the actor using actorNet. TD3 agents use an rlContinuousDeterministicActor object
to implement the actor.
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);
To check your actor, use getAction to return the action from a random observation.

getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options, including training options for actor and critics.
agentOptions = rlTD3AgentOptions;
agentOptions.DiscountFactor = 0.99;
agentOptions.TargetSmoothFactor = 5e-3;
agentOptions.TargetPolicySmoothModel.Variance = 0.2;
agentOptions.TargetPolicySmoothModel.LowerLimit = -0.5;
agentOptions.TargetPolicySmoothModel.UpperLimit = 0.5;
agentOptions.CriticOptimizerOptions = criticOptions;
agentOptions.ActorOptimizerOptions = actorOptions;
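The definitions of criticOptions and actorOptions, and the construction of the two-critic agent that is checked below, are not shown above. A minimal sketch of those steps, assuming rlOptimizerOptions objects with illustrative learning rates and gradient thresholds (the options must be created before the assignments above), is:

% Sketch: optimizer options for the actor and the critics (values are illustrative)
criticOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
actorOptions  = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);

% Sketch: create the TD3 agent using the actor, the two critics, and the options
agent = rlTD3Agent(actor,[critic1 critic2],agentOptions);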
You can also create an rlTD3Agent object with a single critic. In this case, the object represents a
DDPG agent with target policy smoothing and delayed policy and target updates.
delayedDDPGAgent = rlTD3Agent(actor,critic1,agentOptions);
To check your agents, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
getAction(delayedDDPGAgent,{rand(obsInfo.Dimension)})
You can now test and train either agent within the environment.
For this example, load the environment used in the example “Train DDPG Agent to Control Double
Integrator System”. The observation from the environment is a vector containing the position and
velocity of a mass. The action is a scalar representing a force ranging continuously from -2 to 2
Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
A TD3 agent uses two Q-value function critics. To approximate each Q-value function, use a deep
recurrent neural network. The network for a single-output Q-value function critic must have two
input layers, one for the observation and the other for the action, and return a scalar value
representing the expected cumulative long-term reward following from the given observation and
action.
Define each network path as an array of layer objects, and get the dimensions of the observation and action spaces from the environment specification objects. To create a recurrent neural network, use a sequenceInputLayer as the input layer and include an lstmLayer as one of the other network layers.
criticNet = layerGraph;
criticNet = addLayers(criticNet,obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);
% Connect paths
criticNet = connectLayers(criticNet,"obsOut","cat/in1");
criticNet = connectLayers(criticNet,"actOut","cat/in2");
To initialize the network weights differently for the two critics, create two different dlnetwork objects. You must do this because the agent constructor function does not accept two identical critics.
criticNet1 = dlnetwork(criticNet);
criticNet2 = dlnetwork(criticNet);
summary(criticNet1)
Initialized: true
Inputs:
1 'obsIn' Sequence input with 2 dimensions (CTB)
2 'actIn' Sequence input with 1 dimensions (CTB)
Create the critic using rlQValueFunction. Use the same network structure for both critics. The
TD3 agent initializes the two networks using different default parameters.
critic1 = rlQValueFunction(criticNet1,obsInfo,actInfo);
critic2 = rlQValueFunction(criticNet2,obsInfo,actInfo);
getValue(critic1,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
-0.0060
getValue(critic2,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
ans = single
0.0481
Since the critics have recurrent networks, the actor must use a recurrent network as its approximation model too. For TD3 agents, the actor executes a deterministic policy, which is implemented by a continuous deterministic actor. In this case the network must take the observation signal as input and return an action. Therefore the output layer must have as many elements as the number of dimensions of the action space.

Define the network as an array of layer objects, and get the dimension of the observation space and the number of dimensions of the action space from the environment specification objects.
actorNet = [
sequenceInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(400)
lstmLayer(8)
reluLayer
fullyConnectedLayer(300,'Name','ActorFC2')
reluLayer
fullyConnectedLayer(prod(actInfo.Dimension))
tanhLayer
];
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 2 dimensions (CTB)
Create the actor using actorNet. TD3 agents use an rlContinuousDeterministicActor object
to implement the actor.
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);
To check your actor, use getAction to return the action from a random observation.

getAction(actor,{rand(obsInfo.Dimension)})
Specify agent options, including training options for actor and critics. To use a TD3 agent with
recurrent neural networks, you must specify a SequenceLength greater than 1.
agentOptions = rlTD3AgentOptions;
agentOptions.DiscountFactor = 0.99;
agentOptions.SequenceLength = 32;
agentOptions.TargetSmoothFactor = 5e-3;
agentOptions.TargetPolicySmoothModel.Variance = 0.2;
agentOptions.TargetPolicySmoothModel.LowerLimit = -0.5;
agentOptions.TargetPolicySmoothModel.UpperLimit = 0.5;
agentOptions.CriticOptimizerOptions = criticOptions;
agentOptions.ActorOptimizerOptions = actorOptions;
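As in the previous example, the definitions of criticOptions and actorOptions and the construction of the two-critic agent are not shown. A minimal sketch under the same assumptions (illustrative rlOptimizerOptions settings; the options must be created before the assignments above) is:

% Sketch: optimizer options and agent creation (values are illustrative)
criticOptions = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
actorOptions  = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
agent = rlTD3Agent(actor,[critic1 critic2],agentOptions);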
You can also create an rlTD3Agent object with a single critic. In this case, the object represents a
DDPG agent with target policy smoothing and delayed policy and target updates.
delayedDDPGAgent = rlTD3Agent(actor,critic1,agentOptions);
To check your agents, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo.Dimension)})
getAction(delayedDDPGAgent,{rand(obsInfo.Dimension)})
You can now test and train either agent within the environment.
Version History
Introduced in R2020a
See Also
rlAgentInitializationOptions | rlTD3AgentOptions | rlQValueFunction |
rlContinuousDeterministicActor | Deep Network Designer
Topics
“Twin-Delayed Deep Deterministic Policy Gradient Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
“Train Biped Robot to Walk Using Reinforcement Learning Agents”
rlTD3AgentOptions
Options for TD3 agent
Description
Use an rlTD3AgentOptions object to specify options for twin-delayed deep deterministic policy
gradient (TD3) agents. To create a TD3 agent, use rlTD3Agent.
For more information, see “Twin-Delayed Deep Deterministic Policy Gradient Agents”.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlTD3AgentOptions
opt = rlTD3AgentOptions(Name,Value)
Description
opt = rlTD3AgentOptions creates an options object for use as an argument when creating a TD3
agent using all default options. You can modify the object properties using dot notation.
Properties
ExplorationModel — Exploration noise model options
GaussianActionNoise object (default) | OrnsteinUhlenbeckActionNoise object
For an agent with multiple actions, if the actions have different ranges and units, it is likely that each
action requires different noise model parameters. If the actions have similar ranges and units, you
can set the noise parameters for all actions to the same value.
For example, for an agent with two actions, set the standard deviation of each action to a different
value while using the same decay rate for both standard deviations.
opt = rlTD3AgentOptions;
opt.ExplorationModel.StandardDeviation = [0.1 0.2];
opt.ExplorationModel.StandardDeviationDecayRate = 1e-4;
opt = rlTD3AgentOptions;
opt.ExplorationModel = rl.option.OrnsteinUhlenbeckActionNoise;
opt.ExplorationModel.StandardDeviation = 0.05;
Target smoothing noise model options, specified as a GaussianActionNoise object. This model adds noise to the target action, which keeps the policy from exploiting actions with incorrectly high Q-value estimates. For more information on noise models, see “Noise Models” on page 3-365.
For an agent with multiple actions, if the actions have different ranges and units, it is likely that each
action requires different smoothing noise model parameters. If the actions have similar ranges and
units, you can set the noise parameters for all actions to the same value.
For example, for an agent with two actions, set the standard deviation of each action to a different
value while using the same decay rate for both standard deviations.
opt = rlTD3AgentOptions;
opt.TargetPolicySmoothModel.StandardDeviation = [0.1 0.2];
opt.TargetPolicySmoothModel.StandardDeviationDecayRate = 1e-4;
Smoothing factor for target actor and critic updates, specified as a positive scalar less than or equal
to 1. For more information, see “Target Update Methods”.
Number of steps between target actor and critic updates, specified as a positive integer. For more
information, see “Target Update Methods”.
Option for clearing the experience buffer before training, specified as a logical value.
Maximum batch-training trajectory length when using a recurrent neural network, specified as a
positive integer. This value must be greater than 1 when using a recurrent neural network and 1
otherwise.
Size of random experience mini-batch, specified as a positive integer. During each training episode,
the agent randomly samples experiences from the experience buffer when computing gradients for
updating the critic properties. Large mini-batches reduce the variance when computing gradients but
increase the computational effort.
NumStepsToLookAhead — Number of future rewards used to estimate the value of the policy
1 (default) | positive integer
Number of future rewards used to estimate the value of the policy, specified as a positive integer. For
more information, see [1], Chapter 7.
Note that if parallel training is enabled (that is, if an rlTrainingOptions object with the UseParallel property set to true is passed to train), then NumStepsToLookAhead must be set to 1; otherwise, an error is generated. This guarantees that experiences are stored contiguously.
Experience buffer size, specified as a positive integer. During training, the agent computes updates
using a mini-batch of experiences randomly sampled from the buffer.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlTD3Agent Twin-delayed deep deterministic policy gradient reinforcement learning agent
Examples
Create an rlTD3AgentOptions object, specifying the mini-batch size.
opt = rlTD3AgentOptions('MiniBatchSize',48)
opt =
rlTD3AgentOptions with properties:
You can modify options using dot notation. For example, set the agent sample time to 0.5.
opt.SampleTime = 0.5;
Algorithms
Noise Models
Gaussian Action Noise
At each time step k, the Gaussian noise v is sampled as shown in the following code.
w = Mean + randn(ActionSize).*StandardDeviation(k);
v(k+1) = min(max(w,LowerLimit),UpperLimit);
Where the initial value v(1) is defined by the InitialAction parameter. At each sample time step,
the standard deviation decays as shown in the following code.
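A minimal sketch of the decay update, assuming a geometric decay floored at StandardDeviationMin (consistent with the half-life formula given later in this section):
% Assumed decay update for the noise standard deviation
decayedStandardDeviation = StandardDeviation.*(1 - StandardDeviationDecayRate);
StandardDeviation = max(decayedStandardDeviation,StandardDeviationMin);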
Ornstein-Uhlenbeck Action Noise
At each sample time step k, the noise value v(k) is updated using the following formula, where Ts is the agent sample time, and the initial value v(1) is defined by the InitialAction parameter.
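A sketch of one common discretized form of this update, assuming the Mean and MeanAttractionConstant properties of the OrnsteinUhlenbeckActionNoise object:
% Assumed Ornstein-Uhlenbeck update
v(k+1) = v(k) + MeanAttractionConstant.*(Mean - v(k)).*Ts ...
       + StandardDeviation(k).*randn(size(Mean)).*sqrt(Ts);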
At each sample time step, the standard deviation decays in the same way as shown above for the Gaussian action noise model.
You can calculate how many samples it will take for the standard deviation to be halved using this
simple formula.
halflife = log(0.5)/log(1-StandardDeviationDecayRate);
For continuous action signals, it is important to set the noise standard deviation appropriately to
encourage exploration. It is common to set StandardDeviation*sqrt(Ts) to a value between 1%
and 10% of your action range.
If your agent converges on local optima too quickly, promote agent exploration by increasing the
amount of noise; that is, by increasing the standard deviation. Also, to increase exploration, you can
reduce the StandardDeviationDecayRate.
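For instance, a rough sizing sketch, assuming action limits of -2 and 2 and a sample time of 0.05 seconds:
Ts = 0.05;                          % assumed agent sample time
actionRange = 2 - (-2);             % assumed action limits
sigma = 0.05*actionRange/sqrt(Ts);  % so that sigma*sqrt(Ts) is 5% of the range
opt = rlTD3AgentOptions(SampleTime=Ts);
opt.ExplorationModel.StandardDeviation = sigma;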
Version History
Introduced in R2020a
The properties defining the probability distribution of the Gaussian action noise model have changed.
This noise model is used by TD3 agents for exploration and target policy smoothing.
When a GaussianActionNoise noise object saved from a previous MATLAB release is loaded, the
value of VarianceDecayRate is copied to StandardDeviationDecayRate, while the square root
of the values of Variance and VarianceMin are copied to StandardDeviation and
StandardDeviationMin, respectively.
The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the Gaussian action noise model, use the new
property names instead.
Update Code
This table shows how to update your code to use the new property names for rlTD3AgentOptions
object td3opt.
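For illustration, assuming an exploration model previously configured with Variance = 0.05 and VarianceDecayRate = 1e-4, the equivalent assignments with the new property names are:
% Old (not recommended) property names, shown for comparison:
% td3opt.ExplorationModel.Variance = 0.05;
% td3opt.ExplorationModel.VarianceDecayRate = 1e-4;
% New property names:
td3opt.ExplorationModel.StandardDeviation = sqrt(0.05);
td3opt.ExplorationModel.StandardDeviationDecayRate = 1e-4;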
The properties defining the probability distribution of the Ornstein-Uhlenbeck (OU) noise model have
been renamed. TD3 agents use OU noise for exploration.
The Variance, VarianceDecayRate, and VarianceMin properties still work, but they are not
recommended. To define the probability distribution of the OU noise model, use the new property
names instead.
Update Code
This table shows how to update your code to use the new property names for rlTD3AgentOptions
object td3opt.
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second
edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.
See Also
Topics
“Twin-Delayed Deep Deterministic Policy Gradient Agents”
rlTrainingOptions
Options for training reinforcement learning agents
Description
Use an rlTrainingOptions object to specify training options for an agent. To train an agent, use
train.
For more information on training agents, see “Train Reinforcement Learning Agents”.
Creation
Syntax
trainOpts = rlTrainingOptions
opt = rlTrainingOptions(Name,Value)
Description
Properties
MaxEpisodes — Maximum number of episodes to train the agent
500 (default) | positive integer
Maximum number of episodes to train the agent, specified as a positive integer. Regardless of other
criteria for termination, training terminates after MaxEpisodes.
Example: 'MaxEpisodes',1000
Maximum number of steps to run per episode, specified as a positive integer. In general, you define
episode termination conditions in the environment. This value is the maximum number of steps to run
in the episode if other termination conditions are not met.
Example: 'MaxStepsPerEpisode',1000
Window length for averaging the scores, rewards, and number of steps for each agent, specified as a
scalar or vector.
If the training environment is a multi-agent Simulink environment, specify a scalar to apply the same window length to all agents. To use a different window length for each agent, specify a vector. In this case, the order of the elements in the vector corresponds to the order of the agents used during environment creation.
Example: 'ScoreAveragingWindowLength',10
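For instance, a sketch for an assumed two-agent Simulink environment:
trainOpts = rlTrainingOptions;
trainOpts.ScoreAveragingWindowLength = [10 20];  % one window length per agent (assumed two agents)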
• "AverageSteps" — Stop training when the running average number of steps per episode equals
or exceeds the critical value specified by the option StopTrainingValue. The average is
computed using the window 'ScoreAveragingWindowLength'.
• "AverageReward" — Stop training when the running average reward equals or exceeds the
critical value.
• "EpisodeReward" — Stop training when the reward in the current episode equals or exceeds the
critical value.
• "GlobalStepCount" — Stop training when the total number of steps in all episodes (the total
number of times the agent is invoked) equals or exceeds the critical value.
• "EpisodeCount" — Stop training when the number of training episodes equals or exceeds the
critical value.
Example: 'StopTrainingCriteria',"AverageReward"
If the training environment is a multi-agent Simulink environment, specify a scalar to apply the same
termination criterion to all agents. To use a different termination criterion for each agent, specify
StopTrainingValue as a vector. In this case, the order of the elements in the vector corresponds to
the order of the agents used during environment creation.
For a given agent, training ends when the termination condition specified by the StopTrainingCriteria option equals or exceeds this value. For the other agents, the training continues until:
• The remaining agents reach their own termination criteria.
• The maximum number of episodes is reached.
• You stop training manually (for example, using the Stop Training button in Episode Manager).
Condition for saving agents during training, specified as "none" (do not save any agents), "EpisodeReward", "AverageSteps", "AverageReward", "GlobalStepCount", or "EpisodeCount". These criteria are analogous to the corresponding StopTrainingCriteria values.
Set this option to store candidate agents that perform well according to the criteria you specify. When
you set this option to a value other than "none", the software sets the SaveAgentValue option to
500. You can change that value to specify the condition for saving the agent.
For instance, suppose you want to store for further testing any agent that yields an episode reward
that equals or exceeds 100. To do so, set SaveAgentCriteria to "EpisodeReward" and set the
SaveAgentValue option to 100. When an episode reward equals or exceeds 100, train saves the
corresponding agent in a MAT file in the folder specified by the SaveAgentDirectory option. The
MAT file is called AgentK.mat, where K is the number of the corresponding episode. The agent is
stored within that MAT file as saved_agent.
Example: 'SaveAgentCriteria',"EpisodeReward"
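A sketch of this saving configuration (the threshold value of 100 follows the description above; the folder name is an assumption):
opt = rlTrainingOptions( ...
    SaveAgentCriteria="EpisodeReward", ...
    SaveAgentValue=100, ...
    SaveAgentDirectory="savedAgents");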
Critical value of the condition for saving agents, specified as a scalar or a vector.
If the training environment is a multi-agent Simulink environment, specify a scalar to apply the same
saving criterion to each agent. To save the agents when one meets a particular criterion, specify
SaveAgentValue as a vector. In this case, the order of the elements in the vector corresponds to the
order of the agents used when creating the environment. When a criterion for saving an agent is met, all agents are saved in the same MAT file.
When you specify a condition for saving candidate agents using SaveAgentCriteria, the software
sets this value to 500. Change the value to specify the condition for saving the agent. See the
SaveAgentCriteria option for more details.
Example: 'SaveAgentValue',100
Folder for saved agents, specified as a string or character vector. The folder name can contain a full
or relative path. When an episode occurs that satisfies the condition specified by the
SaveAgentCriteria and SaveAgentValue options, the software saves the agents in a MAT file in
this folder. If the folder does not exist, train creates it. When SaveAgentCriteria is "none", this
option is ignored and train does not create a folder.
Example: 'SaveAgentDirectory', pwd + "\run1\Agents"
Flag for using parallel training, specified as a logical. Setting this option to true configures
training to use parallel processing to simulate the environment, thereby enabling usage of multiple
cores, processors, computer clusters or cloud resources to speed up training. To specify options for
parallel training, use the ParallelizationOptions property.
When UseParallel is true, then for DQN, DDPG, TD3, and SAC agents the NumStepsToLookAhead property of the corresponding agent options object must be set to 1; otherwise an error is generated. This guarantees that experiences are stored contiguously. When AC agents are trained in parallel, a warning is generated if the StepsUntilDataIsSent property of the ParallelizationOptions object is set to a different value than the NumStepsToLookAhead property of the AC agent options object.
Note that if you want to speed up deep neural network calculations (such as gradient computation,
parameter update and prediction) using a local GPU, you do not need to set UseParallel to true.
Instead, when creating your actor or critic representation, use an rlRepresentationOptions
object in which the UseDevice option is set to "gpu". Using parallel computing or the GPU requires
Parallel Computing Toolbox software. Using computer clusters or cloud resources additionally
requires MATLAB Parallel Server. For more information about training using multicore processors
and GPUs, see “Train Agents Using Parallel Computing and GPUs”.
Example: 'UseParallel',true
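For example, a minimal sketch of the GPU alternative described in the preceding paragraph:
repOpts = rlRepresentationOptions(UseDevice="gpu");
% Pass repOpts when creating the actor or critic representation.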
The ParallelTraining object has the following properties, which you can modify using dot
notation after creating the rlTrainingOptions object.
• "sync" — Use parpool to run synchronous training on the available workers. In this case,
workers pause execution until all workers are finished. The host updates the actor and critic
parameters based on the results from all the workers and sends the updated parameters to all
workers. Note that synchronous training is required for gradient-based parallelization; that is, when DataToSendFromWorkers is set to "gradients", Mode must be set to "sync".
• "async" — Use parpool to run asynchronous training on the available workers. In this case,
workers send their data back to the host as soon as they finish and receive updated parameters
from the host. The workers then continue with their task.
Random seed initialization for the workers, specified as one of the following:
• –1 — Assign a unique random seed to each worker. The value of the seed is the worker ID.
• –2 — Do not assign a random seed to the workers.
• Vector — Manually specify the random seed for each worker. The number of elements in the
vector must match the number of workers.
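For example, a sketch assigning explicit seeds to an assumed pool of four workers:
trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.WorkerRandomSeeds = [1 2 3 4];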
Option to send model and workspace variables to parallel workers, specified as "on" or "off". When
the option is "on", the host sends variables used in models and defined in the base MATLAB
workspace to the workers.
Additional files to attach to the parallel pool, specified as a string or string array.
Function to run before training starts, specified as a handle to a function having no input arguments.
This function is run once per worker before training begins. Write this function to perform any
processing that you need prior to training.
Function to run after training ends, specified as a handle to a function having no input arguments.
You can write this function to clean up the workspace or perform other processing after training
terminates.
Display training progress on the command line, specified as the logical values false (0) or true (1).
Set to true to write information from each training episode to the MATLAB command line during
training.
Option to stop training when an error occurs during an episode, specified as "on" or "off". When
this option is "off", errors are captured and returned in the SimulationInfo output of train, and
training continues to the next episode.
Object Functions
train Train reinforcement learning agents within a specified environment
Examples
Create an options set for training a reinforcement learning agent. Set the maximum number of
episodes and the maximum number of steps per episode to 1000. Configure the options to stop
training when the average reward equals or exceeds 480, and turn on both the command-line display
and Reinforcement Learning Episode Manager for displaying training results. You can set the options
using name-value pair arguments when you create the options set. Any options that you do not
explicitly set have their default values.
trainOpts = rlTrainingOptions(...
'MaxEpisodes',1000,...
'MaxStepsPerEpisode',1000,...
'StopTrainingCriteria',"AverageReward",...
'StopTrainingValue',480,...
'Verbose',true,...
'Plots',"training-progress")
trainOpts =
rlTrainingOptions with properties:
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 5
StopTrainingCriteria: "AverageReward"
StopTrainingValue: 480
SaveAgentCriteria: "none"
SaveAgentValue: "none"
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelTraining]
Alternatively, create a default options set and use dot notation to change some of the values.
trainOpts = rlTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 1000;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 480;
trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";
trainOpts
trainOpts =
rlTrainingOptions with properties:
MaxEpisodes: 1000
MaxStepsPerEpisode: 1000
ScoreAveragingWindowLength: 5
StopTrainingCriteria: "AverageReward"
StopTrainingValue: 480
SaveAgentCriteria: "none"
SaveAgentValue: "none"
SaveAgentDirectory: "savedAgents"
Verbose: 1
Plots: "training-progress"
StopOnError: "on"
UseParallel: 0
ParallelizationOptions: [1x1 rl.option.ParallelTraining]
You can now use trainOpts as an input argument to the train command.
To turn on parallel computing for training a reinforcement learning agent, set the UseParallel
training option to true.
trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions
ans =
ParallelTraining with properties:
Mode: "async"
WorkerRandomSeeds: -1
TransferBaseWorkspaceVariables: "on"
AttachedFiles: []
SetupFcn: []
CleanupFcn: []
You can now use trainOpts as an input argument to the train command to perform training with
parallel computing.
To train an agent using the asynchronous advantage actor-critic (A3C) method, you must set the
agent and parallel training options appropriately.
When creating the AC agent, set the NumStepsToLookAhead value to be greater than 1. Common
values are 64 and 128.
agentOpts = rlACAgentOptions(NumStepsToLookAhead=64);
Use agentOpts when creating your agent. Alternatively, create your agent first and then modify its
options, including the actor and critic options later using dot notation.
trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.Mode = "async";
Configure the workers to return gradient data to the host. Also, set the number of steps before the
workers send data back to the host to match the number of steps to look ahead.
trainOpts.ParallelizationOptions.DataToSendFromWorkers = ...
"gradients";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = ...
agentOpts.NumStepsToLookAhead;
For an example on asynchronous advantage actor-critic agent training, see “Train AC Agent to
Balance Cart-Pole System Using Parallel Computing”.
Version History
Introduced in R2019a
See Also
train | rlMultiAgentTrainingOptions
Topics
“Train Reinforcement Learning Agents”
rlTRPOAgent
Trust region policy optimization reinforcement learning agent
Description
Trust region policy optimization (TRPO) is a model-free, online, on-policy, policy gradient
reinforcement learning method. This algorithm prevents significant performance drops compared to
standard policy gradient methods by keeping the updated policy within a trust region close to the
current policy. The action space can be either discrete or continuous.
For more information on TRPO agents, see “Trust Region Policy Optimization Agents”. For more
information on the different types of reinforcement learning agents, see “Reinforcement Learning
Agents”.
Creation
Syntax
agent = rlTRPOAgent(observationInfo,actionInfo)
agent = rlTRPOAgent(observationInfo,actionInfo,initOpts)
agent = rlTRPOAgent(actor,critic)
Description
agent = rlTRPOAgent(actor,critic) creates a TRPO agent with the specified actor and critic,
using the default options for the agent.
agent = rlTRPOAgent( ___ ,agentOptions) creates a TRPO agent and sets the AgentOptions
property to the agentOptions input argument. Use this syntax after any of the input arguments in
the previous syntaxes.
Input Arguments
actor — Actor
rlDiscreteCategoricalActor object | rlContinuousGaussianActor object
Actor that implements the policy, specified as an rlDiscreteCategoricalActor object (for discrete action spaces) or an rlContinuousGaussianActor object (for continuous action spaces). For more information on creating actor approximators, see "Create Policies and Value Functions".
critic — Critic
rlValueFunction object
Critic that estimates the discounted long-term reward, specified as an rlValueFunction object. For
more information on creating critic approximators, see “Create Policies and Value Functions”.
Properties
ObservationInfo — Observation specifications
specification object | array of specification objects
If you create the agent by specifying an actor and critic, the value of ObservationInfo matches the
value specified in the actor and critic objects.
For a discrete action space, you must specify actionInfo as an rlFiniteSetSpec object.
For a continuous action space, you must specify actionInfo as an rlNumericSpec object.
If you create the agent by specifying an actor and critic, the value of ActionInfo matches the value
specified in the actor and critic objects.
You can extract actionInfo from an existing environment or agent using getActionInfo. You can
also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
Option to use exploration policy when selecting actions, specified as one of the following logical values.
• true — Use the base agent exploration policy when selecting actions in sim and
generatePolicyFunction. In this case, the agent selects its actions by sampling its probability
distribution, the policy is therefore stochastic and the agent explores its observation space.
• false — Use the base agent greedy policy (the action with maximum likelihood) when selecting
actions in sim and generatePolicyFunction. In this case, the simulated agent and generated
policy behave deterministically.
Note This option affects only simulation and deployment; it does not affect training.
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations. The value of SampleTime matches the value specified in AgentOptions.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified
environment
getAction Obtain action from agent, actor, or policy object given environment
observations
getActor Get actor from reinforcement learning agent
setActor Set actor of reinforcement learning agent
getCritic Get critic from reinforcement learning agent
setCritic Set critic of reinforcement learning agent
Examples
Create an environment with a discrete action space, and obtain its observation and action
specifications. For this example, load the environment used in the example “Create Agent Using Deep
Network Designer and Train Using Image Observations”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
with five possible elements (a torque of either -2, -1, 0, 1, or 2 Nm applied to a swinging pole).
env = rlPredefinedEnv("SimplePendulumWithImage-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.
% rng(0)
Create a TRPO agent from the environment observation and action specifications.
agent = rlTRPOAgent(obsInfo,actInfo);
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment with a continuous action space and obtain its observation and action
specifications. For this example, load the environment used in the example “Train DDPG Agent to
Swing Up and Balance Pendulum with Image Observation”. This environment has two observations: a
50-by-50 grayscale image and a scalar (the angular velocity of the pendulum). The action is a scalar
representing a torque ranging continuously from -2 to 2 Nm.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create an agent initialization options object, specifying that each hidden fully connected layer in the
network must have 128 neurons.
initOpts = rlAgentInitializationOptions('NumHiddenUnit',128);
The agent creation function initializes the actor and critic networks randomly. You can ensure
reproducibility by fixing the seed of the random generator. To do so, uncomment the following line.
% rng(0)
Create a TRPO agent from the environment observation and action specifications using the specified
initialization options.
agent = rlTRPOAgent(obsInfo,actInfo,initOpts);
Extract the deep neural networks from both the agent actor and critic.
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
You can verify that the networks have 128 units in their hidden fully connected layers. For example,
display the layers of the critic network.
criticNet.Layers
ans =
11x1 Layer array with layers:
To check your agent, use getAction to return the action from a random observation.
getAction(agent,{rand(obsInfo(1).Dimension),rand(obsInfo(2).Dimension)})
You can now test and train the agent within the environment.
Create an environment interface, and obtain its observation and action specifications.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For TRPO agents, the critic estimates a value function, therefore it must take the observation signal
as input and return a scalar value. Create a deep neural network to be used as approximation model
within the critic. Define the network as an array of layer objects.
criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)
];
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'input' 4 features
Create the critic using criticNet. TRPO agents use an rlValueFunction object to implement the
critic.
critic = rlValueFunction(criticNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.2479
To approximate the policy within the actor use a neural network. For TRPO agents, the actor executes
a stochastic policy, which for discrete action spaces is implemented by a discrete categorical actor. In
this case the approximator must take the observation signal as input and return a probability for each
action. Therefore the output layer must have as many elements as the number of possible actions.
Define the network as an array of layer objects, getting the dimension of the observation space and
the number of possible actions from the environment specification objects.
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(200)
reluLayer
fullyConnectedLayer(numel(actInfo.Elements))
];
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'input' 4 features
Create the actor using actorNet. TRPO agents use an rlDiscreteCategoricalActor object to implement the actor for discrete action spaces.
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
getAction(actor,{rand(obsInfo.Dimension)})
agent = rlTRPOAgent(actor,critic)
agent =
rlTRPOAgent with properties:
Specify agent options, including training options for the actor and the critic.
agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent against the environment.
Create an environment with a continuous action space, and obtain its observation and action
specifications. For this example, load the double integrator continuous action space environment used
in the example “Train DDPG Agent to Control Double Integrator System”. The observation from the
environment is a vector containing the position and velocity of a mass. The action is a scalar
representing a force applied to the mass, ranging continuously from -2 to 2 Newton.
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
obsInfo = getObservationInfo(env)
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "states"
Description: "x, dx"
Dimension: [2 1]
DataType: "double"
actInfo = getActionInfo(env)
actInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "force"
Description: [0x0 string]
Dimension: [1 1]
DataType: "double"
In this example, the action is a scalar value representing a force ranging from -2 to 2 Newton. To
make sure that the output from the agent is in this range, you perform an appropriate scaling
operation. Store these limits so you can easily access them later.
actInfo.LowerLimit=-2;
actInfo.UpperLimit=2;
The actor and critic networks are initialized randomly. You can ensure reproducibility by fixing the
seed of the random generator.
rng(0)
Create a deep neural network to be used as approximation model within the critic. For TRPO agents,
the critic estimates a value function, therefore it must take the observation signal as input and return
a scalar value.
criticNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(100)
reluLayer
fullyConnectedLayer(1)];
criticNet = dlnetwork(criticNet);
summary(criticNet)
Initialized: true
Inputs:
1 'input' 2 features
Create the critic using criticNet. TRPO agents use an rlValueFunction object to implement the critic.
critic = rlValueFunction(criticNet,obsInfo);
getValue(critic,{rand(obsInfo.Dimension)})
ans = single
-0.0899
To approximate the policy within the actor, use a neural network. For TRPO agents, the actor
executes a stochastic policy, which for continuous action spaces is implemented by a continuous
Gaussian actor. In this case the network must take the observation signal as input and return both a
mean value and a standard deviation value for each action. Therefore it must have two output layers
(one for the mean values the other for the standard deviation values), each having as many elements
as the dimension of the action space.
Note that standard deviations must be nonnegative and mean values must fall within the range of the
action. Therefore the output layer that returns the standard deviations must be a softplus or ReLU
layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling
layer, to scale the mean values to the output range.
Define each network path as an array of layer objects. Get the dimensions of the observation and
action spaces, and the action range limits from the environment specification objects. Specify a name
for the input and output layers, so you can later explicitly associate them with the appropriate
environment channel.
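A minimal sketch of such a network, with layer names chosen to match the connection commands that follow (the layer sizes and exact path structure are assumptions, not the original listing):
% Sketch of the three network paths (layer sizes are assumptions); the
% layer names match the connectLayers commands below.
commonPath = [
    featureInputLayer(prod(obsInfo.Dimension),Name="comPathIn")
    fullyConnectedLayer(100)
    reluLayer
    fullyConnectedLayer(1,Name="comPathOut") ];
meanPath = [
    fullyConnectedLayer(32,Name="meanPathIn")
    reluLayer
    fullyConnectedLayer(prod(actInfo.Dimension))
    tanhLayer
    scalingLayer(Name="meanPathOut",Scale=actInfo.UpperLimit) ];
stdPath = [
    fullyConnectedLayer(32,Name="stdPathIn")
    reluLayer
    fullyConnectedLayer(prod(actInfo.Dimension))
    softplusLayer(Name="stdPathOut") ];
% Assemble the layer graph before connecting the paths
actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,stdPath);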
% Connect paths
actorNet = connectLayers(actorNet,"comPathOut","meanPathIn/in");
actorNet = connectLayers(actorNet,"comPathOut","stdPathIn/in");
% Plot network
plot(actorNet)
% Convert the layer graph to a dlnetwork object and display a summary
actorNet = dlnetwork(actorNet);
summary(actorNet)
Initialized: true
Inputs:
1 'comPathIn' 2 features
Create the actor using actorNet. TRPO agents use an rlContinuousGaussianActor object to
implement the actor for continuous action spaces.
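A sketch of the construction call, assuming the layer names used above and the documented name-value arguments of rlContinuousGaussianActor:
actor = rlContinuousGaussianActor(actorNet,obsInfo,actInfo, ...
    ObservationInputNames="comPathIn", ...
    ActionMeanOutputNames="meanPathOut", ...
    ActionStandardDeviationOutputNames="stdPathOut");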
getAction(actor,{rand(obsInfo.Dimension)})
agent = rlTRPOAgent(actor,critic)
agent =
rlTRPOAgent with properties:
Specify agent options, including training options for the actor and the critic.
agent.AgentOptions.ExperienceHorizon = 1024;
agent.AgentOptions.DiscountFactor = 0.95;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 8e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
getAction(agent,{rand(obsInfo.Dimension)})
You can now test and train the agent within the environment.
Tips
• For continuous action spaces, this agent does not enforce the constraints set by the action
specification. In this case, you must enforce action space constraints within the environment.
• While tuning the learning rate of the actor network is necessary for PPO agents, it is not
necessary for TRPO agents.
• For high-dimensional observations, such as for images, it is recommended to use PPO, SAC, or
TD3 agents.
Version History
Introduced in R2021b
See Also
rlTRPOAgentOptions | rlValueFunction | rlDiscreteCategoricalActor |
rlContinuousGaussianActor | Deep Network Designer
Topics
“Trust Region Policy Optimization Agents”
“Reinforcement Learning Agents”
“Train Reinforcement Learning Agents”
rlTRPOAgentOptions
Options for TRPO agent
Description
Use an rlTRPOAgentOptions object to specify options for trust region policy optimization (TRPO)
agents. To create a TRPO agent, use rlTRPOAgent.
For more information on TRPO agents, see “Trust Region Policy Optimization Agents”.
For more information on the different types of reinforcement learning agents, see “Reinforcement
Learning Agents”.
Creation
Syntax
opt = rlTRPOAgentOptions
opt = rlTRPOAgentOptions(Name,Value)
Description
Properties
ExperienceHorizon — Number of steps the agent interacts with the environment before
learning
512 (default) | positive integer
Number of steps the agent interacts with the environment before learning from its experience,
specified as a positive integer.
The ExperienceHorizon value must be greater than or equal to the MiniBatchSize value.
Mini-batch size used for each learning epoch, specified as a positive integer. When the agent uses a
recurrent neural network, MiniBatchSize is treated as the training trajectory length.
The MiniBatchSize value must be less than or equal to the ExperienceHorizon value.
Entropy loss weight, specified as a scalar value between 0 and 1. A higher entropy loss weight value
promotes agent exploration by applying a penalty for being too certain about which action to take.
Doing so can help the agent move out of local optima.
When gradients are computed during training, an additional gradient component is computed for
minimizing the entropy loss. For more information, see “Entropy Loss”.
Number of epochs for which the actor and critic networks learn from the current experience set,
specified as a positive integer.
Method used to estimate advantage values, specified as "gae" (generalized advantage estimator) or "finite-horizon". For more information on these methods, see the training algorithm information in "Proximal Policy Optimization Agents".
Smoothing factor for generalized advantage estimator, specified as a scalar value between 0 and 1, inclusive. This option applies only when the AdvantageEstimateMethod option is "gae".
Conjugate gradient damping factor for numerical stability, specified as a nonnegative scalar.
Upper limit for the Kullback-Leibler (KL) divergence between the old policy and the current policy,
specified as a positive scalar.
Maximum number of iterations for conjugate gradient descent, specified as a positive integer.
Conjugate gradient residual tolerance, specified as a positive scalar. Once the residual for the
conjugate gradient algorithm is below this tolerance, the algorithm stops.
Method for normalizing advantage function values, specified as "none" (do not normalize), "current" (normalize using the current mini-batch of experiences), or "moving" (normalize using a moving window of recent experiences).
In some environments, you can improve agent performance by normalizing the advantage function
during training. The agent normalizes the advantage function by subtracting the mean advantage
value and scaling by the standard deviation.
Window size for normalizing advantage function values, specified as a positive integer. Use this
option when the NormalizedAdvantageMethod option is "moving".
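For example, a sketch enabling moving-window normalization (the window size is an assumption):
opt = rlTRPOAgentOptions;
opt.NormalizedAdvantageMethod = "moving";
opt.AdvantageNormalizingWindow = 500000;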
Sample time of agent, specified as a positive scalar or as -1. Setting this parameter to -1 allows for
event-based simulations.
Within a Simulink environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime is -1, the block inherits the sample time from its parent subsystem.
Within a MATLAB environment, the agent is executed every time the environment advances. In this
case, SampleTime is the time interval between consecutive elements in the output experience
returned by sim or train. If SampleTime is -1, the time interval between consecutive elements in
the returned output experience reflects the timing of the event that triggers the agent execution.
Discount factor applied to future rewards during training, specified as a positive scalar less than or
equal to 1.
Object Functions
rlTRPOAgent Trust region policy optimization reinforcement learning agent
Examples
Create an rlTRPOAgentOptions object, specifying the discount factor.
opt = rlTRPOAgentOptions('DiscountFactor',0.9)
opt =
rlTRPOAgentOptions with properties:
ExperienceHorizon: 512
MiniBatchSize: 128
EntropyLossWeight: 0.0100
NumEpoch: 3
AdvantageEstimateMethod: "gae"
GAEFactor: 0.9500
ConjugateGradientDamping: 0.1000
KLDivergenceLimit: 0.0100
NumIterationsConjugateGradient: 10
NumIterationsLineSearch: 10
ConjugateGradientResidualTolerance: 1.0000e-08
NormalizedAdvantageMethod: "none"
AdvantageNormalizingWindow: 1000000
CriticOptimizerOptions: [1x1 rl.option.rlOptimizerOptions]
SampleTime: 1
DiscountFactor: 0.9000
InfoToSave: [1x1 struct]
You can modify options using dot notation. For example, set the agent sample time to 0.1.
opt.SampleTime = 0.1;
Version History
Introduced in R2021b
The UseDeterministicExploitation agent option is no longer recommended; use the agent UseExplorationPolicy property instead. Previously, you set UseDeterministicExploitation as follows.
• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.
agent.AgentOptions.UseDeterministicExploitation = true;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.
agent.AgentOptions.UseDeterministicExploitation = false;
Now, set the UseExplorationPolicy property of the agent instead.
• Force the agent to always select the action with maximum likelihood, thereby using a greedy
deterministic policy for simulation and deployment.
agent.UseExplorationPolicy = false;
• Allow the agent to select its action by sampling its probability distribution for simulation and
policy deployment, thereby using a stochastic policy that explores the observation space.
agent.UseExplorationPolicy = true;
See Also
Topics
“Trust Region Policy Optimization Agents”
rlValueFunction
Value function approximator object for reinforcement learning agents
Description
This object implements a value function approximator object that you can use as a critic for a
reinforcement learning agent. A value function maps an environment state to a scalar value. The
output represents the predicted discounted cumulative long-term reward when the agent starts from
the given state and takes the best possible action. After you create an rlValueFunction critic, use
it to create an agent such as an rlACAgent, rlPGAgent, or rlPPOAgent agent. For an example of
this workflow, see “Create Actor and Critic Representations” on page 3-412. For more information on
creating value functions, see “Create Policies and Value Functions”.
Creation
Syntax
critic = rlValueFunction(net,observationInfo)
critic = rlValueFunction(net,ObservationInputNames=netObsNames)
critic = rlValueFunction(tab,observationInfo)
critic = rlValueFunction({basisFcn,W0},observationInfo)
Description
critic = rlValueFunction({basisFcn,W0},observationInfo) creates the value function object critic whose underlying approximator is a custom basis function. The first input argument is a two-element cell array whose first element is the handle basisFcn to a custom basis function and
whose second element is the initial weight vector W0. The function sets the ObservationInfo
property of critic to the observationInfo input argument.
Input Arguments
Deep neural network used as the underlying approximator within the critic, specified as one of the
following:
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.
The network must have the environment observation channels as inputs and a single scalar as output.
The learnable parameters of the critic are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of character vectors. When you use this argument, the function assigns, in sequential order, each environment observation channel specified in observationInfo to each network input layer specified by the
corresponding name in the string array netObsNames. Therefore, the network input layers, ordered
as the names in netObsNames, must have the same data type and dimensions as the observation
specifications, as ordered in observationInfo.
Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: {"NetInput1_airspeed","NetInput2_altitude"}
Value table, specified as an rlTable object containing a column vector with length equal to the
number of possible observations from the environment. Each element is the predicted discounted
cumulative long-term reward when the agent starts from the given observation and takes the best
possible action. The elements of this vector are the learnable parameters of the representation.
Custom basis function, specified as a function handle to a user-defined function. The user defined
function can either be an anonymous function or a function on the MATLAB path. The output of the
critic is the scalar c = W'*B, where W is a weight vector containing the learnable parameters and B
is the column vector returned by the custom basis function.
Your basis function must have the following signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
environment observation channels defined in observationInfo.
For an example on how to use a basis function to create a value function critic with a mixed
continuous and discrete observation space, see “Create Mixed Observation Space Value Function
Critic from Custom Basis Function” on page 3-407.
Example: @(obs1,obs2,obs3) [obs3(1)*obs1(1)^2; abs(obs2(5)+obs1(2))]
Initial value of the basis function weights W, specified as a column vector having the same length as
the vector returned by the basis function.
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors. These errors can produce different results compared to performing the same operations on a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and
actions
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
accelerate Option to accelerate computation of gradient for approximator object
based on neural network
getLearnableParameters Obtain learnable parameter values from agent, function approximator, or
policy object
setLearnableParameters Set learnable parameter values of agent, function approximator, or policy
object
setModel Set function approximation model for actor or critic
getModel Get function approximator model from actor or critic
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
obsInfo = rlNumericSpec([4 1]);
Create a deep neural network to approximate the value function within the critic, as a column vector
of layer objects. The network input layer must accept a four-element vector (the observation vector
defined by obsInfo), and the output must be a scalar (the value, representing the expected
cumulative long-term reward when the agent starts from the given observation).
You can also obtain the number of observations from the obsInfo specification (regardless of
whether the observation space is a column vector, row vector, or matrix,
prod(obsInfo.Dimension) is its total number of dimensions, in this case equal to 4).
net = [ featureInputLayer(prod(obsInfo.Dimension));
fullyConnectedLayer(10);
reluLayer;
fullyConnectedLayer(1,Name="value")];
dlnet = dlnetwork(net);
You can plot the network using plot and display its main characteristics, like the number of weights,
using summary.
plot(dlnet)
summary(dlnet)
Initialized: true
Number of learnables: 61
Inputs:
1 'input' 4 features
Create the critic using the network and the observation specification object. When you use this syntax
the network input layer is automatically associated with the environment observation according to
the dimension specifications in obsInfo.
critic = rlValueFunction(dlnet,obsInfo)
critic =
rlValueFunction with properties:
To check your critic, use getValue to return the value of a random observation, using the current
network weights.
v = getValue(critic,{rand(obsInfo.Dimension)})
v = single
0.5196
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).
Create an actor and a critic that you can use to define a reinforcement learning agent such as an
Actor Critic (AC) agent. For this example, create actor and critic for an agent that can be trained
against the cart-pole environment described in “Train AC Agent to Balance Cart-Pole System”.
First, create the environment. Then, extract the observation and action specifications from the
environment. You need these specifications to define the agent and critic.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
A state-value-function critic, such as those used for AC or PG agents, has the current observation as
input and the state value, a scalar, as output. For this example, to approximate the value function
within the critic, create a deep neural network with one output (the value) and four inputs (the
environment observation signals x, xdot, theta, and thetadot).
Create the network as a column vector of layer objects. You can obtain the number of observations
from the obsInfo specification (regardless of whether the observation space is a column vector, row
vector, or matrix, prod(obsInfo.Dimension) is its total number of dimensions). Name the network
input layer criticNetInput.
criticNetwork = [
featureInputLayer(prod(obsInfo.Dimension),...
Name="criticNetInput");
fullyConnectedLayer(10);
reluLayer;
fullyConnectedLayer(1,Name="CriticFC")];
criticNetwork = dlnetwork(criticNetwork);
summary(criticNetwork)
Initialized: true
Number of learnables: 61
Inputs:
1 'criticNetInput' 4 features
Create the critic using the specified neural network. Also, specify the observation information for the critic. Set the observation input name to criticNetInput, which is the name of the criticNetwork input layer.
critic = rlValueFunction(criticNetwork,obsInfo,...
ObservationInputNames={'criticNetInput'})
critic =
rlValueFunction with properties:
Check your critic using getValue to return the value of a random observation, given the current
network weights.
v = getValue(critic,{rand(obsInfo.Dimension)})
v = single
0.5196
Specify the critic optimization options using rlOptimizerOptions. These options control the learning of the critic network parameters. For this example, set the learning rate to 0.01 and the gradient threshold to 1.
criticOpts = rlOptimizerOptions(LearnRate=0.01,GradientThreshold=1);
An AC agent decides which action to take given observations using a policy which is represented by
an actor. For an actor, the inputs are the environment observations, and the output depends on
whether the action space is discrete or continuous. The actor in this example has two possible
discrete actions, –10 or 10. To create the actor, use a deep neural network that can output these two
values given the same observation input as the critic.
Create the network using a row vector of two layer objects. You can obtain the number of actions
from the actInfo specification. Name the network output actorNetOutput.
actorNetwork = [
featureInputLayer( ...
prod(obsInfo.Dimension),...
Name="actorNetInput")
fullyConnectedLayer( ...
numel(actInfo.Elements), ...
Name="actorNetOutput")];
actorNetwork = dlnetwork(actorNetwork);
summary(actorNetwork)
Initialized: true
Number of learnables: 10
Inputs:
1 'actorNetInput' 4 features
Create the actor using rlDiscreteCategoricalActor together with the observation and action
specifications, and the name of the network input layer to be associated with the environment
observation channel.
actor = rlDiscreteCategoricalActor(actorNetwork,obsInfo,actInfo,...
ObservationInputNames={'actorNetInput'})
actor =
rlDiscreteCategoricalActor with properties:
To check your actor, use getAction to return a random action from a given observation, using the
current network weights.
a = getAction(actor,{rand(obsInfo.Dimension)})
Specify the actor optimization options using rlOptimizerOptions. These options control the learning of the actor network parameters. For this example, set the learning rate to 0.05 and the gradient threshold to 1.
actorOpts = rlOptimizerOptions(LearnRate=0.05,GradientThreshold=1);
Create an AC agent using the actor and critic. Use the optimizer options objects previously created
for both actor and critic.
agentOpts = rlACAgentOptions(...
NumStepsToLookAhead=32,...
DiscountFactor=0.99,...
CriticOptimizerOptions=criticOpts,...
ActorOptimizerOptions=actorOpts);
agent = rlACAgent(actor,critic,agentOpts)
agent =
rlACAgent with properties:
To check your agent, use getAction to return a random action from a given observation, using the
current actor and critic network weights.
act = getAction(agent,{rand(obsInfo.Dimension)})
For additional examples showing how to create actors and critics for different agent types, see:
Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set consisting of four possible values: 1, 3, 4 and 7.
obsInfo = rlFiniteSetSpec([1 3 4 7]);
vTable = rlTable(obsInfo);
The table is a column vector in which each entry stores the predicted cumulative long-term reward
for each possible observation as defined by obsInfo. You can access the table using the Table
property of the vTable object. The initial value of each element is zero.
vTable.Table
ans = 4×1
0
0
0
0
You can also initialize the table to any value, in this case, an array containing all the integers from 1
to 4.
vTable.Table = reshape(1:4,4,1)
vTable =
rlTable with properties:
Create the critic using the table and the observation specification object.
critic = rlValueFunction(vTable,obsInfo)
critic =
rlValueFunction with properties:
To check your critic, use the getValue function to return the value of a given observation, using the
current table entries.
v = getValue(critic,{7})
v = 4
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.
The output of the critic is the scalar W'*myBasisFcn(myobs), where W is a weight column vector
which must have the same size as the custom basis function output. This output is the expected
cumulative long term reward when the agent starts from the given observation and takes the best
possible action. The elements of W are the learnable parameters.
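For reference, one observation specification and basis function consistent with the weight vector and the value printed below (the exact form is an assumption):
obsInfo = rlNumericSpec([4 1]);
% Assumed basis function; returns a three-element column vector
myBasisFcn = @(myobs) [myobs(2)^2; myobs(3)+exp(myobs(1)); abs(myobs(4))];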
W0 = [3;5;2];
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second argument is the observation specification object.
critic = rlValueFunction({myBasisFcn,W0},obsInfo)
critic =
rlValueFunction with properties:
To check your critic, use the getValue function to return the value of a given observation, using the
current parameter vector.
v = getValue(critic,{[2 4 6 8]'})
v = 130.9453
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
To approximate the value function within the critic, create a recurrent deep neural network as a row vector of layer objects. Use a sequenceInputLayer as the input layer (obsInfo.Dimension(1) is the dimension of the observation space) and include at least one lstmLayer.
myNet = [
sequenceInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(8, Name="fc")
reluLayer(Name="relu")
lstmLayer(8,OutputMode="sequence")
fullyConnectedLayer(1,Name="output")];
dlCriticNet = dlnetwork(myNet);
summary(dlCriticNet)
Initialized: true
Inputs:
1 'sequenceinput' Sequence input with 4 dimensions
critic = rlValueFunction(dlCriticNet,obsInfo)
critic =
rlValueFunction with properties:
To check your critic, use the getValue function to return the value of a random observation, using
the current network weights.
v = getValue(critic,{rand(obsInfo.Dimension)})
v = single
0.0017
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).
Create Mixed Observation Space Value Function Critic from Custom Basis Function
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations (in
this case a single number) defined by obsInfo.
The output of the critic is the scalar W'*myBasisFcn(obsA,obsB), where W is a weight column
vector which must have the same size of the custom basis function output. This output is the expected
cumulative long term reward when the agent starts from the given observation and takes the best
possible action. The elements of W are the learnable parameters.
W0 = ones(4,1);
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second argument is the observation specification object.
critic = rlValueFunction({myBasisFcn,W0},obsInfo)
critic =
rlValueFunction with properties:
To check your critic, use the getValue function to return the value of a given observation, using the
current parameter vector.
v = 60
Note that the critic does not enforce the set constraint for the discrete set element.
v = 12
You can now use the critic (along with an actor) to create an agent relying on a discrete value function critic (such as rlACAgent or rlPGAgent).
Version History
Introduced in R2022a
See Also
Functions
rlQValueFunction | rlVectorQValueFunction | rlTable | getActionInfo |
getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
rlValueRepresentation
(Not recommended) Value function critic representation for reinforcement learning agents
Description
This object implements a value function approximator to be used as a critic within a reinforcement
learning agent. A value function is a function that maps an observation to a scalar value. The output
represents the expected total long-term reward when the agent starts from the given observation and
takes the best possible action. Value function critics therefore only need observations (but not
actions) as inputs. After you create an rlValueRepresentation critic, use it to create an agent
relying on a value function critic, such as an rlACAgent, rlPGAgent, or rlPPOAgent. For an
example of this workflow, see “Create Actor and Critic Representations” on page 3-412. For more
information on creating representations, see “Create Policies and Value Functions”.
Creation
Syntax
critic = rlValueRepresentation(net,observationInfo,'Observation',obsName)
critic = rlValueRepresentation(tab,observationInfo)
critic = rlValueRepresentation({basisFcn,W0},observationInfo)
critic = rlValueRepresentation( ___ ,options)
Description
critic = rlValueRepresentation(net,observationInfo,'Observation',obsName)
creates the value function based critic from the deep neural network net. This syntax sets the
ObservationInfo property of critic to the input observationInfo. obsName must contain the
names of the input layers of net.
critic = rlValueRepresentation( ___ ,options) creates the value function based critic
using the additional option set options, which is an rlRepresentationOptions object. This
syntax sets the Options property of critic to the options input argument. You can use this syntax
with any of the previous input-argument combinations.
Input Arguments
Deep neural network used as the underlying approximator within the critic, specified as one of the
following:
The network input layers must be in the same order and with the same data type and dimensions as
the signals defined in ObservationInfo. Also, the names of these input layers must match the
observation names listed in obsName.
For a list of deep neural network layers, see “List of Deep Learning Layers”. For more information on
creating deep neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Observation names, specified as a cell array of strings or character vectors. The observation names
must be the names of the input layers in net. These network layers must be in the same order and
with the same data type and dimensions as the signals defined in ObservationInfo.
Example: {'my_obs'}
Value table, specified as an rlTable object containing a column vector with length equal to the number of possible observations. Element i is the expected cumulative long-term reward when the agent starts from the i-th observation and takes the best possible action. The elements of this vector are the learnable parameters of the representation.
Custom basis function, specified as a function handle to a user-defined function. The user defined
function can either be an anonymous function or a function on the MATLAB path. The output of the
critic is c = W'*B, where W is a weight vector and B is the column vector returned by the custom
basis function. c is the expected cumulative long term reward when the agent starts from the given
observation and takes the best possible action. The learnable parameters of this representation are
the elements of W.
When creating a value function critic representation, your basis function must have the following
signature.
B = myBasisFunction(obs1,obs2,...,obsN)
Here obs1 to obsN are observations in the same order and with the same data type and dimensions
as the signals defined in ObservationInfo.
Example: @(obs1,obs2,obs3) [obs3(1)*obs1(1)^2; abs(obs2(5)+obs1(2))]
Initial value of the basis function weights, W, specified as a column vector having the same length as
the vector returned by the basis function.
Properties
Options — Representation options
rlRepresentationOptions object
Object Functions
rlACAgent Actor-critic reinforcement learning agent
rlPGAgent Policy gradient reinforcement learning agent
rlPPOAgent Proximal policy optimization reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and actions
Examples
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing 4
doubles.
obsInfo = rlNumericSpec([4 1]);
Create a deep neural network to approximate the value function within the critic. The input of the
network (here called myobs) must accept a four-element vector (the observation vector defined by
obsInfo), and the output must be a scalar (the value, representing the expected cumulative long-
term reward when the agent starts from the given observation).
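The network definition itself is not shown in this example. The following is a minimal sketch (the hidden layer size is an assumption); the input layer is named 'myobs' to match the observation name used when creating the critic below.
% Sketch of a value network with input layer 'myobs' (layer sizes assumed)
net = [featureInputLayer(4,'Normalization','none','Name','myobs')
    fullyConnectedLayer(8,'Name','fc')
    reluLayer('Name','relu')
    fullyConnectedLayer(1,'Name','value')];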
Create the critic using the network, observation specification object, and name of the network input
layer.
critic = rlValueRepresentation(net,obsInfo,'Observation',{'myobs'})
critic =
rlValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a random observation, using
the current network weights.
v = getValue(critic,{rand(4,1)})
v = single
0.7904
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent).
Create an actor representation and a critic representation that you can use to define a reinforcement
learning agent such as an Actor Critic (AC) agent.
For this example, create actor and critic representations for an agent that can be trained against the
cart-pole environment described in “Train AC Agent to Balance Cart-Pole System”. First, create the
environment. Then, extract the observation and action specifications from the environment. You need
these specifications to define the agent and critic representations.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
For a state-value-function critic such as those used for AC or PG agents, the inputs are the
observations and the output should be a scalar value, the state value. For this example, create the
critic representation using a deep neural network with one output, and with observation signals
corresponding to x, xdot, theta, and thetadot as described in “Train AC Agent to Balance Cart-
Pole System”. You can obtain the number of observations from the obsInfo specification. Name the
network layer input 'observation'.
numObservation = obsInfo.Dimension(1);
criticNetwork = [
featureInputLayer(numObservation,'Normalization','none','Name','observation')
fullyConnectedLayer(1,'Name','CriticFC')];
Specify options for the critic representation using rlRepresentationOptions. These options
control the learning of the critic network parameters. For this example, set the learning rate to 0.05
and the gradient threshold to 1.
repOpts = rlRepresentationOptions('LearnRate',5e-2,'GradientThreshold',1);
Create the critic representation using the specified neural network and options. Also, specify the
action and observation information for the critic. Set the observation name to 'observation',
which is the name of the criticNetwork input layer.
critic = rlValueRepresentation(criticNetwork,obsInfo,'Observation',{'observation'},repOpts)
critic =
rlValueRepresentation with properties:
Similarly, create a network for the actor. An AC agent decides which action to take given observations
using an actor representation. For an actor, the inputs are the observations, and the output depends
on whether the action space is discrete or continuous. For the actor of this example, there are two
possible discrete actions, –10 or 10. To create the actor, use a deep neural network with the same
observation input as the critic, that can output these two values. You can obtain the number of
actions from the actInfo specification. Name the output 'action'.
numAction = numel(actInfo.Elements);
actorNetwork = [
featureInputLayer(numObservation,'Normalization','none','Name','observation')
fullyConnectedLayer(numAction,'Name','action')];
Create the actor representation using the observation name and specification and the same
representation options.
actor = rlStochasticActorRepresentation(actorNetwork,obsInfo,actInfo,...
'Observation',{'observation'},repOpts)
actor =
rlStochasticActorRepresentation with properties:
agentOpts = rlACAgentOptions(...
'NumStepsToLookAhead',32,...
'DiscountFactor',0.99);
agent = rlACAgent(actor,critic,agentOpts)
agent =
rlACAgent with properties:
For additional examples showing how to create actor and critic representations for different agent
types, see:
Create a finite set observation specification object (or alternatively use getObservationInfo to
extract the specification object from an environment with a discrete observation space). For this
example, define the observation space as a finite set consisting of 4 possible values.
obsInfo = rlFiniteSetSpec([1 3 5 7]);
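The command that creates the value table is not shown here; a minimal sketch using rlTable, which creates one table row per possible observation, is the following.
% Create the value table from the observation specification (implied by the text below)
vTable = rlTable(obsInfo);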
The table is a column vector in which each entry stores the expected cumulative long-term reward for
each possible observation as defined by obsInfo. You can access the table using the Table property
of the vTable object. The initial value of each element is zero.
vTable.Table
ans = 4×1
0
0
0
0
You can also initialize the table to any value, in this case, an array containing all the integers from 1
to 4.
vTable.Table = reshape(1:4,4,1)
vTable =
rlTable with properties:
Create the critic using the table and the observation specification object.
critic = rlValueRepresentation(vTable,obsInfo)
critic =
rlValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a given observation, using the
current table entries.
v = getValue(critic,{7})
v = 4
You can now use the critic (along with an actor) to create an agent relying on a value function critic
(such as rlACAgent or rlPGAgent agent).
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing 4
doubles.
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.
The output of the critic is the scalar W'*myBasisFcn(myobs), where W is a weight column vector
which must have the same size of the custom basis function output. This output is the expected
cumulative long term reward when the agent starts from the given observation and takes the best
possible action. The elements of W are the learnable parameters.
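The basis function definition is not shown in this example. The following anonymous function is consistent with the three-element initial weight vector W0 and with the value displayed at the end of this example.
% Custom basis function returning a three-element column vector
myBasisFcn = @(myobs) [myobs(2)^2; myobs(3)+exp(myobs(1)); abs(myobs(4))];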
W0 = [3;5;2];
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial weight vector. The second argument is the observation specification object.
critic = rlValueRepresentation({myBasisFcn,W0},obsInfo)
critic =
rlValueRepresentation with properties:
To check your critic, use the getValue function to return the value of a given observation, using the
current parameter vector.
v = getValue(critic,{[2 4 6 8]'})
v =
1x1 dlarray
130.9453
You can now use the critic (along with an actor) to create an agent relying on a value function critic (such as rlACAgent or rlPGAgent).
env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);
Create a recurrent deep neural network for the critic. To create a recurrent neural network, use a
sequenceInputLayer as the input layer and include at least one lstmLayer.
criticNetwork = [
sequenceInputLayer(numObs,'Normalization','none','Name','state')
fullyConnectedLayer(8, 'Name','fc')
reluLayer('Name','relu')
lstmLayer(8,'OutputMode','sequence','Name','lstm')
fullyConnectedLayer(1,'Name','output')];
criticOptions = rlRepresentationOptions('LearnRate',1e-2,'GradientThreshold',1);
critic = rlValueRepresentation(criticNetwork,obsInfo,...
'Observation','state',criticOptions);
Version History
Introduced in R2020a
The following table shows some typical uses of rlValueRepresentation, and how to update your
code with rlValueFunction instead. Each table entry is related to different approximator objects,
the first one uses a neural network, the second one uses a table, the third one uses a basis function.
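As a minimal sketch of the conversion, assuming a network whose observation input layer is named 'obs' and the same obsInfo, tab, basisFcn, and W0 as before, typical calls map as follows.
% Not recommended: rlValueRepresentation
critic = rlValueRepresentation(net,obsInfo,'Observation',{'obs'});
critic = rlValueRepresentation(tab,obsInfo);
critic = rlValueRepresentation({basisFcn,W0},obsInfo);

% Recommended: rlValueFunction
critic = rlValueFunction(net,obsInfo,ObservationInputNames="obs");
critic = rlValueFunction(tab,obsInfo);
critic = rlValueFunction({basisFcn,W0},obsInfo);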
See Also
Functions
rlValueFunction | getActionInfo | getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
3-417
3 Objects
rlVectorQValueFunction
Vector Q-value function approximator for reinforcement learning agents
Description
This object implements a vector Q-value function approximator that you can use as a critic with a
discrete action space for a reinforcement learning agent. A vector Q-value function is a function that
maps an environment state to a vector in which each element represents the predicted discounted
cumulative long-term reward when the agent starts from the given state and executes the action
corresponding to the element number. A Q-value function critic therefore needs only the environment
state as input. After you create an rlVectorQValueFunction critic, use it to create an agent such
as rlQAgent, rlDQNAgent, rlSARSAAgent, rlDDPGAgent, or rlTD3Agent. For more information
on creating representations, see “Create Policies and Value Functions”.
Creation
Syntax
critic = rlVectorQValueFunction(net,observationInfo,actionInfo)
critic = rlVectorQValueFunction(net,observationInfo,actionInfo,ObservationInputNames=netObsNames)
critic = rlVectorQValueFunction({basisFcn,W0},observationInfo,actionInfo)
Description
critic = rlVectorQValueFunction(net,observationInfo,actionInfo,ObservationInputNames=netObsNames) specifies the names of the network input layers to be associated with the
environment observation channels. The function assigns, in sequential order, each environment
observation channel specified in observationInfo to the layer specified by the corresponding name
in the string array netObsNames. Therefore, the network input layers, ordered as the names in
netObsNames, must have the same data type and dimensions as the observation channels, as ordered
in observationInfo.
critic = rlVectorQValueFunction({basisFcn,W0},observationInfo,actionInfo)
creates the multi-output Q-value function critic with a discrete action space using a custom basis
function as underlying approximator. The first input argument is a two-element cell array whose first
element is the handle basisFcn to a custom basis function and whose second element is the initial
weight matrix W0. Here the basis function must have only the observations as inputs, and W0 must
have as many columns as the number of possible actions. The function sets the ObservationInfo and
ActionInfo properties of critic to the input arguments observationInfo and actionInfo,
respectively.
Input Arguments
Deep neural network used as the underlying approximator within the critic, specified as one of the following:
Note Among the different network representation options, dlnetwork is preferred, since it has
built-in validation checks and supports automatic differentiation. If you pass another network object
as an input argument, it is internally converted to a dlnetwork object. However, best practice is to
convert other representations to dlnetwork explicitly before using it to create a critic or an actor for
a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any
Deep Learning Toolbox neural network object. The resulting dlnet is the dlnetwork object that you
use for your critic or actor. This practice allows a greater level of insight and control for cases in
which the conversion is not straightforward and might require additional specifications.
The network must have only the observation channels as inputs and a single output layer having as
many elements as the number of possible discrete actions. Each element of the output vector
approximates the value of executing the corresponding action starting from the currently observed
state.
The learnable parameters of the critic are the weights of the deep neural network. For a list of deep
neural network layers, see “List of Deep Learning Layers”. For more information on creating deep
neural networks for reinforcement learning, see “Create Policies and Value Functions”.
Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of character vectors. When you use this argument, the function assigns, in sequential order, each environment observation channel specified in observationInfo to the network input layer with the corresponding name in netObsNames.
Note Of the information specified in observationInfo, the function uses only the data type and
dimension of each channel, but not its (optional) name or description.
Example: {"NetInput1_airspeed","NetInput2_altitude"}
Custom basis function, specified as a function handle to a user-defined MATLAB function. The user
defined function can either be an anonymous function or a function on the MATLAB path. The output
of the critic is the vector c = W'*B, where W is a matrix containing the learnable parameters, and B
is the column vector returned by the custom basis function. Each element of c approximates the value of executing the corresponding action from the observed state.
B = myBasisFunction(obs1,obs2,...,obsN)
Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the
channels defined in observationInfo.
Example: @(obs1,obs2) [obs2(2)*obs1(1)^2; abs(obs2(5))]
Initial value of the basis function weights W, specified as a matrix having as many rows as the length
of the basis function output vector and as many columns as the number of possible actions.
Properties
ObservationInfo — Observation specifications
rlFiniteSetSpec object | rlNumericSpec object | array
ActionInfo — Action specifications
rlFiniteSetSpec object
Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of
the environment action channel, such as its dimensions, data type, and name. Note that the function
does not use the name of the action channel specified in actionInfo.
You can extract ActionInfo from an existing environment or agent using getActionInfo. You can
also construct the specifications manually.
Computation device used to perform operations such as gradient computation, parameter update and
prediction during training and simulation, specified as either "cpu" or "gpu".
The "gpu" option requires both Parallel Computing Toolbox software and a CUDA enabled NVIDIA
GPU. For more information on supported GPUs see “GPU Computing Requirements” (Parallel
Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be
used with MATLAB.
Note Training or simulating an agent on a GPU involves device-specific numerical round-off errors. These errors can produce different results compared to performing the same operations on a CPU.
To speed up training by using parallel processing over multiple cores, you do not need to use this
argument. Instead, when training your agent, use an rlTrainingOptions object in which the
UseParallel option is set to true. For more information about training using multicore processors
and GPUs for training, see “Train Agents Using Parallel Computing and GPUs”.
Example: "gpu"
Object Functions
rlDQNAgent Deep Q-network (DQN) reinforcement learning agent
rlQAgent Q-learning reinforcement learning agent
rlSARSAAgent SARSA reinforcement learning agent
getValue Obtain estimated value from a critic given environment observations and
actions
getMaxQValue Obtain maximum estimated value over all possible actions from a Q-value
function critic with discrete action space, given environment observations
evaluate Evaluate function approximator object given observation (or observation-
action) input data
gradient Evaluate gradient of function approximator object given observation and
action input data
Examples
This example shows how to create a vector Q-value function critic for a discrete action space using a
deep neural network approximator.
This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).
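The specification definitions themselves are not shown here; a minimal sketch consistent with the description is the following.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);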
To approximate the Q-value function within the critic, use a deep neural network. The input of the
network must accept a four-element vector, as defined by obsInfo. The output must be a single
output layer having as many elements as the number of possible discrete actions (three in this case,
as defined by actInfo).
net = [featureInputLayer(4)
fullyConnectedLayer(3)];
net = dlnetwork(net);
summary(net)
Initialized: true
Number of learnables: 15
Inputs:
1 'input' 4 features
Create the critic using the network, as well as the observation and action specification objects. The
network input layers are automatically associated with the components of the observation signals
according to the dimension specifications in obsInfo.
critic = rlVectorQValueFunction(net,obsInfo,actInfo)
critic =
rlVectorQValueFunction with properties:
To check your critic, use getValue to return the values of a random observation, using the current
network weights. There is one value for each of the three possible actions.
v = getValue(critic,{rand(obsInfo.Dimension)})
v = 3×1
    0.7232
    0.8177
   -0.2212
You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).
Create Multi-Output Q-Value Function Critic from Deep Neural Network Specifying Layer
Names
A vector Q-value function critic takes only the observation as input and produces as output a vector
with as many elements as the possible actions. Each element represents the expected cumulative
long term reward when the agent starts from the given observation and takes the action
corresponding to the position of the element in the output vector.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as a
continuous four-dimensional space, so that a single observation is a column vector containing four
doubles.
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).
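As in the previous example, the specification commands are not shown; a sketch consistent with the description is the following.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);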
To approximate the Q-value function within the critic, use a deep neural network. The input of the
network must accept a four-element vector, as defined by obsInfo. The output must be a single
output layer having as many elements as the number of possible discrete actions (three in this case,
as defined by actInfo).
Create the network as an array of layer objects. Name the network input netObsIn (so you can later
explicitly associate it with the observation input channel).
net = [featureInputLayer(4,Name="netObsIn")
fullyConnectedLayer(3,Name="value")];
Convert the network to a dlnetwork object and display the number of its learnable parameters.
net = dlnetwork(net)
net =
dlnetwork with properties:
summary(net)
Initialized: true
Number of learnables: 15
Inputs:
1 'netObsIn' 4 features
Create the critic using the network, the observations specification object, and the name of the
network input layer. The specified network input layer, netObsIn, is associated with the environment
observation, and therefore must have the same data type and dimension as the observation channel
specified in obsInfo.
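The creation command itself does not appear here; based on the syntax described earlier, it is presumably a call such as the following.
critic = rlVectorQValueFunction(net,obsInfo,actInfo, ...
    ObservationInputNames="netObsIn")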
critic =
rlVectorQValueFunction with properties:
To check your critic, use the getValue function to return the values of a random observation, using
the current network weights. The function returns one value for each of the three possible actions.
v = getValue(critic,{rand(obsInfo.Dimension)})
v = 3×1
    0.7232
    0.8177
   -0.2212
You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as rlQAgent, rlDQNAgent, or rlSARSAAgent).
This critic takes only the observation as input and produces as output a vector with as many elements
as the possible actions. Each element represents the expected cumulative long term reward when the
agent starts from the given observation and takes the action corresponding to the position of the
element in the output vector.
Create an observation specification object (or alternatively use getObservationInfo to extract the
specification object from an environment). For this example, define the observation space as
consisting of two channels: the first is a two-by-two continuous matrix and the second is a scalar that can assume only two values, 0 and 1.
Create a finite set action specification object (or alternatively use getActionInfo to extract the
specification object from an environment with a discrete action space). For this example, define the
action space as a finite set consisting of three possible vectors, [1 2], [3 4], and [5 6].
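The specification definitions are not shown here; a minimal sketch consistent with the description is the following.
obsInfo = [rlNumericSpec([2 2]) rlFiniteSetSpec([0 1])];
actInfo = rlFiniteSetSpec({[1 2],[3 4],[5 6]});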
Create a custom basis function to approximate the value function within the critic. The custom basis
function must return a column vector. Each vector element must be a function of the observations
defined by obsInfo.
The output of the critic is the vector c = W'*myBasisFcn(obsA,obsB), where W is a weight matrix
which must have as many rows as the length of the basis function output and as many columns as the
number of possible actions.
Each element of c is the expected cumulative long term reward when the agent starts from the given
observation and takes the action corresponding to the position of the considered element. The
elements of W are the learnable parameters.
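The basis function definition is not shown here. The following is a hypothetical definition consistent with the two observation channels and the four-row weight matrix W0; because W0 is random, the values displayed below come from the original example and do not necessarily match this sketch.
% Hypothetical basis function of both observation channels (four-element output)
myBasisFcn = @(obsA,obsB) [obsA(1,1)+obsB; obsA(2,1)-obsB; obsA(1,2)^2; abs(obsA(2,2))];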
W0 = rand(4,3);
Create the critic. The first argument is a two-element cell containing both the handle to the custom
function and the initial parameter matrix. The second and third arguments are, respectively, the
observation and action specification objects.
critic = rlVectorQValueFunction({myBasisFcn,W0},obsInfo,actInfo)
critic =
rlVectorQValueFunction with properties:
To check your critic, use the getValue function to return the values of a random observation, using
the current parameter matrix. The function returns one value for each of the three possible actions.
v = getValue(critic,{rand(2,2),0})
v = 3×1
1.3192
0.8420
1.5053
Note that the critic does not enforce the set constraint for the discrete set elements.
v = getValue(critic,{rand(2,2),-1})
v = 3×1
2.7890
1.8375
3.0855
You can now use the critic (along with an actor) to create a discrete action space agent relying on a
Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).
Version History
Introduced in R2022a
See Also
Functions
rlValueFunction | rlQValueFunction | getActionInfo | getObservationInfo
Topics
“Create Policies and Value Functions”
“Reinforcement Learning Agents”
3-426
scalingLayer
Scaling layer for actor or critic network
Description
A scaling layer linearly scales and biases an input array U, giving an output Y = Scale.*U + Bias.
You can incorporate this layer into the deep neural networks you define for actors or critics in
reinforcement learning agents. This layer is useful for scaling and shifting the outputs of nonlinear
layers, such as tanhLayer and sigmoidLayer.
For instance, a tanhLayer gives bounded output that falls between –1 and 1. If your actor network
output has different bounds (as defined in the actor specification), you can include a ScalingLayer
as an output to scale and shift the actor network output appropriately.
Creation
Syntax
sLayer = scalingLayer
sLayer = scalingLayer(Name,Value)
Description
sLayer = scalingLayer(Name,Value) sets properties on page 3-427 using name-value pairs. For
example, scalingLayer('Scale',0.5) creates a scaling layer that scales its input by 0.5. Enclose
each property name in quotes.
Properties
Name — Name of layer
'scaling' (default) | character vector
Name of layer, specified as a character vector. To include a layer in a layer graph, you must specify a
nonempty unique layer name. If you train a series network with this layer and Name is set to '', then
the software automatically assigns a name to the layer at training time.
Description of layer, specified as a character vector. When you create the scaling layer, you can use
this property to give it a description that helps you identify its purpose.
Element-wise scale on the input to the scaling layer, specified as one of the following:
• Scalar — Specify the same scale factor for all elements of the input array.
• Array with the same dimensions as the input array — Specify different scale factors for each
element of the input array.
The scaling layer takes an input U and generates the output Y = Scale.*U + Bias.
Element-wise bias on the input to the scaling layer, specified as one of the following:
• Scalar — Specify the same bias for all elements of the input array.
• Array with the same dimensions as the input array — Specify a different bias for each element of
the input array.
The scaling layer takes an input U and generates the output Y = Scale.*U + Bias.
Examples
Create a scaling layer that converts an input array U to the output array Y = 0.1.*U - 0.4.
sLayer = scalingLayer('Scale',0.1,'Bias',-0.4)
sLayer =
ScalingLayer with properties:
Name: 'scaling'
Scale: 0.1000
Bias: -0.4000
Learnable Parameters
No properties.
State Parameters
No properties.
Confirm that the scaling layer scales and offsets an input array as expected.
predict(sLayer,[10,20,30])
ans = 1×3
    0.6000    1.6000    2.6000
You can incorporate sLayer into an actor network or critic network for reinforcement learning.
Assume that the layer preceding the scalingLayer is a tanhLayer with three outputs aligned along the first dimension, and that you want to apply a different scale factor and bias to each output using a scalingLayer.
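The scale and bias vectors are not defined in the text above; the following values are consistent with the result displayed below.
% Per-element scale and bias for the three tanhLayer outputs
scale = [2.5; 0.4; 10];
bias  = [5; 0; -50];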
sLayer = scalingLayer('Scale',scale,'Bias',bias);
Confirm that the scaling layer applies the correct scale and bias values to an array with the expected
dimensions.
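The prediction call is likewise not shown; with the values above, applying the layer to a three-element input of tens reproduces the displayed result.
predict(sLayer,[10;10;10])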
ans = 3×1
30
4
50
Version History
Introduced in R2019a
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
See Also
quadraticLayer | softplusLayer
Topics
“Train DDPG Agent to Swing Up and Balance Pendulum”
“Create Policies and Value Functions”
3-429
3 Objects
SimulinkEnvWithAgent
Reinforcement learning environment with a dynamic model implemented in Simulink
Description
The SimulinkEnvWithAgent object represents a reinforcement learning environment that uses a
dynamic model implemented in Simulink. The environment object acts as an interface such that when
you call sim or train, these functions in turn call the Simulink model to generate experiences for
the agents.
Creation
To create a SimulinkEnvWithAgent object, use one of the following functions.
• rlSimulinkEnv — Create an environment using a Simulink model with at least one RL Agent
block.
• createIntegratedEnv — Use a reference model as a reinforcement learning environment.
• rlPredefinedEnv — Create a predefined reinforcement learning environment.
Properties
Model — Simulink model name
string | character vector
Simulink model name, specified as a string or character vector. The specified model must contain one
or more RL Agent blocks.
If Model contains a single RL Agent block for training, then AgentBlock is a string containing the
block path.
If Model contains multiple RL Agent blocks for training, then AgentBlock is a string array, where
each element contains the path of one agent block.
Model can contain RL Agent blocks whose path is not included in AgentBlock. Such agent blocks
behave as part of the environment and select actions based on their current policies. When you call
sim or train, the experiences of these agents are not returned and their policies are not updated.
The agent blocks can be inside of a model reference. For more information on configuring an agent
block for reinforcement learning, see RL Agent.
Reset behavior for the environment, specified as a function handle or anonymous function handle.
The function must have a single Simulink.SimulationInput input argument and a single
Simulink.SimulationInput output argument.
The reset function sets the initial state of the Simulink environment. For example, you can create a
reset function that randomizes certain block states such that each training episode begins from
different initial conditions.
If you have an existing reset function myResetFunction on the MATLAB path, set ResetFcn using a
handle to the function.
env.ResetFcn = @(in)myResetFunction(in);
If your reset behavior is simple, you can implement it using an anonymous function handle. For
example, the following code sets the variable x0 to a random value.
env.ResetFcn = @(in) setVariable(in,'x0',rand());
The sim function calls the reset function to reset the environment at the start of each simulation, and
the train function calls it at the start of each training episode.
Option to toggle fast restart, specified as either "on" or "off". Fast restart allows you to perform
iterative simulations without compiling a model or terminating the simulation each time.
For more information on fast restart, see “How Fast Restart Improves Iterative Simulations”
(Simulink).
Object Functions
train Train reinforcement learning agents within a specified environment
sim Simulate trained reinforcement learning agents within specified environment
getObservationInfo Obtain observation data specifications from reinforcement learning
environment, agent, or experience buffer
getActionInfo Obtain action data specifications from reinforcement learning environment,
agent, or experience buffer
Examples
Create a Simulink environment using the trained agent and corresponding Simulink model from the
“Create Simulink Environment and Train Agent” example.
Create an environment for the rlwatertank model, which contains an RL Agent block. Since the
agent used by the block is already in the workspace, you do not need to pass the observation and
action specifications to create the environment.
env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent')
env =
SimulinkEnvWithAgent with properties:
Model : rlwatertank
AgentBlock : rlwatertank/RL Agent
ResetFcn : []
UseFastRestart : on
Validate the environment by performing a short simulation for two sample times.
validateEnvironment(env)
You can now train and simulate the agent within the environment by using train and sim,
respectively.
For this example, consider the rlSimplePendulumModel Simulink model. The model is a simple
frictionless pendulum that initially hangs in a downward position.
mdl = 'rlSimplePendulumModel';
open_system(mdl)
Create rlNumericSpec and rlFiniteSetSpec objects for the observation and action information,
respectively.
The observation is a vector containing three signals: the sine, cosine, and time derivative of the
angle.
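The creation command is not shown here; given the properties displayed below, it is presumably the following.
obsInfo = rlNumericSpec([3 1])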
obsInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: [0×0 string]
Description: [0×0 string]
Dimension: [3 1]
DataType: "double"
The action is a scalar expressing the torque and can be one of three possible values, -2 Nm, 0 Nm and
2 Nm.
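Again, the creation command is not shown; given the description, it is presumably the following.
actInfo = rlFiniteSetSpec([-2 0 2])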
actInfo =
rlFiniteSetSpec with properties:
You can use dot notation to assign property values for the rlNumericSpec and rlFiniteSetSpec
objects.
obsInfo.Name = 'observations';
actInfo.Name = 'torque';
Assign the agent block path information, and create the reinforcement learning environment for the
Simulink model using the information extracted in the previous steps.
agentBlk = [mdl '/RL Agent'];
env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : []
UseFastRestart : on
You can also include a reset function using dot notation. For this example, randomly initialize theta0
in the model workspace.
env.ResetFcn = @(in) setVariable(in,'theta0',randn,'Workspace',mdl)
env =
SimulinkEnvWithAgent with properties:
Model : rlSimplePendulumModel
AgentBlock : rlSimplePendulumModel/RL Agent
ResetFcn : @(in)setVariable(in,'theta0',randn,'Workspace',mdl)
UseFastRestart : on
Create an environment for the Simulink model from the example “Train Multiple Agents to Perform
Collaborative Task”.
Create an environment for the rlCollaborativeTask model, which has two agent blocks. Since the
agents used by the two blocks (agentA and agentB) are already in the workspace, you do not need
to pass their observation and action specifications to create the environment.
env = rlSimulinkEnv( ...
'rlCollaborativeTask', ...
["rlCollaborativeTask/Agent A","rlCollaborativeTask/Agent B"])
env =
SimulinkEnvWithAgent with properties:
Model : rlCollaborativeTask
AgentBlock : [
rlCollaborativeTask/Agent A
rlCollaborativeTask/Agent B
]
ResetFcn : []
UseFastRestart : on
You can now simulate or train the agents within the environment using sim or train, respectively.
This example shows how to use createIntegratedEnv to create an environment object starting from a Simulink model that implements the system with which the agent interacts. Such a system is often referred to as the plant, open-loop system, or reference system, while the whole (integrated) system including the agent is often referred to as the closed-loop system.
For this example, use the flying robot model described in “Train DDPG Agent to Control Flying
Robot” as the reference (open-loop) system.
% sample time
Ts = 0.4;
Create the Simulink model myIntegratedEnv containing the flying robot model connected in a
closed loop to the agent block. The function also returns the reinforcement learning environment
object env to be used for training.
env = createIntegratedEnv('rlFlyingRobotEnv','myIntegratedEnv')
env =
SimulinkEnvWithAgent with properties:
Model : myIntegratedEnv
AgentBlock : myIntegratedEnv/RL Agent
ResetFcn : []
UseFastRestart : on
The function can also return the block path to the RL Agent block in the new integrated model, as
well as the observation and action specifications for the reference model.
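The call that returns these additional outputs is not shown here; based on the createIntegratedEnv syntax, it is presumably a call such as the following.
[env,agentBlk,observationInfo,actionInfo] = ...
    createIntegratedEnv('rlFlyingRobotEnv','myIntegratedEnv')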
agentBlk =
'myIntegratedEnv/RL Agent'
observationInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "observation"
Description: [0x0 string]
Dimension: [7 1]
DataType: "double"
actionInfo =
rlNumericSpec with properties:
LowerLimit: -Inf
UpperLimit: Inf
Name: "action"
Description: [0x0 string]
Dimension: [2 1]
DataType: "double"
Returning the block path and specifications is useful in cases in which you need to modify
descriptions, limits, or names in observationInfo and actionInfo. After modifying the
specifications, you can then create an environment from the integrated model IntegratedEnv using
the rlSimulinkEnv function.
Version History
Introduced in R2019a
See Also
Functions
rlSimulinkEnv | rlPredefinedEnv | train | sim | rlNumericSpec | rlFiniteSetSpec
Blocks
RL Agent
Topics
“Create Simulink Reinforcement Learning Environments”
3-436
softplusLayer
Softplus layer for actor or critic network
Description
A softplus layer applies the softplus activation function Y = log(1 + e^X), which ensures that the output
is always positive. This activation function is a smooth continuous version of reluLayer. You can
incorporate this layer into the deep neural networks you define for actors in reinforcement learning
agents. This layer is useful for creating continuous Gaussian policy deep neural networks, for which
the standard deviation output must be positive.
Creation
Syntax
sLayer = softplusLayer
sLayer = softplusLayer(Name,Value)
Description
Properties
Name — Name of layer
'softplus' (default) | character vector
Name of layer, specified as a character vector. To include a layer in a layer graph, you must specify a
nonempty unique layer name. If you train a series network with this layer and Name is set to '', then
the software automatically assigns a name to the layer at training time.
Description of layer, specified as a character vector. When you create the softplus layer, you can use
this property to give it a description that helps you identify its purpose.
Examples
sLayer = softplusLayer;
You can specify the name of the softplus layer. For example, if the softplus layer represents the
standard deviation of a Gaussian policy deep neural network, you can specify an appropriate name.
sLayer = softplusLayer('Name','stddev')
sLayer =
SoftplusLayer with properties:
Name: 'stddev'
Learnable Parameters
No properties.
State Parameters
No properties.
You can incorporate sLayer into an actor network for reinforcement learning.
Version History
Introduced in R2020a
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
See Also
quadraticLayer | scalingLayer
Topics
“Create Policies and Value Functions”
3-438
4
Blocks
4 Blocks
RL Agent
Reinforcement learning agent
Library: Reinforcement Learning Toolbox
Description
Use the RL Agent block to simulate and train a reinforcement learning agent in Simulink. You
associate the block with an agent stored in the MATLAB workspace or a data dictionary, such as an
rlACAgent or rlDDPGAgent object. You connect the block so that it receives an observation and a
computed reward. For instance, consider the following block diagram of the
rlSimplePendulumModel model.
The observation input port of the RL Agent block receives a signal that is derived from the
instantaneous angle and angular velocity of the pendulum. The reward port receives a reward
calculated from the same two values and the applied action. You configure the observations and
reward computations that are appropriate to your system.
The block uses the agent to generate an action based on the observation and reward you provide.
Connect the action output port to the appropriate input for your system. For instance, in the
rlSimplePendulumModel, the action output port is a torque applied to the pendulum system. For
more information about this model, see “Train DQN Agent to Swing Up and Balance Pendulum”.
To train a reinforcement learning agent in Simulink, you generate an environment from the Simulink
model. You then create and configure the agent for training against that environment. For more
information, see “Create Simulink Reinforcement Learning Environments”. When you call train
using the environment, train simulates the model and updates the agent associated with the block.
Ports
Input
This port receives observation signals from the environment. Observation signals represent
measurements or other instantaneous system data. If you have multiple observations, you can use a
Mux block to combine them into a vector signal. To use a nonvirtual bus signal, use bus2RLSpec.
This port receives the reward signal, which you compute based on the observation data. The reward
signal is used during agent training to maximize the expectation of the long-term reward.
Use this signal to specify conditions under which to terminate a training episode. You must configure
logic appropriate to your system to determine the conditions for episode termination. One application
is to terminate an episode that is clearly going well or going poorly. For instance, you can terminate
an episode if the agent reaches its goal or goes irrecoverably far from its goal.
Use this signal to provide an external action to the block. This signal can be a control action from a human expert, which can be used for safe or imitation learning applications. When the value of the use external action signal is 1, the block passes the external action signal to the environment through the action block output. The block also uses the external action to update the agent policy based on the resulting observations and rewards.
Dependencies
For some applications, the action applied to the environment can differ from the action output by the
RL Agent block. For example, the Simulink model can contain a saturation block on the action output
signal.
In such cases, to improve learning results, you can enable this input port and connect the actual
action signal that is applied to the environment.
Note The last action port should be used only with off-policy agents; otherwise, training can produce unexpected results.
Dependencies
Use this signal to pass the external action signal to the environment.
When the value of the use external action signal is 1, the block passes the external action signal to the environment. The block also uses the external action to update the agent policy.
When the value of the use external action signal is 0, the block does not pass the external action signal to the environment and does not update the policy using the external action. Instead, the block outputs the action from the agent policy.
Dependencies
Output
Action computed by the agent based on the observation and reward inputs. Connect this port to the
inputs of your system. To use a nonvirtual bus signal, use bus2RLSpec.
Note Continuous action-space agents such as rlACAgent, rlPGAgent, or rlPPOAgent (the ones
using an rlContinuousGaussianActor object), do not enforce constraints set by the action
specification. In these cases, you must enforce action space constraints within the environment.
Cumulative sum of the reward signal during simulation. Observe or log this signal to track how the
cumulative reward evolves over time.
Dependencies
Parameters
Agent object — Agent to train
agentObj (default) | agent object
Enter the name of an agent object stored in the MATLAB workspace or a data dictionary, such as an
rlACAgent or rlDDPGAgent object. For information about agent objects, see “Reinforcement
Learning Agents”.
If the RL Agent block is within a conditionally executed subsystem, such as a Triggered Subsystem or
a Function-Call Subsystem, you must specify the sample time of the agent object as -1 so that the
block can inherit the sample time of its parent subsystem.
Programmatic Use
Block Parameter: Agent
Type: string, character vector
Default: "agentObj"
Generate a Policy block that implements a greedy policy for the agent specified in Agent object by calling the generatePolicyBlock function. To generate a greedy policy, the block sets the UseExplorationPolicy property of the agent to false before generating the policy block.
The generated block is added to a new Simulink model and the policy data is saved in a MAT-file in
the current working folder.
Enable the external action and use external action block input ports by selecting this parameter.
Programmatic Use
Block Parameter: ExternalActionAsInput
Type: string, character vector
Values: "off" | "on"
Default: "off"
Last action input — Add input ports for last action applied to environment
off (default) | on
Enable the last action block input port by selecting this parameter.
Programmatic Use
Block Parameter: ProvideLastAction
Type: string, character vector
Values: "off" | "on"
Default: "off"
Use strict observation data types — Enforce strict data types for observations
off (default) | on
Select this parameter to enforce the observation data types. In this case, if the data type of the signal
connected to the observation input port does not match the data type in the ObservationInfo
property of the agent, the block attempts to cast the signal to the correct data type. If casting the
data type is not possible, the block generates an error.
Enforcing strict observation data types:
• Lets you validate that the block is getting the correct data types.
• Allows other blocks to inherit their data type from the observation port.
Programmatic Use
Block Parameter: UseStrictObservationDataTypes
Type: string, character vector
Values: "off" | "on"
Default: "off"
Version History
Introduced in R2019a
See Also
Functions
bus2RLSpec | createIntegratedEnv
Blocks
Policy
Topics
“Create Simulink Reinforcement Learning Environments”
“Create Simulink Environment and Train Agent”
4-6
Policy
Reinforcement learning policy
Library: Reinforcement Learning Toolbox
Description
Use the Policy block to simulate a reinforcement learning policy in Simulink and to generate code
(using Simulink Coder) for deployment purposes. This block takes an observation as input and
outputs an action. You associate the block with a MAT-file that contains the information needed to
fully characterize the policy, and which can be generated by generatePolicyFunction or
generatePolicyBlock.
Ports
Input
This port receives observation signals from the environment. Observation signals represent
measurements or other instantaneous system data. If you have multiple observations, you can use a
Mux block to combine them into a vector signal. To use a nonvirtual bus signal, use bus2RLSpec.
Output
Action computed by the policy based on the observation input. Connect this port to the inputs of your
system. To use a nonvirtual bus signal, use bus2RLSpec.
Parameters
Policy block data MAT file — Policy block data MAT file
blockAgentData.mat (default) | file name
Enter the name of the MAT-file containing the information needed to fully characterize the policy. This
file is generated by generatePolicyFunction or generatePolicyBlock. When you generate the
block using generatePolicyBlock and specify a non-default dataFileName argument, then the
generated block has this parameter set to the specified file name, so that the block is associated with
the generated data file.
To use a Policy block within a conditionally executed subsystem, such as a Triggered Subsystem or a
Function-Call Subsystem, you must generate its data file from an agent or policy object which has its
SampleTime property set to -1. Doing so allows the block to inherit the sample time of its parent
subsystem.
Programmatic Use
Block Parameter: MATFile
Type: string, character vector
Default: "blockAgentData.mat"
Version History
Introduced in R2022b
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using Simulink® Coder™.
See Also
Functions
bus2RLSpec | createIntegratedEnv | generatePolicyFunction | generatePolicyBlock
Blocks
RL Agent
Topics
“Create Policies and Value Functions”
“Create Simulink Reinforcement Learning Environments”
“Create Simulink Environment and Train Agent”
4-8