Journal of Intelligent & Robotic Systems
https://doi.org/10.1007/s10846-018-0968-4
Contextual Direct Policy Search
With Regularized Covariance Matrix Estimation
Abbas Abdolmaleki1 · David Simões1 · Nuno Lau1 · Luís Paulo Reis2 · Gerhard Neumann3,4
Received: 14 December 2017 / Accepted: 3 December 2018
© Springer Nature B.V. 2019
Abstract
Stochastic search and optimization techniques are used in a vast number of areas, ranging from refining the design of vehicles and determining the effectiveness of new drugs to developing efficient strategies in games and learning proper behaviors in robotics. However, they specialize for the specific problem they are solving, and if the problem's context changes slightly, they cannot adapt properly. In fact, they require complete re-learning in order to perform correctly in new, unseen scenarios, regardless of how similar those are to previously learned environments. Contextual algorithms have recently emerged as solutions to this problem. They learn the policy for a task that depends on a given context, such that widely different contexts belonging to the same task are learned simultaneously. That being said, the state-of-the-art proposals of this class of algorithms converge prematurely, and simply cannot compete with algorithms that learn a policy for a single context. We describe the Contextual Relative Entropy Policy Search (CREPS) algorithm, which belongs to the aforementioned class of contextual algorithms. We extend it with a technique that substantially increases its performance, and call the result Contextual Relative Entropy Policy Search with Covariance Matrix Adaptation (CREPS-CMA). We propose two variants, and demonstrate their behavior on a set of classic contextual optimization problems and on complex simulated robot tasks.
Keywords Multi-task learning · Stochastic policy search · Contextual learning · Covariance matrix adaptation
1 Introduction
A black-box objective function which depends on a high-dimensional parameter vector can be optimized by gradient-free black-box optimizers, such as stochastic search machine learning algorithms. By black-box, we mean that
This paper is an extended version of an ICARSC 2016 paper [5].
The second author is supported by Fundação para a Ciência e a
Tecnologia under grant PD/BD/113963/2015. The work was also
partially funded by the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 and by FCT –
Portuguese Foundation for Science and Technology under projects
PEst-OE/EEI/UI0027/2013 and UID/CEC/00127/2013 (IEETA).
The work was also funded by project EuRoC, reference 608849
from call FP7-2013-NMP-ICT-FOF.
Abbas Abdolmaleki
abbas.a@ua.pt
Extended author information available on the last page of the article.
we do not possess access to this function or its properties,
such as analytical gradient or second order information. We
can only sample from this function, which may take significant amounts of time, and might have a high degree
of variance. Therefore, we need algorithms that do not
obtain gradient information or make assumptions on the
form of the function (e.g., being linear or quadratic), but
instead evaluate the performance function value for a specific parameter vector through the return of an episode. In
the context of robotics, the return is computed by generating a roll-out or trajectory (also known as episode) on the
real robot platform by following the control policy of the
robot with a given parameter vector. The trajectory is subsequently evaluated by an objective function. For example,
an episode would be the entire experiment of kicking a
ball, while its return would be the distance traveled by the
ball, which we want to maximize. One class of methods, called stochastic search algorithms, for example, refines populations of candidate solutions (known as individuals) to a given problem. First, a population of random
individuals is generated. Each candidate is then evaluated,
and the algorithm selects, breeds and mutates a new generation of individuals based on their evaluation. The new generation replaces the older one, and the process is repeated.
This class of algorithms tries to optimize a set of
parameters, and maintains a distribution over them. Samples
are drawn from this search distribution, and then tested
with respect to their performance. The distribution is then
updated based on the samples and their corresponding
evaluations. There many different update rules such as path
integrals [25, 28], gradient based updates [23, 27], crossentropy methods [19], information-theoretic policy updates
[1, 2, 18], or evolutionary strategies [12]. However, most of
these algorithms just optimize for a fixed objective function,
and multi-task learning is not supported by many of them.
In such a setting, one can naively learn one task at a
time independently. However, if the tasks are similar, we
can exploit the correlation of the tasks to speed up the
learning substantially. We use a context vector to describe a
task, changing between different executions, but remaining
constant throughout each task’s episode. Our goal, known
as contextual learning, is to learn a policy that can handle
multiple contexts for the same task. In robotics, consider controllers for robotic walking, where different speeds on different surfaces or inclinations require different models to be learned, or ball-kicking behaviors, where
the physical properties of the ball or the kick’s target
(distance, accuracy, etc.) may change. As another example,
consider learning to retrieve relevant ads, given a context
vector which describes features of the current user, or rating
search results according to the user profile and search query.
This allows a further degree of customization in on-line
services.
For contextual learning and generalization between the
tasks, one could independently optimize for several target
contexts, like optimizing a ball kick for different distances
(contexts). Optimized contexts could then be generalized
through regression methods to new unseen contexts [21, 29].
However, it is time-consuming to do this for many different unseen contexts, as well as sample-inefficient, since we do not share any information between contexts to speed up the learning process.
Learning a set of parameters for multiple simultaneous
tasks is known as contextual policy search [16, 18]. Such
concurrent learning enables the algorithm to transfer the
knowledge across the context space, which speeds up
learning of the desired context-dependent policy. Such a policy allows us to adapt quickly to new situations by generalizing learned tasks from similar contexts to the new
context.
Several algorithms have succeeded in the multi-task
learning field, such as contextual policy search algorithms based on information theory [22]. One of these algorithms, named Contextual Relative Entropy Policy Search (CREPS) [9, 18], maintains a Gaussian search distribution and iteratively computes its parameters (mean and covariance matrix). However, we will show that the search distribution might collapse prematurely to a point-estimate before finding a good solution, due to the covariance matrix update rule of CREPS. This results in premature convergence, which prevents the algorithm from being fully effective and competitive.
Other stochastic search algorithms [12, 27], such as CMA-ES and NES, or commonly used policy search methods [17, 25], do not typically suffer from premature convergence, although they also lack multi-task learning
capabilities. Therefore, to solve the premature convergence
problem of CREPS, we use the rank-μ covariance matrix
update rule of CMA-ES [12] along with CREPS. CMA-ES
does not have the multi-task learning feature, despite being
considered state of the art in stochastic search and being
highly competitive with other stochastic optimizers.
We propose a new contextual stochastic search algorithm, inspired by CREPS and CMA-ES, which can learn multiple tasks without premature convergence. We name our proposal Contextual Relative Entropy Policy Search with Covariance Matrix Adaptation (CREPS-CMA), and we evaluate two of its variants on standard contextual optimization problems and on robotics problems in simulated environments. We will demonstrate how the premature convergence issue is solved by CREPS-CMA, which outperforms the original CREPS by orders of magnitude.
1.1 Problem Statement
Stochastic search is a class of methods for optimization, for example, finding a set of parameters for a humanoid robot controller such that the robot walks as fast as possible. We consider the objective function R(θ): ℝⁿ → ℝ and optimize it for the parameter vector θ ∈ ℝⁿ. We assume our objective function is a black box, and therefore the only accessible information is the reward obtained for a given parameter vector. In other words, we do not assume analytical first-order or second-order information. More concretely, our goal is to find the best solution θ* in a search space S ⊆ ℝⁿ, i.e.,

$$\theta^* = \underset{\theta \in S}{\operatorname{argmax}}\; R(\theta),$$

where R: ℝⁿ → ℝ is our objective function (also known as the fitness function). Considering our robot walking example, the search point θ would be the parameters of the robot's walking controller, and R(θ) is the speed the robot achieves when using the parameters θ.
So far we considered a setting where the objective function is fixed and the solution is a point estimate θ* in the search space. Now we consider a setting where the objective function changes depending on the context of the task. In our robot locomotion task, for example, in the contextual setting we are interested not only in the controller parameters for the highest possible speed, but in controller parameters for any feasible speed s. In such a setting the solution is not a single point θ but a function of the context s, i.e., m(s). In the contextual setting, we represent the context with a vector s, and we aim to find a function that, for each given context s, outputs the optimal vector θ given a context-dependent objective function R(θ, s). In this paper, we consider contextual optimization problems where the objective function depends on an m-dimensional context vector s. For each possible context vector s drawn from some unknown context distribution, we would like to compute the optimal parameter vector θ*_s, such that the objective function R(s, θ): ℝᵐ × ℝⁿ → ℝ is optimized. Since we assume a continuous context space, our goal is to find an optimal policy m*(s). We assume R(θ, s) is a black box, and therefore we only have access to the evaluations {R^[k]}_{k=1...N} of samples {s^[k], θ^[k]}_{k=1...N}, where k denotes the sample index and N the number of samples.

In the context of robotics, the reward (objective value) of a parameter vector is computed by generating a roll-out or trajectory on the real robot platform by following the control policy of the robot with the given parameter vector. The reward for the trajectory is then given by the sum of the immediate rewards collected at each time step throughout the trajectory.
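To make the setting concrete, the following minimal Python sketch builds a contextual black-box objective in the style of the test functions of Section 5: a contextual Sphere reward whose optimum is shifted linearly by the context through x = θ + As. All names and constants are illustrative assumptions, not taken from the paper's code; the sign is flipped so that larger values are better, matching the maximization convention above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 15, 1                      # parameter and context dimensions (as in Section 5)
A = rng.normal(size=(n, m))       # fixed random matrix shifting the optimum per context

def R(theta, s):
    """Contextual Sphere reward: R(theta, s) = -||theta + A s||^2,
    maximized at the context-dependent optimum theta* = -A s."""
    x = theta + A @ s
    return -np.sum(x ** 2)

s = rng.uniform(0.0, 3.0, size=m)     # contexts drawn uniformly, as in the experiments
print(R(-A @ s, s))                   # -0.0: the optimum for this context
```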
1.2 Contextual Stochastic Search
In this section we explain the general procedure of contextual stochastic search algorithms. Contextual stochastic search algorithms maintain a conditional search distribution π(θ|s) over the parameter space. In this paper, we model the search distribution π(θ|s) as a linear Gaussian policy, i.e.,

$$\pi(\theta|s) = \mathcal{N}\left(\theta \,\middle|\, m_\pi(s), \Sigma_\pi\right), \qquad (1)$$

where m_π(s) is a context-dependent mean function and Σ_π is a context-independent full covariance matrix. The function m_π(s) is our current estimate of the context-dependent policy, and Σ_π is effectively the search distribution's exploration over the parameter space. Please note that we use a full covariance matrix, which enables us to model the correlations between parameters. We use a linear model for the mean function, m_π(s) = A_π^T ϕ(s), where A_π is the matrix of coefficients for the linear terms, and ϕ(s) is an arbitrary feature function of the context s.
If we select the feature function as ϕ(s) = [1, s], the generalization is linear across the contexts. Alternatively, one could use non-linear feature functions, such as radial basis functions (RBFs) [7], which allow for non-linear generalization over contexts. Such controllers can represent more complex policies, for example, for a robot locomotion task [4].

In the contextual setting, in each iteration we use the current search distribution q(θ|s) to generate samples θ^[k] of the parameter vector θ given context samples s^[k] (context samples generally depend on the task and come from the environment; for simplicity, we draw them from a uniform distribution throughout this paper). Next, we use the objective function R(θ, s) to evaluate the rewards R^[k] of {s^[k], θ^[k]}. Subsequently, a weight d^[k] for each sample k is computed using the samples {s^[k], θ^[k], R^[k]}_{k=1...N}. Finally, the search distribution π(θ|s) (a Gaussian distribution) is estimated using {s^[k], θ^[k], d^[k]}_{k=1...N}. The algorithm repeats this procedure iteratively until it converges to a solution. Algorithm 1 shows exemplary pseudo-code for a stochastic search algorithm with contextual capabilities.
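As an illustration of the feature functions mentioned above, the following hedged sketch implements the linear feature map ϕ(s) = [1, s] and a simple Gaussian RBF alternative; the centers and the width are free design choices, not values from the paper.

```python
import numpy as np

def linear_features(s):
    """phi(s) = [1, s]: linear generalization across contexts."""
    return np.concatenate(([1.0], np.atleast_1d(s)))

def rbf_features(s, centers, width=0.5):
    """Gaussian radial basis features over the context space, allowing
    non-linear generalization; centers and width are free choices."""
    s = np.atleast_1d(s)
    sq_dist = np.sum((centers - s) ** 2, axis=1)
    return np.concatenate(([1.0], np.exp(-0.5 * sq_dist / width ** 2)))

centers = np.linspace(0.0, 3.0, 5)[:, None]   # 5 centers on a 1-D context range
print(linear_features(1.5), rbf_features(1.5, centers).round(3))
```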
Algorithm 1 Contextual stochastic search algorithm

Repeat
  Set q(θ|s) to π(θ|s)
  Generate context samples {s^[k]}_{k=1...N}
  Sample parameters {θ^[k]}_{k=1...N} from the current search distribution q(θ|s) given the context samples {s^[k]}_{k=1...N}
  Evaluate the reward R^[k] of each sample in the sample set {s^[k], θ^[k]}_{k=1...N}
  Use the data set {s^[k], R^[k]}_{k=1...N} to compute a weight d^[k] for each sample
  Use the data set {s^[k], θ^[k], d^[k]}_{k=1...N} to update the new search distribution π(θ|s)
Until the search distribution π(θ|s) converges
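The following Python skeleton is one possible rendering of Algorithm 1 under the linear-Gaussian model of Eq. 1. The callables compute_weights and fit_distribution are placeholders for the updates derived in the next sections; the uniform context sampling follows the convention used in this paper, and all names are illustrative.

```python
import numpy as np

def contextual_search(R, phi, compute_weights, fit_distribution,
                      dim_theta, dim_phi, iters=100, N=50, seed=0):
    """Skeleton of Algorithm 1 with a linear-Gaussian search distribution
    pi(theta|s) = N(A^T phi(s), Sigma). The callables compute_weights and
    fit_distribution plug in the updates derived in Section 3."""
    rng = np.random.default_rng(seed)
    A = np.zeros((dim_phi, dim_theta))        # mean-function coefficients A_pi
    Sigma = np.eye(dim_theta)                 # exploration covariance Sigma_pi
    for _ in range(iters):
        s = rng.uniform(0.0, 3.0, size=(N, 1))            # uniform context samples
        Phi = np.array([phi(sk) for sk in s])             # (N, dim_phi) features
        noise = rng.multivariate_normal(np.zeros(dim_theta), Sigma, size=N)
        Theta = Phi @ A + noise                           # sample parameters from q
        rewards = np.array([R(th, sk) for th, sk in zip(Theta, s)])
        d = compute_weights(rewards, Phi)                 # e.g., CREPS weights (Sec. 3.1)
        A, Sigma = fit_distribution(Phi, Theta, d, A, Sigma)  # weighted ML + regularization
    return A, Sigma
```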
The goal is to find a new coefficient matrix Aπ and a
new full covariance matrix Σπ in each iteration. Therefore,
in the following sections, we explain the update rules for
obtaining the distribution parameters.
2 Related Work
For generalizing a learned parameter vector across context
space, a naive solution is to choose several contexts and
find a solution for each one by running an optimization
algorithm for each independently. Subsequently, given contexts and their optimal parameters, we can use supervised
learning and regression methods to fit a context dependent
function that generalizes the optimized contexts to a new,
unseen context [8, 24]. For example, [8] uses such a method.
Although we can use such approaches and they have been
shown to be successful to some extent, they cannot reuse
data-points obtained from optimizing a task with context s
to improve and accelerate the optimization of a task with
another context s ′ . The reason is that such methods assume
that learning parameters for the contexts and the generalization between them are two independent processes.
Such contextual learning, i.e., learning multiple tasks and generalizing between them simultaneously without restarting the learning process, has been established under the name of contextual (multi-task) policy search [10, 16, 18].
Among all methods, information-theoretic contextual policy search algorithms [22], such as the episodic Contextual Relative Entropy Policy Search (CREPS) algorithm [9, 18], have been shown to be successful for multi-task learning without restarting. On the application side, CREPS has successfully optimized a humanoid robot walking controller for different speeds [4]. However, while CREPS has been shown to be effective, it also suffers from premature convergence, which prevents the algorithm from being easily used.
Recently, a contextual learning algorithm based on the (1+1)-CMA-ES algorithm [14] was proposed in [11]. The main problem of this approach is that it is only applicable to one-dimensional context spaces, which significantly limits its applicability.
3 Contextual Relative Entropy Policy Search
with Covariance Matrix Adaptation
Relative Entropy Policy Search (REPS) [22] is a reinforcement learning method that tries to attain maximal expected
reward while bounding the amount of information loss. It
allows an exact policy update and uses data generated while
following an unknown policy to generate a new, better policy. An information-theoretic contextual stochastic search
algorithm was recently presented [18] as a modification
of the REPS algorithm. Despite showing good results to
some extent, we will demonstrate that the algorithm suffers
from premature convergence. By using a new update rule,
such as the rank-μ covariance matrix adaptation method
of the CMA-ES algorithm, we can solve this issue, thus
contributing with the CREPS-CMA algorithm.
Starting with a random distribution q(θ|s), we use it
to create a set of unweighted samples {θ [k] }k=1...N , which
are then evaluated against the environment to create the
dataset {s [k] , θ [k] , R [k] }k=1...N . Afterwards, our contribution
computes a weight d [k] for each element of the dataset, and
these weights are used to calculate a new search distribution
π(θ |s).
The following sections describe how the weights are
found, and how they are used to calculate the new search
distribution.
3.1 Weight Computation
CREPS-CMA uses the same method as CREPS [18] to calculate the weights for each sample. CREPS assumes all current samples have the same probability, and then changes the weights of the samples such that the expected reward is optimized. However, simply optimizing the expected reward based on the samples would assign a probability of 1 to the sample with the best reward value and zero to the others. To solve this problem, CREPS limits the relative entropy between the previous sample-based search distribution q(θ|s) and the next one, π(θ|s), such that the learning process is not unstable and has a smooth evolution.
More formally, CREPS solves the following optimization
program in each iteration
$$\max_{\pi} \int \mu(s) \int \pi(\theta|s)\, R_{s\theta}\, d\theta\, ds,$$
$$\text{s.t.} \quad \int \mu(s)\, KL\left(\pi(\theta|s)\,\|\,q(\theta|s)\right) ds \leq \epsilon,$$
$$\forall s: \quad 1 = \int \pi(\theta|s)\, d\theta, \qquad (2)$$
where μ(s) represents the context distribution, which depends on the current task, R_{sθ} is the expected reward when the parameter vector θ is evaluated in the environment using context s, and KL(π(θ|s) ∥ q(θ|s)) is the Kullback-Leibler divergence from q(θ|s) to π(θ|s).
This equation can be solved in closed form through
the use of Lagrangian multipliers [6], and the closed form
solution for policy π(θ |s) is given by
$$\pi(\theta|s) \propto q(\theta|s) \exp\left(R_{s\theta}/\eta\right), \qquad (3)$$
where η is a Lagrangian multiplier that sets the temperature
of the soft-max distribution given in the previous equation.
The temperature parameter η can be found efficiently by
optimizing the dual function
$$g(\eta) = \eta\epsilon + \eta \int \mu(s) \log \left( \int q(\theta|s) \exp\left(R_{s\theta}/\eta\right) d\theta \right) ds. \qquad (4)$$
By minimizing the dual function g(η) subject to η > 0, we can obtain the optimal value for η [6]. However, we would need many samples {θ^[k,0], θ^[k,1], ...} for each context s^[k] to approximate the log-integral in the dual function (4). This is not feasible, as the context often cannot be directly controlled, and we only have access to a single action θ^[k] per context s^[k].
Therefore, CREPS reformulates the performance criterion to tackle this issue. Instead of optimizing for the policy π(θ|s), CREPS optimizes for the joint probabilities p(s, θ). Additionally, CREPS uses the constraint ∀s: p(s) = μ(s) to enforce that $p(s) = \int_\theta p(s, \theta)\, d\theta$ still reproduces the context distribution μ(s) from which the context samples are drawn. However, this results in an infinite number of constraints, as the context is continuous. In order to solve this problem, CREPS matches feature expectations instead of matching single probabilities, i.e.,

$$\int_s p(s)\phi(s)\, ds = \hat{\phi}, \qquad (5)$$
where $\hat{\phi} = \int_s \mu(s)\phi(s)\, ds$ is the expected feature vector for the given context distribution μ(s) and a given feature space φ. For example, if the feature vector φ(s) contains the linear and squared terms [s, s²] of the context vector s, in practice we are matching the first and second moments of the distribution p(s) with the moments of the context distribution μ(s).² Note that φ can be different from ϕ.
After these considerations, the new optimization program is

$$\max_p \iint p(s, \theta)\, R_{s\theta}\, ds\, d\theta$$
$$\text{s.t.} \quad \epsilon \geq KL\left(p(s, \theta)\,\|\,\mu(s)q(\theta|s)\right),$$
$$\hat{\phi} = \iint p(s, \theta)\phi(s)\, ds\, d\theta,$$
$$1 = \iint p(s, \theta)\, ds\, d\theta. \qquad (6)$$
Algorithm 2 CREPS-CMA

Input: data set D = {s^[k], θ^[k], R^[k]}_{k=1...N} and the old covariance matrix Σ_q

Compute the weights d^[k] for each sample:
1. Optimize the dual function g and find the optimal η and w,
$$g(\eta, w) = \eta\epsilon + \hat{\phi}^T w + \eta \log\left(\frac{1}{N}\sum_{k=1}^{N} \exp\left(\frac{R^{[k]} - \phi(s^{[k]})^T w}{\eta}\right)\right).$$
2. Compute the weights $d^{[k]} = \exp\left(\left(R^{[k]} - \phi(s^{[k]})^T w\right)/\eta\right)$.
3. Normalize the weights such that $\sum_{k=1}^{N} d^{[k]} = 1$.

Compute the new mean function m_π(s):
Use weighted maximum likelihood to estimate the parameters A_π of the new mean function,
$$A_\pi = \left(\Phi^T D \Phi + \lambda I\right)^{-1} \Phi^T D U,$$
where Φ^T = [ϕ^[1], ..., ϕ^[N]] contains the feature vectors of all samples, U = [θ^[1], ..., θ^[N]] contains all the sample parameters, and D is the diagonal weighting matrix containing the weights d^[k].

Compute the sample covariance S_q:
$$S_q = \sum_{k=1}^{N} d^{[k]} \left(\theta^{[k]} - A_q^T \varphi(s^{[k]})\right)\left(\theta^{[k]} - A_q^T \varphi(s^{[k]})\right)^T \Big/ Z,$$
$$Z = \left(\left(\sum_{k=1}^{N} d^{[k]}\right)^2 - \sum_{k=1}^{N} \left(d^{[k]}\right)^2\right) \Big/ \sum_{k=1}^{N} d^{[k]}.$$

Compute the number of effective samples φ_eff and λ:
$$\phi_{\text{eff}} = 1 \Big/ \sum_{k=1}^{N} \left(d^{[k]}\right)^2, \qquad \lambda = \min\left(1, \phi_{\text{eff}}/n^2\right).$$

Compute the new covariance matrix:
$$\Sigma_\pi = (1 - \lambda)\Sigma_q + \lambda S_q.$$

The solution for p(s, θ) is now given by

$$p(s, \theta) \propto q(\theta|s)\,\mu(s) \exp\left(\left(R_{s\theta} - V(s)\right)/\eta\right), \qquad (7)$$

where V(s) = φ(s)^T w can be considered a context-dependent baseline which is subtracted from the return R_{sθ}. The parameters w and η are again Lagrangian multipliers that can be obtained by optimizing the dual function, given as
$$g(\eta, w) = \eta\epsilon + \hat{\phi}^T w + \eta \log \iint \mu(s)\, q(\theta|s) \exp\left(\frac{R_{s\theta} - \phi(s)^T w}{\eta}\right) d\theta\, ds. \qquad (8)$$
² In this paper, in all experiments we use a feature vector that contains all the linear and squared terms of the context vector, i.e., φ(s) = [s, s²], which corresponds to matching the first and second moments of the distributions.
This policy update results in a weight

$$d^{[k]} = \exp\left(\left(R_{s\theta} - V(s)\right)/\eta\right) \qquad (9)$$

for each sample [s^[k], θ^[k]], which we can use to estimate a new search distribution π(θ|s).
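As a sketch of how this weight computation can be implemented in practice, the code below minimizes a sample-based approximation of the dual (Eq. 8, with the integral replaced by the Monte-Carlo average over the N samples, as in Algorithm 2) and returns the normalized weights of Eq. 9. The optimizer choice, the starting point, and the bound on η are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def creps_weights(rewards, Phi, epsilon=1.0):
    """Sample-based CREPS weight computation: minimize the dual g(eta, w)
    of Eq. 8 and return normalized weights d^[k] (Eq. 9)."""
    N, F = Phi.shape
    phi_hat = Phi.mean(axis=0)                 # sample estimate of phi_hat

    def dual(params):
        eta, w = params[0], params[1:]
        adv = (rewards - Phi @ w) / eta        # baseline-subtracted, tempered returns
        # g = eta*eps + phi_hat^T w + eta*log( (1/N) sum_k exp(adv_k) )
        return eta * epsilon + phi_hat @ w + eta * (logsumexp(adv) - np.log(N))

    x0 = np.concatenate(([1.0], np.zeros(F)))
    bounds = [(1e-8, None)] + [(None, None)] * F          # enforce eta > 0
    res = minimize(dual, x0, method="L-BFGS-B", bounds=bounds)
    eta, w = res.x[0], res.x[1:]
    log_d = (rewards - Phi @ w) / eta
    return np.exp(log_d - logsumexp(log_d)), eta, w       # weights sum to 1
```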
Please note that the optimization program that CREPS solves is a convex optimization problem, which can be demonstrated by showing that the second derivatives of the objective and the constraints are always positive. For the KL constraint, writing Q(s, θ) = μ(s)q(θ|s),

$$\frac{\partial^2 KL\left(p(s,\theta)\,\|\,Q(s,\theta)\right)}{\partial p(s,\theta)^2} = \frac{\partial^2 \left[p(s,\theta)\log p(s,\theta) - p(s,\theta)\log Q(s,\theta)\right]}{\partial p(s,\theta)^2} = \frac{\partial \left[\log p(s,\theta) + 1\right]}{\partial p(s,\theta)} = \frac{1}{p(s,\theta)} \geq 0.$$

For the objective, we take the gradient of Eq. 6 w.r.t. p(s, θ) at a fixed point (s′, θ′), i.e., w.r.t. p(s = s′, θ = θ′). This gradient is zero everywhere but at (s′, θ′), where it equals R_{s′θ′}. The gradient of R_{s′θ′} w.r.t. p(s′, θ′) is simply zero, so the second derivative of the objective vanishes.

3.2 Search Distribution Update Rule

In this section we propose update rules to calculate the context-dependent mean function m_π of the search distribution, as well as the context-independent covariance matrix Σ_π. Given the weights obtained in the previous section for each context-parameter pair, {s^[k], θ^[k], d^[k]}_{k=1...N}, and the old Gaussian search distribution, we want to find the new search distribution π(θ|s) = N(θ | m_π(s) = A_π^T ϕ(s), Σ_π) by finding A_π and Σ_π. In fact, at this point we can use a supervised learning algorithm to fit the new distribution. Therefore, similarly to contextual REPS, we can directly use a weighted maximum likelihood estimate to obtain the new distribution, i.e.,

$$\underset{\Sigma_\pi, A_\pi}{\operatorname{argmax}} \sum_{k=1}^{N} d^{[k]} \log \pi\left(\theta^{[k]}|s^{[k]}; \Sigma_\pi, A_\pi\right). \qquad (10)$$

The maximum likelihood estimate gives us the update rules for both the mean function and the covariance matrix. However, maximum likelihood fitting of the covariance matrix leads to over-fitting and therefore to premature convergence. We will explain how we solve this problem. First, we present the update rule for the mean function.

3.2.1 Context-Dependent Mean-Function Update Rule

In order to obtain the context-dependent mean function m_π, we directly solve the maximum likelihood estimation problem in Eq. 10 for A_π. It is a weighted linear regression problem, and the solution for A_π can be obtained in closed form, which is given by

$$A_\pi = \left(\Phi^T D \Phi + \lambda I\right)^{-1} \Phi^T D U, \qquad (11)$$

where Φ^T = [ϕ^[1], ..., ϕ^[N]] contains the feature vectors of all samples, U = [θ^[1], ..., θ^[N]] contains all the sample parameters, and D is the diagonal matrix with the weights d^[k].
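A minimal sketch of this closed-form weighted regression (Eq. 11) is given below; the small ridge term stands in for the λI regularizer, and its value is an assumption.

```python
import numpy as np

def fit_mean(Phi, Theta, d, ridge=1e-9):
    """Closed-form weighted linear regression for A_pi (Eq. 11).
    Phi: (N, F) context features, Theta: (N, n) sampled parameters,
    d: (N,) normalized weights; ridge plays the role of the lambda*I term."""
    F = Phi.shape[1]
    G = Phi.T * d                                   # Phi^T D without forming D explicitly
    return np.linalg.solve(G @ Phi + ridge * np.eye(F), G @ Theta)   # (F, n) coefficients
```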
3.2.2 Covariance Matrix Update Rule

In practice, again similarly to CREPS, we could also just solve the optimization program in Eq. 10 for Σ_π. In this case, the solution Σ_π = S can be obtained in closed form. This solution is also known as the sample covariance matrix S, because it is purely based on the samples, and is given by

$$S = \sum_{k=1}^{N} d^{[k]} \left(\theta^{[k]} - A_\pi^T \varphi(s^{[k]})\right)\left(\theta^{[k]} - A_\pi^T \varphi(s^{[k]})\right)^T \Big/ Z,$$
$$Z = \left(\left(\sum_{k=1}^{N} d^{[k]}\right)^2 - \sum_{k=1}^{N} \left(d^{[k]}\right)^2\right) \Big/ \sum_{k=1}^{N} d^{[k]}. \qquad (12)$$
CREPS directly uses this covariance update rule. However, this solution is typically over-fitted. The reason is that the number of free parameters of the covariance matrix is typically much larger than the number of available samples, so the samples cannot fully span the parameter space. Therefore, fitting these samples typically causes over-fitting, and we can argue that the sample covariance matrix of Eq. 12 estimates the true covariance matrix poorly [3]. In other words, fitting this limited number of samples causes a decrease of exploration along the many dimensions of the parameter space that are not present in our samples. That is the main reason why CREPS suffers from premature convergence. In order to obtain a competitive contextual stochastic search algorithm, this loss of exploration after each distribution fitting should be controlled. Therefore, we need to limit the change of the new covariance matrix with respect to the old covariance matrix. This constraint maintains exploration along the different dimensions of the parameter space. Such bounding has been used by the CMA-ES algorithm, which is a non-contextual algorithm. Therefore, inspired by CMA-ES, we use a convex combination of the old covariance matrix Σ_q and the sample covariance matrix S from Eq. 12, i.e.,

$$\Sigma_\pi = (1 - \lambda)\Sigma_q + \lambda S. \qquad (13)$$
The interpolation factor λ ∈ [0, 1] controls the information
loss of the new covariance matrix by defining how much
the new covariance matrix diverges from the old covariance
matrix Σ q towards the sample covariance matrix S. This
factor enables the algorithm to define the amount of
information from the new samples to incorporate into the
covariance matrix. This approach simply avoids over-fitting
the new samples by bounding the change of the new
covariance matrix. Note that if we set λ = 1, we obtain the update rule used by CREPS. The factor λ can be set in different ways. For example, we can choose λ such that the entropy of the new search distribution is reduced by a certain amount [3]. However, for CREPS-CMA, we use an update rule similar to the rank-μ update of the CMA-ES algorithm [12], i.e.,

$$\phi_{\text{eff}} = 1 \Big/ \sum_{k=1}^{N} \left(d^{[k]}\right)^2, \qquad \lambda = \min\left(1, \frac{\phi_{\text{eff}}}{n^2}\right), \qquad (14)$$
where φ_eff is the number of effective samples and n is the dimension of the parameter space θ. In order to calculate the sample covariance matrix in Eq. 12, we can use either the new mean function or the old mean function of the current search distribution. Using the new mean function m_π to calculate the sample covariance matrix in Eq. 12 increases the probability of reproducing the current samples in the dataset. The distribution will therefore shrink in each iteration in order to cover those weighted samples. In other words, we obtain a distribution that tends to reproduce the current samples, which may still cause premature convergence. We would instead prefer the new distribution to increase the probability of selected steps, as opposed to selected samples, i.e., we prefer to repeat the steps that resulted in well-rewarded samples, instead of the samples we have in the dataset. That is, we are interested in repeating the mutations that resulted in the good samples in our data set. To achieve this, we can simply use the old mean function m_q to compute the sample covariance matrix, i.e.,

$$S_q = \sum_{k=1}^{N} d^{[k]} \left(\theta^{[k]} - A_q^T \varphi(s^{[k]})\right)\left(\theta^{[k]} - A_q^T \varphi(s^{[k]})\right)^T \Big/ Z, \qquad (15)$$
where Z is given in Eq. 12. By using the old mean, we in fact encode into the new covariance matrix the information about the steps taken in the last iteration. As these steps are weighted, a distribution with the new covariance matrix will repeat the successful steps, while the unsuccessful steps are discarded. Please note that the weights of the samples define the amount of success of a step; therefore, the steps with higher weights have a higher probability of being repeated. We will use both the new mean function and the old mean function, and compare the two resulting variants of the algorithm. The first variant uses the mean calculated for the current iteration, m_π, to obtain the new covariance matrix, see Eq. 13, and is referred to as CREPS-CMACurr. The second variant, referred to as CREPS-CMAOld, uses the old mean function m_q, i.e.,

$$\Sigma_\pi = (1 - \lambda)\Sigma_q + \lambda S_q. \qquad (16)$$

We will compare both approaches and provide an empirical analysis showing that using the old mean function effectively avoids premature convergence, while using the new mean can still result in premature convergence.
See Algorithm 2 for a compact representation of the
CREPS-CMA algorithm.
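The sketch below assembles the covariance update of Eqs. 12–16 in a few lines of Python; by default it uses the old mean A_q (the CREPS-CMAOld variant argued for above), and passing the new coefficients A_π instead gives CREPS-CMACurr. Variable names are illustrative.

```python
import numpy as np

def update_covariance(Phi, Theta, d, A_q, Sigma_q, A_pi=None):
    """Regularized covariance update of CREPS-CMA (Eqs. 12-16), a sketch.
    The default uses the old mean A_q (CREPS-CMA_Old), so the update
    captures the successful *steps*; passing A_pi gives CREPS-CMA_Curr."""
    A = A_q if A_pi is None else A_pi
    diff = Theta - Phi @ A                                    # deviations from the mean
    Z = (d.sum() ** 2 - np.sum(d ** 2)) / d.sum()             # normalizer of Eq. 12
    S = (diff * d[:, None]).T @ diff / Z                      # weighted sample covariance
    n = Theta.shape[1]
    phi_eff = 1.0 / np.sum(d ** 2)                            # effective number of samples
    lam = min(1.0, phi_eff / n ** 2)                          # rank-mu style rate (Eq. 14)
    return (1.0 - lam) * Sigma_q + lam * S                    # convex combination (Eq. 16)
```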
4 Interpretation of the Regularized
Covariance Matrix Update Rule
So far, we motivated the importance of regularizing the covariance matrix with intuitive arguments. In this section, we propose a KL-regularized objective that enables us to derive the covariance matrix update rules from a single principle. We can obtain such an update rule by increasing the likelihood of the weighted steps, subject to a KL-divergence penalty between the new and old search distributions to avoid overfitting the samples, i.e.,
$$\underset{\Sigma}{\operatorname{argmax}}\; J = \underbrace{\sum_{k=1}^{N} d^{[k]} \log \pi\left(\theta^{[k]}|s^{[k]}\right)}_{\text{incorporates successful steps } (J_1)} \;-\; \underbrace{\gamma \sum_{k=1}^{N} KL\left(\pi_{old}(\theta|s^{[k]})\,\|\,\pi(\theta|s^{[k]})\right)}_{\text{avoids overfitting } (J_2)}.$$
In this objective, γ > 0 trades off maximizing the likelihood of the weighted steps against limiting the information loss between the new and old search distributions. If we set γ to zero, we obtain the maximum likelihood objective without any penalty on information loss, which, as discussed, leads to overfitting and premature convergence. If we use a Gaussian distribution as the underlying sampling policy, we can solve this KL-regularized objective in closed form, and the solution is the regularized covariance update rule we proposed in the last section. Next we explain how we solve this objective. Given samples {s^[k], θ^[k], d^[k]}_{k=1...N} and a linear multivariate normal distribution, i.e.,

$$\pi(\theta|s) = \mathcal{N}\left(\theta\,\middle|\,m(s), \Sigma\right),$$

we maximize the objective J to obtain a new covariance matrix Σ. As in this objective we are only interested in the covariance matrix, we set the mean function of the objective to the old one, i.e., m(s) = m_old(s). As we discussed
in the previous sections, with this simple trick we find a covariance matrix Σ that optimizes the likelihood of the weighted steps, which can considerably reduce the risk of premature convergence, as will be shown in the experiments.
First, we need to calculate the gradient of J with respect to Σ⁻¹. Please note that there are two terms in the objective. Therefore, we take the gradients of the terms (J₁, J₂) separately, i.e.,

$$\nabla_{\Sigma^{-1}} J = \nabla_{\Sigma^{-1}} J_1 - \nabla_{\Sigma^{-1}} J_2.$$

First we take the gradient of J₁. Writing out the weighted log-likelihood (the weights d^[k] sum to one) results in

$$J_1 = \sum_{k=1}^{N} d^{[k]} \log \pi\left(\theta^{[k]}|s^{[k]}\right) = \text{const} - \frac{1}{2}\log\det\Sigma - \frac{1}{2}\operatorname{tr}\left(\Sigma^{-1} \sum_{k=1}^{N} d^{[k]} \left(\theta^{[k]} - m_{old}(s^{[k]})\right)\left(\theta^{[k]} - m_{old}(s^{[k]})\right)^T\right),$$

where const collects the terms that do not depend on Σ, and det(Σ) and tr(Σ) are the determinant and the trace of the matrix Σ, respectively. We obtain the gradient

$$\nabla_{\Sigma^{-1}} J_1 = \frac{1}{2}\Sigma - \frac{1}{2}\sum_{k=1}^{N} d^{[k]} \left(\theta^{[k]} - m_{old}(s^{[k]})\right)\left(\theta^{[k]} - m_{old}(s^{[k]})\right)^T.$$

Second, we take the gradient of J₂, i.e.,

$$J_2 = \gamma \sum_{k=1}^{N} KL\left(\pi_{old}(\theta|s^{[k]})\,\|\,\pi(\theta|s^{[k]})\right) = \gamma \sum_{k=1}^{N} \left[\text{const} + \frac{1}{2}\operatorname{tr}\left(\Sigma^{-1}\Sigma_{old}\right) + \frac{1}{2}\left(m(s^{[k]}) - m_{old}(s^{[k]})\right)^T \Sigma^{-1} \left(m(s^{[k]}) - m_{old}(s^{[k]})\right) + \frac{1}{2}\ln\frac{\det\Sigma}{\det\Sigma_{old}}\right],$$

where the quadratic term vanishes because we set m(s) = m_old(s). The gradient of J₂ is as follows (the constant factor arising from the sum of identical KL terms can be absorbed into γ):

$$\nabla_{\Sigma^{-1}} J_2 = \frac{\gamma}{2}\Sigma_{old} - \frac{\gamma}{2}\Sigma.$$

Please note that the covariance matrix in our set-up is context independent; that is the reason the gradient does not depend on the context distribution. Now, to find the optimum for Σ, we simply set the derivative of the KL-regularized objective function, ∇J, to zero, i.e.,

$$\nabla_{\Sigma^{-1}} J_1 - \nabla_{\Sigma^{-1}} J_2 = 0,$$

and we get

$$\frac{1}{2}\Sigma - \frac{1}{2}\sum_{k=1}^{N} d^{[k]} \left(\theta^{[k]} - m_{old}(s^{[k]})\right)\left(\theta^{[k]} - m_{old}(s^{[k]})\right)^T - \frac{\gamma}{2}\Sigma_{old} + \frac{\gamma}{2}\Sigma = 0.$$

To find the new Σ, we rearrange the terms in the above equation and obtain

$$\Sigma = \frac{\gamma}{1+\gamma}\Sigma_{old} + \frac{1}{1+\gamma}\sum_{k=1}^{N} d^{[k]} \left(\theta^{[k]} - m_{old}(s^{[k]})\right)\left(\theta^{[k]} - m_{old}(s^{[k]})\right)^T.$$

We can see that the coefficients in the above update rule sum to 1, i.e.,

$$\frac{\gamma}{1+\gamma} + \frac{1}{1+\gamma} = 1.$$

We can now rewrite this with

$$\lambda = \frac{1}{1+\gamma}, \qquad 1 - \lambda = \frac{\gamma}{1+\gamma},$$

and by rewriting the equation for Σ we get

$$\Sigma = (1 - \lambda)\Sigma_{old} + \lambda \sum_{k=1}^{N} d^{[k]} \left(\theta^{[k]} - m_{old}(s^{[k]})\right)\left(\theta^{[k]} - m_{old}(s^{[k]})\right)^T.$$

This is exactly the regularized covariance matrix update rule we discussed in the previous section. Please note that, as γ ≥ 0, we can easily infer that λ ≤ 1.
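As a quick sanity check of this derivation, the following hedged numerical sketch evaluates the KL-regularized objective J (with m(s) = m_old(s), so only the covariance matters, and with the sum of identical KL terms absorbed into γ) and verifies that the closed-form solution is indeed a maximizer under random symmetric perturbations. The dimensions, seed, and perturbation scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, gamma = 3, 50, 2.0

steps = rng.normal(size=(N, n))                   # theta^[k] - m_old(s^[k])
d = rng.random(N); d /= d.sum()                   # normalized weights
Sigma_old = np.cov(rng.normal(size=(200, n)).T)   # an SPD "old" covariance
S = (steps * d[:, None]).T @ steps                # weighted outer products of the steps

def J(Sigma):
    inv = np.linalg.inv(Sigma)
    logdet = np.linalg.slogdet(Sigma)[1]
    J1 = -0.5 * logdet - 0.5 * np.trace(inv @ S)  # weighted log-likelihood, up to const
    kl = 0.5 * (np.trace(inv @ Sigma_old) - n + logdet - np.linalg.slogdet(Sigma_old)[1])
    return J1 - gamma * kl

lam = 1.0 / (1.0 + gamma)
Sigma_star = (1.0 - lam) * Sigma_old + lam * S    # the closed-form solution
for _ in range(100):                              # random symmetric perturbations
    P = rng.normal(size=(n, n)); P = 0.01 * (P + P.T)
    assert J(Sigma_star) >= J(Sigma_star + P) - 1e-9
print("closed-form Sigma maximizes the KL-regularized objective")
```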
5 Experiments
We now demonstrate and compare the performance of our algorithms, i.e., CREPS-CMACurr and CREPS-CMAOld, against state-of-the-art methods. We use three different environments to evaluate our algorithms exhaustively: mathematical functions, chosen for their complexity and non-linear landscapes; a robotic arm motion control problem, applicable in real-world contexts; and a high-dimensional simulated humanoid kick, which was later integrated into the FCPortugal3D team, participating in the worldwide RoboCup 3D Simulation League.

The first environment consists of a set of standard optimization test functions [13, 20, 26, 30], including Schwefel's Problem and the Rosenbrock function. The functions were extended to the contextual paradigm, and the optimization target is the optimal 15-dimensional vector θ for a 1-dimensional context s. The results can be seen in Figs. 1 and 2, where CREPS-CMA successfully learned the contextual tasks, while standard Contextual REPS suffered from premature convergence.
The second environment consists of a robotic arm with five joints that must reach a certain target, dependent on the task's context. A complex but comprehensive way to represent the arm's movements is through dynamic movement primitives (DMPs) [15], using five basis functions for each joint, for a total of a 25-dimensional parameter vector. The task's context, i.e., the point which must be reached by the robotic arm, is a 2-dimensional position. Figure 4 shows the setup of the robot and the optimization results.
The third environment is a simulated humanoid ball kick, which was split into two different tasks.
Fig. 1 The performance comparison of CREPS (red), CREPS-CMAOld (blue) and CREPS-CMACurr (green), which use the old and the current mean, respectively, to calculate the variance in the distribution update. The y-axis is the error (in logarithmic scale) and the x-axis is the number of iterations elapsed. Results are shown for the optimization of the contextual version of the standard functions a Rosenbrock, b Sphere, c Shifted Sphere, d Shifted Schwefel's, e CigTab, f Tablet, g Elliptic, and h an Elliptic variant. The results show that Contextual REPS suffers from premature convergence, while CREPS-CMAOld solves the problem, despite being slower than CREPS-CMACurr
Fig. 2 The performance comparison of CREPS (red), CREPS-CMAOld (blue) and CREPS-CMACurr (green), which use the old and the current mean, respectively, to calculate the variance in the distribution update. The y-axis is the error (in logarithmic scale) and the x-axis is the number of iterations elapsed. Results are shown for the optimization of the contextual version of the standard functions a Different Powers, b Plane, c Two Axes, d Cigar, e Rastrigin's, f Parabolic Ridge, and g Sharp Ridge. The results show that Contextual REPS suffers from premature convergence, while CREPS-CMAOld solves the problem, despite being slower than CREPS-CMACurr
One of the tasks focuses on precision: the context defines a 2-dimensional point where the ball should stop. The remaining task focuses on flexibility: the agent must kick the ball as far as possible, but the ball's initial position with respect to the robot varies and is defined by a 2-dimensional context. A linear interpolation model is used to define the robot's motion, by defining the initial and final robot postures, as well as the time taken to perform the movement. Figure 5 shows an example of the humanoid's movements and the performance results for several contexts. Figure 6 shows examples of possible ball positions relative to the robot, the range of possible positions, and the results for several contexts.
Figures 1 and 2 show the average, as well as the standard
deviation, of the optimization results for the first series of
tasks. The results are shown in a logarithmic or linear scale,
over 5 trials for each experiment. Figure 4 shows the results
for the planar reaching task. Figure 5c shows the average
and two times the standard deviation of the results over 10
trials for the first humanoid kick task. Figure 6c shows the
average kick distance over 10 trials for the second humanoid
kick task.
In the next sections we analyze the results for each environment in detail. In general, we show that our newly proposed algorithms, i.e., CREPS-CMACurr and CREPS-CMAOld, achieve state-of-the-art results and successfully avoid premature convergence.
5.1 Standard Optimization Test Functions
In this section, we measure the performance of our proposed algorithms on fifteen popular and challenging optimization functions [13, 20, 26, 30], as given in Table 1. These functions are originally non-contextual. We therefore contextualized them, i.e., we chose x = θ + As, where A is a constant matrix that was chosen randomly.

Since our context s is 3-dimensional, A is a p × 3 matrix. Our definition of x means that the optimal θ for these functions depends linearly on the given context s. The initial search area of θ for all experiments is restricted to the hypercube −5 ≤ θ_i ≤ 5, i = 1, ..., p, and contexts are uniformly sampled from the interval 0 ≤ s_i ≤ 3, i = 1, ..., z, where z is the dimension of the context space s. In our experiments, the mean of the initial distribution has been chosen randomly in the defined search area.
We generated 50 new samples per iteration, and compared both versions of CREPS-CMA, CREPS-CMACurr and CREPS-CMAOld, against the original Contextual REPS. The results are
shown in Figs. 1 and 2, where CREPS-CMA converged
to the solutions, as opposed to Contextual REPS, which
converged prematurely to a poor solution. We can also see that CREPS-CMACurr speeds up the convergence process in some cases, but leads to premature convergence in others. CREPS-CMAOld, on the other hand, does not converge prematurely, despite sometimes not being as fast. While CREPS-CMAOld is sometimes slower than CREPS-CMACurr, it robustly solves all tasks, whereas CREPS-CMACurr can also suffer from premature convergence. Both of these algorithms, however, outperform the original CREPS by orders of magnitude.
We also demonstrate the applicability and performance of CREPS-CMAOld for several combinations of context and problem dimensions. Figure 3 shows the number of samples needed to learn four functions, namely the Rosenbrock, Sphere, Shifted Schwefel, and Elliptic functions. The results show, as expected, that if the problem or context dimensionality increases, the learning process becomes more complex and a larger number of samples is needed. The results also show the applicability of CREPS-CMAOld to high-dimensional problems (up to 64-dimensional problems and 5-dimensional contexts) (Fig. 4).
5.2 Planar Reaching
This environment consists of a 5-joint robotic arm, controlled with DMPs, that has to reach a certain point in space. Each segment of the robot's arm has a length of 1 meter. The arm's first target is a point v50, which must be reached by the arm's end effector within 50 time-steps, and the second
Table 1 The 15 optimization functions used to compare the performance of the algorithms, where x = θ + As. The Stop column gives the target objective value at which a run is considered converged

| Name | Function | Stop |
|---|---|---|
| Rosenbrock | $\sum_{i=1}^{p-1}\left[100(x_{i+1} - x_i^2)^2 + (1 - x_i)^2\right]$ | 10⁻⁵ |
| Sphere | $\sum_{i=1}^{p} x_i^2$ | 10⁻⁵ |
| Shifted sphere | $\sum_{i=1}^{p} x_i^2$ | 10⁻⁵ |
| Shifted Schwefel | $\sum_{i=1}^{p}\left(\sum_{j=1}^{i} x_j\right)^2$ | 10⁻⁵ |
| CigTab | $x_1^2 + 10^8 x_{p-1}^2 + 10^4 \sum_{i=2}^{p-2} x_i^2$ | 10⁻⁵ |
| Tablet | $(1000 x_1)^2 + 100 \sum_{i=2}^{p} x_i^2$ | 10⁻⁵ |
| Shifted rotated high conditioned elliptic 6 | $\sum_{i=1}^{p} (10^6)^{\frac{i-1}{p-1}} x_i^2$ | 10⁻⁵ |
| Shifted rotated high conditioned elliptic 4 | $\sum_{i=1}^{p} (10^4)^{\frac{i-1}{p-1}} x_i^2$ | 10⁻⁵ |
| Different powers | $\sum_{i=1}^{p} \lvert x_i \rvert^{2 + 10\frac{i-1}{p-1}}$ | 10⁻⁵ |
| Plane | $x_1$ | −10000 |
| Two Axes | $\sum_{i=1}^{\lfloor p/2 \rfloor} x_i^2 + 10^6 \sum_{i=\lfloor p/2 \rfloor + 1}^{p} x_i^2$ | 10⁻⁵ |
| Cigar | $x_1^2 + 100 \sum_{i=2}^{p} (1000 x_i)^2$ | 10⁻⁵ |
| Rastrigin's multimodal | $10p + \sum_{i=1}^{p} \left(x_i^2 - 10\cos(2\pi x_i)\right)$ | 10⁻⁵ |
| Sharp ridge | $-x_1 + 100\sqrt{\sum_{i=2}^{p} x_i^2}$ | −10000 |
| Parabolic ridge | $-x_1 + 100\sum_{i=2}^{p} x_i^2$ | −10000 |
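As an illustration, the sketch below implements two of the table's functions and the contextualization wrapper x = θ + As; the sharp-ridge sign follows the conventional definition in [30], and all names and constants are illustrative, not from the paper's code.

```python
import numpy as np

def rosenbrock(x):
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def sharp_ridge(x):
    return -x[0] + 100.0 * np.sqrt(np.sum(x[1:] ** 2))

def contextualize(f, A):
    """Wrap a non-contextual test function with the shift x = theta + A s."""
    return lambda theta, s: f(theta + A @ s)

rng = np.random.default_rng(0)
p, z = 15, 3                                       # problem and context dimensions
f = contextualize(rosenbrock, rng.normal(size=(p, z)))
print(f(np.zeros(p), rng.uniform(0, 3, size=z)))   # a cost to be minimized
```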
Fig. 3 The number of samples needed by CREPS-CMAOld to converge to the target values shown in Figs. 1 and 2 for the functions a Rosenbrock, b Sphere, c Schwefel, and d Elliptic. The x-axis represents the context dimension, from 1 to 5, and the y-axis represents the problem dimension, from 2 to 64
Fig. 4 a Algorithmic comparison for a planar reaching task (5 joints, 25 parameters). In this task, CREPS-CMAOld converged faster and learned the task well, while Contextual REPS suffers from premature convergence and cannot learn the task. b The planar reaching task used for our comparisons. A 5-link planar robot has to reach the waypoint v50 = [1, 1] in task space. The waypoint position is the 2-dimensional context vector and is given. The waypoint is indicated by the red cross. The postures of the resulting motion are shown as overlay, where darker postures indicate a posture which is close in time to the waypoint
Fig. 5 a The initial position of an exemplary humanoid kick. b The final position of an exemplary humanoid kick. c The performance of the learned linear (blue) and non-linear (red) policies: the y-axis represents the distance at which the ball stopped from the intended target, in meters, while the x-axis represents the distance from which the ball was kicked, also in meters
Fig. 6 a A possible ball position, close to the agent. b A possible ball position, far from the agent. c The range of possible ball positions, relative to the agent
target is a point v100 = [5, 0], which must be reached by time-step 100. The first point's coordinates are the task's context, bounded between the points [0, 0] and [2, 2].
We modeled the task's reward based on quadratic cost terms for the distances to the two target points, as well as quadratic costs for high accelerations, to punish jerky movements and energy consumption. We used 5 basis functions per joint for the DMPs, while the goal attractor for reaching the final state was assumed to be known, for a total of 25 dimensions in our parameter vector. We generated 100 new samples per iteration. The results show that CREPS-CMAOld successfully learns the task without premature convergence, and significantly outperforms the original Contextual REPS.
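For concreteness, a hedged sketch of such a reward is shown below; the weighting coefficients and names are illustrative assumptions, not the values used in our experiments.

```python
import numpy as np

def reaching_reward(traj, v50, v100, w_via=1e3, w_acc=1e-2):
    """Sketch of the reaching reward described above: quadratic costs for
    missing the two via-points plus quadratic costs for high accelerations.
    traj: (100, 2) end-effector positions; w_via and w_acc are assumptions."""
    acc = np.diff(traj, n=2, axis=0)                  # finite-difference accelerations
    cost = (w_via * np.sum((traj[49] - v50) ** 2)     # via-point at time step 50
            + w_via * np.sum((traj[99] - v100) ** 2)  # final point at time step 100
            + w_acc * np.sum(acc ** 2))               # punish jerky movements
    return -cost
```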
Fig. 7 The median kick distances of the policies learned by CREPS and CREPS-CMAOld over 10 trials. The x-axis represents the ball position xB and the y-axis the ball position yB, both in relation to the agent's center. Warmer colors represent longer distances traveled by the ball, and it is clear that CREPS-CMAOld outperforms CREPS over the range of possible ball positions. a Kick distances of the CREPS policy. b Kick distances of the CREPS-CMAOld policy
Fig. 8 Two exemplary movement sequences for a kick where (a, c) the ball is close to the agent and (b, d) it is far from the agent. a The first part of a movement sequence to kick a ball close to the agent. b The first part of a movement sequence to kick a ball far from the agent. c The last part of a movement sequence to kick a ball close to the agent. d The last part of a movement sequence to kick a ball far from the agent
5.3 Humanoid Kick
The third and final environment consists of two distinct tasks, modeled in a simulated soccer environment for humanoid robots.
The first task focuses on precision. The ball is in a fixed
initial location with respect to the agent, and the agent is
given a 1-dimensional context describing the distance the
ball should travel, which is within the [3, 12] meter interval.
The motion controller is a linear interpolator between an initial posture (an l-dimensional vector of joint angles), a final posture (an l-dimensional vector of joint angles), and the time t between postures. The agent has 6 joints per leg, and the remaining joints are ignored, giving 12 joint dimensions for each posture and a final 25-dimensional parameter vector. Figure 5a and b show the initial and final positions of an exemplary kick.
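A minimal sketch of such an interpolation controller is shown below; the time step and the function name are assumptions for illustration.

```python
import numpy as np

def kick_trajectory(q_init, q_final, t_total, dt=0.02):
    """Sketch of the linear-interpolation motion controller described
    above: the 12 leg-joint angles move linearly from q_init to q_final
    over t_total seconds (dt and the names are assumptions)."""
    steps = max(2, int(round(t_total / dt)))
    alpha = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - alpha) * q_init + alpha * q_final    # (steps, 12) joint targets
```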
The reward function was modeled as

$$R(\theta, s) = -(x - s)^2 - y^2,$$

where x and y are the distances traveled by the ball along the X and Y axes. This function penalizes deviations from the target distance s. Based on previous work [4], we used radial basis features to generalize over non-linear contexts, and generated 20 samples per iteration. Using CREPS-CMAOld, we achieved high accuracy, as shown in Fig. 5c, after 1000 iterations, with an average error distance of 0.34 ± 0.11 meters. (A demonstration video is available on-line at https://www.dropbox.com/s/bl27w9uqe7qh1sd/ICARSC16kick.mp4.)

The second task focuses on flexibility. The ball is in a context-dependent initial location with respect to the agent, bounded within the box shown in Fig. 6c, where xB ranges from 0.15 to 0.3 meters and yB from −0.15 to 0.15 meters. These values were chosen based on the agent's architecture and movement capabilities: smaller values would cause the agent's body to overlap the ball, while larger values would put the ball out of the agent's reach. The goal of this task is to kick the ball as far as possible. To speed up training, we train only one of the agent's legs, and mirror the joints for the other leg. Figure 6a and b show exemplary ball positions in relation to the agent.

The reward function was modeled as

$$R(\theta, s) = x^2 - y^2,$$
where x and y are the distances traveled by the ball along the
X and Y axes. This function rewards distance traveled and penalizes side-ways deviation. We generated 100 new samples per iteration, and compared Contextual REPS and CREPS-CMAOld after 700 iterations, with the results shown in Fig. 7. (A demonstration video of CREPS-CMAOld is available on-line at https://www.dropbox.com/s/uoyetyxt1slonhh/vidKicks2.wmv.) The original algorithm achieved an average kick distance of 2.67 ± 2.69 meters, while our proposal achieved 6.50 ± 2.95 meters. We show two distinct kick motions in Fig. 8. Figure 8a and c show the sequence for a kick where the context is xB = 0.15 m and yB = 0.05 m, while Fig. 8b and d show a kick with the context xB = 0.25 m and yB = 0.02 m.
6 Conclusion
Many optimization algorithms have been proposed by the community. However, most of these algorithms optimize a fixed task with a single context, such as optimizing for the lowest energy consumption, the ideal gait for the highest speed, or both. Stochastic search methods, and in particular CMA-ES, have shown many successes on such single-context black-box optimization problems. Although stochastic search algorithms are well studied, in this paper we studied contextual stochastic search algorithms, where we optimize for a context-dependent function instead of a point estimate. We built on a previous state-of-the-art algorithm for contextual stochastic search, i.e., CREPS, and a non-contextual stochastic search method, i.e., CMA-ES. While CREPS enjoys contextual learning features, it systematically suffers from premature convergence. On the other hand, CMA-ES does not have contextual learning capabilities, but effectively avoids premature convergence. We therefore introduced a contextual stochastic search algorithm that has the best of both worlds, i.e., contextual learning and premature convergence avoidance.
In this paper, inspired by CMA-ES, we alleviated the premature convergence problem of contextual REPS, which resulted in two variants, the CREPS-CMAOld and CREPS-CMACurr algorithms; one variant uses the old mean and the other uses the new mean. We performed an exhaustive evaluation using three different environments, ranging from standard functions to complex simulated robotics tasks. The results show that both algorithms perform favorably and outperform the original CREPS by orders of magnitude. Additionally, we showed that CREPS-CMAOld solves the premature convergence issue effectively and robustly solves all the tasks. We also showed the applicability of the algorithm in practical situations, such as a humanoid robot kick task
and a planar reaching task. In the future, we will investigate different ways to incorporate CMA-ES's step-size control feature into CREPS-CMA, for faster convergence and less sensitivity to hyperparameters.
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
References
1. Abdolmaleki, A., Lau, N., Reis, L.P., Peters, J., Neumann, G.:
Contextual policy search for linear and nonlinear generalization
of a humanoid walking controller. J. Intell. Robot. Syst. 10, 1–16
(2016)
2. Abdolmaleki, A., Lioutikov, R., Peters, J., Lau, N., Reis, L., Neumann, G.: Regularized Covariance Estimation for Weighted Maximum Likelihood Policy Search Methods. In: Advances in Neural Information Processing Systems (NIPS). MIT Press (2015)
3. Abdolmaleki, A., Lau, N., Reis, L., Neumann, G.: Regularized covariance estimation for weighted maximum likelihood policy search methods. In: Proceedings of the International Conference on Humanoid Robots (HUMANOIDS) (2015)
4. Abdolmaleki, A., Lau, N., Reis, L., Peters, J., Neumann, G.: Contextual Policy Search for Generalizing a Parameterized Biped Walking Controller. In: IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC) (2015)
5. Abdolmaleki, A., Simoes, D., Lau, N., Reis, L.P., Neumann,
G.: Contextual Relative Entropy Policy Search with Covariance
Matrix Adaptation. In: 2016 IEEE International Conference On
Autonomous Robot Systems and Competitions (ICARSC), pp.
94–99. IEEE (2016)
6. Boyd, S., Vandenberghe, L.: Convex optimization. University
Press, Cambridge (2004)
7. Broomhead, D.S., Lowe, D.: Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks. Tech. rep., DTIC Document (1988)
8. Da Silva, B., Konidaris, G., Barto, A.: Learning parameterized
skills. International Conference on Machine Learning (ICML)
(2012)
9. Daniel, C., Neumann, G., Peters, J.: Hierarchical Relative
Entropy Policy Search. In: International Conference on Artificial
Intelligence and Statistics (AISTATS) (2012)
10. Deisenroth, M.P., Englert, P., Peters, J., Fox, D.: Multi-task
Policy Search for Robotics. In: IEEE International Conference on
Robotics and Automation (ICRA) (2014)
11. Ha, S., Liu, C.: Evolutionary optimization for parameterized
whole-body dynamic motor skills. In: Proceedings of IEEE
International Conference on Robotics and Automation (ICRA)
(2016)
12. Hansen, N., Müller, S., Koumoutsakos, P.: Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation (2003)
13. Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution strategies. Evol. Comput. 9(2), 159–195 (2001)
14. Igel, C., Suttorp, T., Hansen, N.: A computational efficient
covariance matrix update and a (1+ 1)-CMA for evolution
strategies. In: Proceedings of the 8th annual conference on Genetic
and evolutionary computation (2006)
15. Ijspeert, A., Schaal, S.: Learning Attractor Landscapes for
Learning Motor Primitives. In: Advances in Neural Information
Processing Systems 15(NIPS) (2003)
16. Kober, J., Oztop, E., Peters, J.: Reinforcement Learning to adjust
Robot Movements to New Situations. In: Proceedings of the
Robotics: Science and Systems Conference (RSS) (2010)
17. Kober, J., Peters, J.: Policy Search for Motor Primitives in
Robotics. Mach. Learn. 8, 1–33 (2010)
18. Kupcsik, A., Deisenroth, M.P., Peters, J., Neumann, G.: Data-Efficient Contextual Policy Search for Robot Movement Skills. In: Proceedings of the National Conference on Artificial Intelligence (AAAI) (2013)
19. Mannor, S., Rubinstein, R., Gat, Y.: The Cross Entropy method
for Fast Policy Search. In: Proceedings of the 20th International
Conference on Machine Learning (ICML) (2003)
20. Molga, M., Smutnicki, C.: Test Functions for Optimization Needs.
In: http://www.zsd.ict.pwr.wroc.pl/files/docs/functions.pdf (2005)
21. Niehaus, C., Röfer, T., Laue, T.: Gait optimization on a humanoid robot using particle swarm optimization. In: Proceedings of the Second Workshop on Humanoid Soccer Robots in conjunction with the IEEE-RAS International Conference on Humanoid Robots, pp. 1–7 (2007)
22. Peters, J., Mülling, K., Altun, Y.: Relative Entropy Policy Search.
In: Proceedings of the 24th National Conference on Artificial
Intelligence (AAAI). AAAI Press (2010)
23. Rückstieß, T., Felder, M., Schmidhuber, J.: State-dependent
Exploration for Policy Gradient Methods. In: Proceedings of the
European Conference on Machine Learning (ECML) (2008)
24. Stulp, F., Raiola, G., Hoarau, A., Ivaldi, S., Sigaud, O.: Learning Compact Parameterized Skills with a Single Regression. In: IEEE-RAS International Conference on Humanoid Robots (Humanoids) (2013)
25. Stulp, F., Sigaud, O.: Path Integral Policy Improvement with
Covariance Matrix Adaptation. In: International Conference on
Machine Learning (ICML) (2012)
26. Suganthan, P.N., Hansen, N., Liang, J.J., Deb, K., Chen, Y.P.,
Auger, A., Tiwari, S.: Problem Definitions and Evaluation Criteria
for the CEC 2005 Special Session on Real-Parameter Optimization.
Tech. rep., Nanyang Technological University, Singapore (2005)
27. Sun, Y., Wierstra, D., Schaul, T., Schmidhuber, J.: Efficient Natural Evolution Strategies. In: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (GECCO). https://doi.org/10.1145/1569901.1569976 (2009)
28. Theodorou, E., Buchli, J., Schaal, S.: A Generalized Path Integral
Control Approach to Reinforcement Learning. The Journal of
Machine Learning Research (2010)
29. Wang, J.M., Fleet, D.J., Hertzmann, A.: Optimizing walking
controllers. ACM Trans. Graph. (TOG) 28(5), 168 (2009)
30. Wierstra, D., Schaul, T., Peters, J., Schmidhuber, J.: Fitness
Expectation Maximization. In: International Conference on
Parallel Problem Solving from Nature, pp. 337–346. Springer
(2008)
Abbas Abdolmaleki obtained his B.Sc. (2009) and M.Sc. (2011) in Computer Engineering, in the field of Artificial Intelligence, from the University of Isfahan (Iran). He is currently a research scientist at Google DeepMind and a Ph.D. student in a joint PhD program at the Universities of Minho, Aveiro and Porto (Portugal). His thesis topic is information-theoretic stochastic search. He has worked on simulated rescue robots and simulated humanoid robots and achieved different ranks in RoboCup competitions, including 2 world championships. His main research interests include stochastic search for black-box optimization, policy search for robotics, and multi-agent systems.
David Simões obtained an M.Sc. (2015) in Computer and Telematics Engineering from the University of Aveiro, Portugal, and is currently a Ph.D. student in a joint PhD program at the Universities
of Minho, Aveiro and Porto (Portugal). His thesis topic is on
learning coordination in multi-agent systems. He has worked on
simulated humanoid robots and achieved different ranks in Robocup
competitions including 3 world championships, and has worked in
robotic and simulated maze-solving competitions, winning several
national Micro-Rato competitions. His main research interests include
multi-agent systems, deep learning, and game theory.
Nuno Lau is Assistant Professor at Aveiro University, Portugal and
Researcher at the Institute of Electronics and Informatics Engineering
of Aveiro (IEETA), where he leads the Intelligent Robotics and
Systems group (IRIS). He got his Electrical Engineering Degree from
Oporto University in 1993, a DEA degree in Biomedical Engineering
from Claude Bernard University, France, in 1994 and the PhD from
Aveiro University in 2003. His research interests are focused on
Intelligent Robotics, Artificial Intelligence, Multi-Agent Systems and
Simulation. Nuno Lau participated in more than 15 international
and national research projects, having the tasks of general or local
coordinator in about half of them. Nuno Lau won more than 50
scientific awards in robotic competitions, conferences (best papers)
and education. He has lectured courses at Phd and MSc levels
on Intelligent Robotics, Distributed Artificial Intelligence, Computer
Architecture, Programming, etc. Nuno Lau is the author of more than
160 publications in international conferences and journals. He was
President of the Portuguese Robotics Society from 2015 to 2017, and
is currently the Vice-President of this Society.
Luís Paulo Reis is an Associate Professor at the Faculty of Engineering of the University of Porto in Portugal and Director of LIACC - Artificial Intelligence and Computer Science Laboratory at the same
University. He is an IEEE Senior Member and he was president
of the Portuguese Society for Robotics and is vice-president of
the Portuguese Association for Artificial Intelligence. During the
last 25 years, he has lectured courses on Artificial Intelligence,
Intelligent Robotics, Multi-Agent Systems, Simulation and Modelling,
Games and Interaction, Educational/Serious Games and Computer
Programming. He was the principal investigator of more than 10
research projects in those areas. He won more than 50 scientific awards
including wining more than 15 RoboCup international competitions
and best papers at conferences such as ICEIS, Robotica, IEEE
ICARSC and ICAART. He supervised 20 PhD and 102 MSc theses
to completion and is supervising 8 PhD theses. He organized more
than 50 international scientific events and belonged to the Program
Committee of more than 250 scientific events. He is the author of
more than 300 publications in international conferences and journals
(indexed at SCOPUS or ISI Web of Knowledge).
Gerhard Neumann is a Professor of Robotics & Autonomous
Systems in College of Science at the University of Lincoln. Before
coming to Lincoln, he has been an Assistant Professor at the TU
Darmstadt from September 2014 to October 2016 and head of the
Computational Learning for Autonomous Systems (CLAS) group.
Before that, he was Post-Doc and Group Leader at the Intelligent
Autonomous Systems Group (IAS) also in Darmstadt under the
guidance of Prof. Jan Peters. Gerhard obtained his Ph.D. under the supervision of Prof. Wolfgang Maass at the Graz University of Technology. Gerhard has already authored 50+ peer-reviewed papers,
many of them in top ranked machine learning and robotics journals
or conferences such as NIPS, ICML, ICRA, IROS, JMLR, Machine
Learning and AURO. He is principal investigator for the National Center for Nuclear Robotics (NCNR) in Lincoln, which is an EPSRC RAI Hub, and he is also leading an Innovate UK project on tomato picking. In Darmstadt, he is principal investigator of the EU H2020 project RoMaNS and acquired DFG funding. He organized several workshops
and is area chair for conferences such as NIPS and CoRL.
Affiliations

Abbas Abdolmaleki¹ · David Simões¹ · Nuno Lau¹ · Luís Paulo Reis² · Gerhard Neumann³,⁴

David Simões
david.simoes@ua.pt

Nuno Lau
nunolau@ua.pt

Luís Paulo Reis
lpreis@fe.up.pt

Gerhard Neumann
neumann@ias.tu-darmstadt.de

¹ IEETA - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Aveiro, Portugal
² LIACC - Artificial Intelligence and Computer Science Laboratory, University of Porto, Porto, Portugal
³ CLAS - Computational Learning for Autonomous Systems, Technische Universität Darmstadt, Darmstadt, Germany
⁴ University of Lincoln, Lincoln, UK