
Contextual Direct Policy Search

2019, Journal of Intelligent and Robotic Systems

Journal of Intelligent & Robotic Systems, https://doi.org/10.1007/s10846-018-0968-4

Contextual Direct Policy Search With Regularized Covariance Matrix Estimation

Abbas Abdolmaleki · David Simões · Nuno Lau · Luís Paulo Reis · Gerhard Neumann

Received: 14 December 2017 / Accepted: 3 December 2018 / © Springer Nature B.V. 2019

Abstract
Stochastic search and optimization techniques are used in a vast number of areas, ranging from refining the design of vehicles, determining the effectiveness of new drugs, and developing efficient strategies in games, to learning proper behaviors in robotics. However, they specialize in the specific problem they are solving, and if the problem's context changes even slightly, they cannot adapt properly. In fact, they require complete re-learning in order to perform correctly in new, unseen scenarios, regardless of how similar these are to previously learned environments. Contextual algorithms have recently emerged as solutions to this problem. They learn the policy for a task that depends on a given context, such that widely different contexts belonging to the same task are learned simultaneously. That being said, the state-of-the-art proposals of this class of algorithms converge prematurely and cannot compete with algorithms that learn a policy for a single context. We describe the Contextual Relative Entropy Policy Search (CREPS) algorithm, which belongs to the aforementioned class of contextual algorithms. We extend it with a technique that substantially improves its performance, and we call the result Contextual Relative Entropy Policy Search with Covariance Matrix Adaptation (CREPS-CMA). We propose two variants, and demonstrate their behavior on a set of classic contextual optimization problems and on complex simulated robot tasks.

Keywords: Multi-task learning · Stochastic policy search · Contextual learning · Covariance matrix adaptation

This paper is an extended version of an ICARSC 2016 paper [5]. The second author is supported by Fundação para a Ciência e a Tecnologia under grant PD/BD/113963/2015. The work was also partially funded by the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 and by FCT - Portuguese Foundation for Science and Technology under projects PEst-OE/EEI/UI0027/2013 and UID/CEC/00127/2013 (IEETA), and by project EuRoC, reference 608849 from call FP7-2013-NMP-ICT-FOF. Corresponding author: Abbas Abdolmaleki, abbas.a@ua.pt. Extended author information is available on the last page of the article.

1 Introduction

A black-box objective function which depends on a high-dimensional parameter vector can be optimized by gradient-free black-box optimizers, such as stochastic search machine learning algorithms. By black-box, we mean that we do not have access to the function or its properties, such as its analytical gradient or second-order information. We can only sample from this function, which may take a significant amount of time and may have a high degree of variance. Therefore, we need algorithms that do not require gradient information or make assumptions on the form of the function (e.g., being linear or quadratic), but instead evaluate the objective function value for a specific parameter vector through the return of an episode. In the context of robotics, the return is computed by generating a roll-out or trajectory (also known as an episode) on the real robot platform by following the control policy of the robot with a given parameter vector.
The trajectory is subsequently evaluated by an objective function. For example, an episode would be the entire experiment of kicking a ball, while its return would be the distance traveled by the ball, which we want to maximize. One class of methods, called stochastic search algorithms, works by refining populations of candidate solutions (known as individuals) to a given problem. Initially, a population of random individuals is generated. Each candidate is then evaluated, and the algorithm selects, breeds and mutates a new generation of individuals based on their evaluation. The new generation replaces the older one, and the process is repeated. This class of algorithms tries to optimize a set of parameters and maintains a distribution over them. Samples are drawn from this search distribution and then tested with respect to their performance. The distribution is then updated based on the samples and their corresponding evaluations. There are many different update rules, such as path integrals [25, 28], gradient based updates [23, 27], cross-entropy methods [19], information-theoretic policy updates [1, 2, 18], or evolutionary strategies [12]. However, most of these algorithms only optimize for a fixed objective function, and multi-task learning is not supported by many of them. In such a setting, one can naively learn one task at a time independently. However, if the tasks are similar, we can exploit the correlation of the tasks to speed up the learning substantially. We use a context vector to describe a task; it changes between different executions, but remains constant throughout each task's episode. Our goal, known as contextual learning, is to learn a policy that can handle multiple contexts for the same task. In robotics, we can consider controllers for robotic walking, where different speeds on different surfaces or inclinations require different models to be learned, or ball-kicking behaviors, where the physical properties of the ball or the kick's target (distance, accuracy, etc.) may change. As another example, consider learning to retrieve relevant ads, given a context vector which describes features of the current user, or rating search results according to the user profile and search query. This allows a further degree of customization in on-line services. For contextual learning and generalization between tasks, one could independently optimize for several target contexts, for example optimizing a ball kick for different distances (contexts). Optimized contexts could then be generalized to new, unseen contexts through regression methods [21, 29]. However, doing this for a number of different unseen contexts is time-consuming as well as sample-inefficient, since no information is shared between contexts to speed up the learning process. Learning a set of parameters for multiple simultaneous tasks is known as contextual policy search [16, 18]. Such concurrent learning enables the algorithm to transfer knowledge across the context space, which speeds up learning of the desired context-dependent policy. Such a policy allows us to adapt quickly to a new situation by generalizing learned tasks from similar contexts to the new context. Several algorithms have succeeded in the multi-task learning field, such as contextual policy search algorithms that are based on information theory [22].
One of these algorithms, named Contextual Relative Entropy Policy Search (CREPS) [9, 18], maintains a Gaussian search distribution and iteratively computes its parameters (mean and covariance matrix). However, we will show that the search distribution might collapse prematurely to a point estimate before finding a good solution, due to the covariance matrix update rule of CREPS. This results in premature convergence, which prevents the algorithm from being fully effective and competitive. Other stochastic search algorithms [12, 27], such as CMA-ES and NES, or commonly used policy search methods [17, 25], typically do not suffer from premature convergence, although they also lack multi-task learning capabilities. Therefore, to solve the premature convergence problem of CREPS, we combine CREPS with the rank-μ covariance matrix update rule of CMA-ES [12]. CMA-ES does not have the multi-task learning feature, despite being considered state of the art in stochastic search and being highly competitive with other stochastic optimizers. In this paper, we propose a new contextual stochastic search algorithm, inspired by CREPS and CMA-ES, which can learn multiple tasks without premature convergence. We name our proposal Contextual Relative Entropy Policy Search with Covariance Matrix Adaptation (CREPS-CMA) and we evaluate two of its variants on standard contextual optimization problems and on robotics problems in simulated environments. We will demonstrate how the premature convergence issue is solved by CREPS-CMA, which outperforms the original CREPS by orders of magnitude.

1.1 Problem Statement

Stochastic search is a class of methods for optimization, for example, finding a set of parameters for a humanoid robot controller such that the robot walks as fast as possible. We consider an objective function R(θ) : R^n → R and we optimize it for the parameter vector θ ∈ R^n. We assume our objective function is a black box, and therefore the only accessible information is the reward obtained for a given parameter vector. In other words, we do not assume analytical first-order or second-order information. More concretely, our goal is to find the best solution θ* in a search space S ⊆ R^n, i.e.,

θ* = argmax_{θ∈S} R(θ),

where R : R^n → R is our objective function (also known as the fitness function). Considering our robot walking example, the search point θ would be the parameters of the robot's walking controller, and R(θ) is the speed the robot achieves when using the parameters θ. So far we considered a setting where the objective function is fixed and the solution is a point estimate θ* in the search space. Now we consider a setting where the objective function can change depending on the context of the task. Considering our robot locomotion task, in the contextual setting we are interested not only in the controller parameters for the highest possible speed, but in controller parameters for any feasible speed s. In such a setting the solution is not a single point θ but a function of the context s, i.e., m(s). In the contextual setting, we represent the context with a vector s, and we aim to find a function that, for each given context s, outputs the optimal parameter vector θ for an objective function R(θ, s) which depends on the context. In this paper, we consider contextual optimization problems where the objective function depends on an m-dimensional context vector s.
For each possible context vector s drawn from some unknown context distribution, we would like to compute the optimal parameter vector θ*_s, such that the objective function R(s, θ) : R^m × R^n → R is optimized. Since we assume a continuous context space, our goal is to find an optimal policy m*(s). We assume R(θ, s) is a black box and therefore we only have access to the evaluations {R^[k]}_{k=1...N} of samples {s^[k], θ^[k]}_{k=1...N}, where k denotes the sample index and N the number of samples. In the context of robotics, the reward (objective value) of a parameter vector is computed by generating a roll-out or trajectory on the real robot platform by following the control policy of the robot with the given parameter vector. The reward for the trajectory is then given by the sum of the immediate rewards collected at each time step throughout the trajectory.

1.2 Contextual Stochastic Search

In this section we explain the general procedure followed by contextual stochastic search algorithms. Contextual stochastic search algorithms maintain a conditional search distribution π(θ|s) over the parameter space. In this paper, we model the search distribution π(θ|s) as a linear Gaussian policy, i.e.,

π(θ|s) = N(θ | m_π(s), Σ_π),   (1)

where m_π(s) is a context-dependent mean function and Σ_π is a context-independent full covariance matrix. The function m_π(s) is our current estimate of the context-dependent policy, and Σ_π effectively defines the search distribution's exploration over the parameter space. Please note that we use a full covariance matrix, which enables us to model the correlation of parameters. We use a linear model for the mean function, m_π(s) = A_π^T ϕ(s), where A_π is the matrix of coefficients for the linear terms and ϕ(s) is an arbitrary feature function of the context s. If we select the feature function as ϕ(s) = [1, s], the generalization is linear across the contexts. Alternatively, one could use non-linear feature functions, such as radial basis functions (RBF) [7], which allow for non-linear generalization over contexts. Such controllers can represent more complex policies, for example for a robot locomotion task [4].

In the contextual setting, in each iteration we use the current search distribution q(θ|s) to generate samples θ^[k] of the parameter vector θ given context samples¹ s^[k]. Next, we use the objective function R(θ, s) to evaluate R^[k] for each {s^[k], θ^[k]}. Subsequently, a weight d^[k] for each sample k is computed from the samples {s^[k], θ^[k], R^[k]}_{k=1...N}. Finally, the new search distribution π(θ|s) (a Gaussian distribution) is estimated using {s^[k], θ^[k], d^[k]}_{k=1...N}. The algorithm repeats this procedure iteratively until it converges to a solution. Algorithm 1 shows an exemplary pseudo-code for a stochastic search algorithm with contextual capabilities.

Algorithm 1 Contextual stochastic search algorithm
  Repeat
    Set q(θ|s) to π(θ|s)
    Generate context samples {s^[k]}_{k=1...N}
    Sample parameters {θ^[k]}_{k=1...N} from the current search distribution q(θ|s) given the context samples {s^[k]}_{k=1...N}
    Evaluate the reward R^[k] of each sample in the sample set {s^[k], θ^[k]}_{k=1...N}
    Use the data set {s^[k], R^[k]}_{k=1...N} to compute a weight d^[k] for each sample
    Use the data set {s^[k], θ^[k], d^[k]}_{k=1...N} to update the new search distribution π(θ|s)
  Until the search distribution π(θ|s) converges

The goal is to find a new coefficient matrix A_π and a new full covariance matrix Σ_π in each iteration.
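To make the generic loop of Algorithm 1 concrete, the following Python sketch shows one possible realization of a contextual stochastic search iteration with a linear-Gaussian search distribution. It is a minimal illustration under our own assumptions, not the authors' implementation: the callables objective, compute_weights and fit_distribution are placeholders standing in for the CREPS-CMA updates derived in the following sections, and the uniform context sampler mirrors the simplification stated in footnote 1.

```python
import numpy as np

def contextual_search(objective, compute_weights, fit_distribution,
                      n_dim, ctx_dim, n_samples=50, n_iters=100):
    """Generic contextual stochastic search loop (sketch of Algorithm 1)."""
    # Linear-Gaussian search distribution: mean = A^T phi(s), shared covariance.
    A = np.zeros((ctx_dim + 1, n_dim))            # coefficients for phi(s) = [1, s]
    Sigma = np.eye(n_dim)                         # context-independent exploration

    for _ in range(n_iters):
        # 1) Draw contexts (here uniform, as in the paper's experiments).
        S = np.random.uniform(0.0, 3.0, size=(n_samples, ctx_dim))
        Phi = np.hstack([np.ones((n_samples, 1)), S])   # features [1, s]
        # 2) Sample parameters from q(theta|s) = N(A^T phi(s), Sigma).
        noise = np.random.multivariate_normal(np.zeros(n_dim), Sigma, size=n_samples)
        Theta = Phi @ A + noise
        # 3) Evaluate the black-box objective for every (s, theta) pair.
        R = np.array([objective(t, s) for t, s in zip(Theta, S)])
        # 4) Weight the samples and refit the search distribution.
        d = compute_weights(R, Phi)               # e.g. CREPS weights, Section 3.1
        A, Sigma = fit_distribution(Phi, Theta, d, A, Sigma)   # Section 3.2
    return A, Sigma
```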
Therefore, in the following sections, we explain the update rules for obtaining the distribution parameters.

¹ Context samples generally depend on the task and come from the environment. However, for simplicity we use a uniform context distribution throughout this paper.

2 Related Work

For generalizing a learned parameter vector across the context space, a naive solution is to choose several contexts and find a solution for each one by running an optimization algorithm for each of them independently. Subsequently, given the contexts and their optimal parameters, we can use supervised learning and regression methods to fit a context-dependent function that generalizes the optimized contexts to new, unseen contexts [8, 24]; for example, [8] uses such a method. Although such approaches have been shown to be successful to some extent, they cannot reuse data points obtained from optimizing a task with context s to improve and accelerate the optimization of a task with another context s′. The reason is that such methods assume that learning parameters for the contexts and generalizing between them are two independent processes. Contextual learning, i.e., learning multiple tasks and generalizing between them simultaneously without restarting the learning process, is known as contextual (multi-task) policy search [10, 16, 18]. Among these methods, information-theoretic contextual policy search algorithms [22], such as the episodic Contextual Relative Entropy Policy Search (CREPS) algorithm [9, 18], have been shown to be successful for multi-task learning without restarting. On the application side, CREPS has successfully optimized a humanoid robot walking controller for different speeds [4]. However, while CREPS has been shown to be effective, it also suffers from premature convergence, which limits its practical use. Recently, a contextual learning algorithm based on the (1+1)-CMA-ES [14] algorithm was proposed in [11]. The main problem of this approach is that it is only applicable to one-dimensional context spaces, which significantly limits its applicability.

3 Contextual Relative Entropy Policy Search with Covariance Matrix Adaptation

Relative Entropy Policy Search (REPS) [22] is a reinforcement learning method that tries to attain maximal expected reward while bounding the amount of information loss. It allows an exact policy update and uses data generated while following an unknown policy to generate a new, better policy. An information-theoretic contextual stochastic search algorithm was recently presented [18] as a modification of the REPS algorithm. Despite showing good results to some extent, we will demonstrate that the algorithm suffers from premature convergence. By using a new update rule, based on the rank-μ covariance matrix adaptation method of the CMA-ES algorithm, we can solve this issue, which results in the CREPS-CMA algorithm. Starting with a random distribution q(θ|s), we use it to create a set of unweighted samples {θ^[k]}_{k=1...N}, which are then evaluated in the environment to create the dataset {s^[k], θ^[k], R^[k]}_{k=1...N}. Afterwards, our method computes a weight d^[k] for each element of the dataset, and these weights are used to estimate a new search distribution π(θ|s). The following sections describe how the weights are found, and how they are used to compute the new search distribution.
3.1 Weight Computation

CREPS-CMA uses the same method as CREPS [18] to calculate the weights for each sample. CREPS assumes all current samples have the same probability and then changes the weights of the samples such that the expected reward is optimized. However, simply optimizing the expected reward based on the samples would assign a probability of 1 to the sample with the best reward value and zero to all the others. To prevent this, CREPS limits the relative entropy between the previous sample-based search distribution q(θ|s) and the next one, π(θ|s), such that the learning process remains stable and evolves smoothly. More formally, CREPS solves the following optimization program in each iteration

max_π ∫ μ(s) ∫ π(θ|s) R_{sθ} dθ ds,
s.t. ∫ μ(s) KL(π(θ|s) || q(θ|s)) ds ≤ ε,
∀s: 1 = ∫ π(θ|s) dθ,   (2)

where μ(s) represents the context distribution, which depends on the current task, R_{sθ} is the expected reward when the parameter vector θ is evaluated in the environment for context s, and KL(π(θ|s) || q(θ|s)) is the Kullback-Leibler divergence from q(θ|s) to π(θ|s). This problem can be solved in closed form through the use of Lagrangian multipliers [6], and the closed-form solution for the policy π(θ|s) is given by

π(θ|s) ∝ q(θ|s) exp(R_{sθ}/η),   (3)

where η is a Lagrangian multiplier that sets the temperature of the soft-max distribution given in the previous equation. The temperature parameter η can be found efficiently by optimizing the dual function

g(η) = ηε + η ∫ μ(s) log( ∫ q(θ|s) exp(R_{sθ}/η) dθ ) ds.   (4)

By minimizing the dual function g(η) subject to η > 0, we obtain the optimal value for η [6]. However, we would need many samples {θ^k_0, θ^k_1, ...} for each context s^k to approximate the log-integral in the dual function (4). This is not feasible, as the context can often not be controlled directly, and we only have access to a single parameter vector θ^k per context s^k. Therefore, CREPS reformulates the performance criterion to tackle this issue. Instead of optimizing for the policy π(θ|s), CREPS optimizes for the joint probabilities p(s, θ). Additionally, CREPS uses the constraints ∀s: p(s) = μ(s) to enforce that the marginal p(s) = ∫ p(s, θ) dθ still reproduces the context distribution μ(s) from which the context samples are drawn. However, this results in an infinite number of constraints, as the context is continuous. To solve this problem, CREPS matches feature expectations instead of matching single probabilities, i.e.,

∫ p(s) φ(s) ds = φ̂,   (5)

where φ̂ = ∫ μ(s) φ(s) ds is the expected feature vector for the given context distribution μ(s) and a given feature space φ. For example, if the feature vector φ(s) contains the linear and squared terms [s, s²] of the context vector s, in practice we are matching the first and second moments of the distribution p(s) with the moments of the context distribution μ(s).² Note that φ can be different from ϕ. After these considerations, the new optimization program is

max_p ∫∫ p(s, θ) R_{sθ} ds dθ
s.t. ε ≥ KL(p(s, θ) || μ(s) q(θ|s)),
φ̂ = ∫∫ p(s, θ) φ(s) ds dθ,
1 = ∫∫ p(s, θ) ds dθ.   (6)

The solution for p(s, θ) is now given by

p(s, θ) ∝ q(θ|s) μ(s) exp((R_{sθ} − V(s))/η),   (7)

where V(s) = φ(s)^T w can be considered a context-dependent baseline which is subtracted from the return R_{sθ}. The parameters w and η are again Lagrangian multipliers that can be obtained by optimizing the dual function, given as

g(η, w) = ηε + φ̂^T w + η log( ∫∫ μ(s) q(θ|s) exp((R_{sθ} − φ(s)^T w)/η) dθ ds ).   (8)

This policy update results in a weight

d^[k] = exp((R_{sθ} − V(s))/η)   (9)

for each sample [s^[k], θ^[k]], which we can use to estimate a new search distribution π(θ|s).

² In this paper, in all experiments we use a feature vector that contains all the linear and squared terms of the context vector, i.e., φ(s) = [s, s²], which corresponds to matching the first and second moments of the distributions.
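As a concrete illustration of the weight computation, the sketch below minimizes a sample-based version of the dual (8) and then evaluates the weights (9). It is a hedged example rather than the authors' code: the sample-based dual follows the form shown in Algorithm 2, the empirical estimate of the expected feature vector, the use of scipy.optimize.minimize with L-BFGS-B, and the value of epsilon are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def creps_weights(R, Phi_ctx, epsilon=1.0):
    """Compute CREPS sample weights from returns R[k] and context features Phi_ctx[k]."""
    N, f_dim = Phi_ctx.shape
    phi_hat = Phi_ctx.mean(axis=0)              # empirical expected context features

    def dual(params):
        eta, w = params[0], params[1:]
        adv = (R - Phi_ctx @ w) / eta           # (R - V(s)) / eta
        adv_max = adv.max()                     # log-sum-exp trick for stability
        log_mean_exp = adv_max + np.log(np.mean(np.exp(adv - adv_max)))
        return eta * epsilon + phi_hat @ w + eta * log_mean_exp

    x0 = np.concatenate(([1.0], np.zeros(f_dim)))
    bounds = [(1e-8, None)] + [(None, None)] * f_dim    # enforce eta > 0
    res = minimize(dual, x0, method="L-BFGS-B", bounds=bounds)
    eta, w = res.x[0], res.x[1:]
    d = np.exp((R - Phi_ctx @ w) / eta)         # weights of Eq. (9)
    return d / d.sum()                          # normalize so the weights sum to 1
```

In practice one would typically also return η and w, for example to monitor whether the KL bound ε is respected across iterations.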
Please note that the optimization program solved by CREPS is a convex optimization problem, which can be seen by checking that the second derivatives of the objective and of the constraints with respect to p(s, θ) are non-negative. For the KL constraint we have

∂²/∂p(s,θ)² [ p(s,θ) log( p(s,θ) / (μ(s)q(θ|s)) ) ] = 1/p(s,θ) ≥ 0.

For the objective, the integrand p(s, θ) R_{sθ} is linear in p(s, θ) for every fixed (s, θ), since the return R_{sθ} does not depend on p(s, θ); its second derivative with respect to p(s, θ) is therefore zero. Hence the program maximizes a linear objective over a convex constraint set.

Algorithm 2 CREPS-CMA
  Input: data set D = {s^[k], θ^[k], R^[k]}_{k=1...N} and the old search distribution with mean function m_q(s) = A_q^T ϕ(s) and covariance matrix Σ_q
  Compute the weights d^[k] for each sample:
    1- Optimize the dual function g to find the optimal η and w,
       g(η, w) = ηε + φ̂^T w + η log( (1/N) Σ_{k=1}^N exp((R^[k] − φ(s^[k])^T w)/η) ).
    2- Compute the weights d^[k] = exp((R^[k] − φ(s^[k])^T w)/η).
    3- Normalize the weights d^[k] such that Σ_{k=1}^N d^[k] = 1.
  Compute the new mean function m_π(s):
    Use weighted maximum likelihood to estimate the parameters A_π of the new mean function,
       A_π = (Φ^T D Φ + λ_r I)^{-1} Φ^T D U,
    where Φ^T = [ϕ^[1], ..., ϕ^[N]] contains the context feature vectors of all samples, U = [θ^[1], ..., θ^[N]] contains all sampled parameter vectors, D is the diagonal weighting matrix containing the weights d^[k], and λ_r I is a small ridge regularization term.
  Compute the sample covariance S_q:
       S_q = Σ_{k=1}^N d^[k] (θ^[k] − A_q^T ϕ(s^[k]))(θ^[k] − A_q^T ϕ(s^[k]))^T / Z,
       Z = ( (Σ_{k=1}^N d^[k])² − Σ_{k=1}^N (d^[k])² ) / Σ_{k=1}^N d^[k].
  Compute the number of effective samples φ_eff and the interpolation factor λ:
       φ_eff = 1 / Σ_{k=1}^N (d^[k])²,   λ = min(1, φ_eff/n²).
  Compute the new covariance matrix:
       Σ_π = (1 − λ) Σ_q + λ S_q.
3.2 Search Distribution Update Rule

In this section we propose update rules to calculate the context-dependent mean function m_π of the search distribution, as well as the context-independent covariance matrix Σ_π. Given the weights obtained in the previous section for each context-parameter pair, {s^[k], θ^[k], d^[k]}_{k=1...N}, and the old Gaussian search distribution, we want to find the new search distribution π(θ|s) = N(θ | m_π(s) = A_π^T ϕ(s), Σ_π) by finding A_π and Σ_π. At this point we can use a supervised learning method to fit the new distribution. Therefore, similarly to contextual REPS, we can directly use a weighted maximum likelihood estimate to obtain the new distribution, i.e.,

argmax_{Σ_π, A_π} Σ_{k=1}^N d^[k] log π(θ^[k] | s^[k]; Σ_π, A_π).   (10)

The maximum likelihood estimate gives us update rules for both the mean function and the covariance matrix. However, maximum likelihood fitting of the covariance matrix leads to over-fitting and therefore to premature convergence. We will explain how we solve this problem. First we present the update rule for the mean function.

3.2.1 Context-Dependent Mean-Function Update Rule

In order to obtain the context-dependent mean function m_π, we directly solve the maximum likelihood estimation problem in Eq. 10 for A_π. It is a weighted linear regression problem and the solution for A_π can be obtained in closed form, which is given by

A_π = (Φ^T D Φ + λ_r I)^{-1} Φ^T D U,   (11)

where Φ^T = [ϕ^[1], ..., ϕ^[N]] contains the context feature vectors of all samples, U = [θ^[1], ..., θ^[N]] contains all sampled parameter vectors, D is the diagonal weighting matrix containing the weights d^[k], and λ_r I is a small ridge regularization term.

3.2.2 Covariance Matrix Update Rule

In practice, again similarly to CREPS, we could also simply solve the optimization program in Eq. 10 for Σ_π. In this case, the solution Σ_π = S can be obtained in closed form. This solution is also known as the sample covariance matrix S, because it is purely based on the samples, and is given by

S = Σ_{k=1}^N d^[k] (θ^[k] − A_π^T ϕ(s^[k]))(θ^[k] − A_π^T ϕ(s^[k]))^T / Z,
Z = ( (Σ_{k=1}^N d^[k])² − Σ_{k=1}^N (d^[k])² ) / Σ_{k=1}^N d^[k].   (12)

CREPS directly uses this covariance update rule. However, this solution is typically over-fitted. The reason is that the number of free parameters of the covariance matrix is typically much larger than the number of available samples, so the samples cannot fully span the parameter space. Fitting these samples therefore typically causes over-fitting, and we can argue that the sample covariance matrix of Eq. 12 estimates the true covariance matrix poorly [3]. In other words, fitting this limited number of samples causes a decrease of exploration along the many dimensions of the parameter space that are not present in our samples. This is the main reason why CREPS suffers from premature convergence. In order to obtain a competitive contextual stochastic search algorithm, this loss of exploration after each distribution fitting step has to be controlled. We therefore need to limit the change of the new covariance matrix with respect to the old covariance matrix. This constraint maintains exploration along the different dimensions of the parameter space. Such bounding has been used by the CMA-ES algorithm, which is a non-contextual algorithm. Inspired by CMA-ES, we use a convex combination of the old covariance matrix Σ_q and the sample covariance matrix S from Eq. 12, i.e.,

Σ_π = (1 − λ) Σ_q + λ S.   (13)

The interpolation factor λ ∈ [0, 1] controls the information loss of the new covariance matrix by defining how far the new covariance matrix moves from the old covariance matrix Σ_q towards the sample covariance matrix S. This factor lets the algorithm define how much information from the new samples is incorporated into the covariance matrix, and it avoids over-fitting the new samples by bounding the change of the new covariance matrix. Note that if we set λ = 1, we recover the update rule used by CREPS. The factor λ can be set in different ways. For example, we can choose λ such that the entropy of the new search distribution is reduced by a certain amount [3]. For CREPS-CMA, however, we use an update rule similar to the rank-μ update of the CMA-ES algorithm [12], i.e.,

λ = min(1, φ_eff/n²),   φ_eff = 1 / Σ_{k=1}^N (d^[k])²,   (14)

where φ_eff is the number of effective samples and n is the dimension of the parameter space θ.
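The mean-function update (11) is ordinary weighted ridge regression and can be written in a few lines. The sketch below is a minimal illustration under our own naming; the ridge factor corresponds to the small regularizer λ_r I, and its magnitude is an assumption, not a value given in the paper.

```python
import numpy as np

def update_mean_function(Phi, Theta, d, ridge=1e-8):
    """Weighted ridge regression for A_pi (Eq. 11).

    Phi   : (N, f) context feature matrix, rows phi(s^[k])
    Theta : (N, n) sampled parameter vectors theta^[k]
    d     : (N,)   normalized sample weights
    """
    D = np.diag(d)
    f = Phi.shape[1]
    A = np.linalg.solve(Phi.T @ D @ Phi + ridge * np.eye(f), Phi.T @ D @ Theta)
    return A                      # mean function: m_pi(s) = A.T @ phi(s)
```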
In order to calculate the sample covariance matrix in Eq. 12, we can use either the new mean function or the old mean function of the current search distribution. Using the new mean function m_π to calculate the sample covariance matrix in Eq. 12 increases the probability of reproducing the samples currently in the dataset. The distribution therefore shrinks in each iteration in order to cover those weighted samples. In other words, we obtain a distribution that tends to reproduce the current samples, which may still cause premature convergence. We would instead prefer the new distribution to increase the probability of the selected steps, as opposed to the selected samples, i.e., we prefer to repeat the steps that led to well-rewarded samples, instead of reproducing the samples we already have in the dataset. That is, we are interested in repeating the mutations that resulted in the good samples in our data set. To achieve this, we can simply use the old mean function m_q to compute the sample covariance matrix, i.e.,

S_q = Σ_{k=1}^N d^[k] (θ^[k] − A_q^T ϕ(s^[k]))(θ^[k] − A_q^T ϕ(s^[k]))^T / Z,   (15)

where Z is given in Eq. 12. By using the old mean, we in fact encode the information about the steps taken in the last iteration into the new covariance matrix. As these steps are weighted, the new distribution with the new covariance matrix will repeat the successful steps, while the unsuccessful steps will be discarded. Please note that the weights of the samples define the amount of success of a step: steps with larger weights have a higher probability of being repeated. We will use both the new mean function and the old mean function, and compare the resulting variants of the algorithm. The first variant uses the mean calculated for the current iteration, m_π, to obtain the new covariance matrix, see Eq. 13, and is referred to as CREPS-CMA_Curr. The second variant, referred to as CREPS-CMA_Old, uses the old mean function m_q, i.e.,

Σ_π = (1 − λ) Σ_q + λ S_q.   (16)

We will compare both approaches and provide an empirical analysis showing that using the old mean function effectively avoids premature convergence, while using the new mean can still result in premature convergence. See Algorithm 2 for a compact representation of the CREPS-CMA algorithm.

4 Interpretation of the Regularized Covariance Matrix Update Rule

So far, we motivated the regularization of the covariance matrix with intuitive arguments. In this section, we propose a KL-regularized objective that allows us to derive the covariance matrix update rule from a single principle. We obtain such an update rule by maximizing the likelihood of the weighted steps subject to a KL-divergence penalty between the new and old search distributions, which avoids over-fitting the samples, i.e.,

argmax_Σ J = Σ_{k=1}^N d^[k] log π(θ^[k]|s^[k])   (incorporates successful steps, J1)
             − γ Σ_{k=1}^N KL(π_old(θ|s^[k]) || π(θ|s^[k]))   (avoids over-fitting, J2).

In this objective, γ > 0 trades off maximizing the likelihood of the weighted steps against limiting the information loss between the new and old search distributions. If we set γ to zero, we obtain the maximum likelihood objective without any penalty on information loss, which, as discussed, leads to over-fitting and premature convergence. If we use a Gaussian distribution as the underlying sampling policy, this KL-regularized objective can be solved in closed form, and the solution is the regularized covariance update rule proposed in the previous section. Next we explain how we solve this objective.
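To make the two covariance updates concrete, the following sketch computes the sample covariance around either the new or the old mean function and blends it with the old covariance using the interpolation factor of Eq. 14. It is a minimal illustration with our own function and argument names; it assumes the weights have already been normalized as in Algorithm 2.

```python
import numpy as np

def update_covariance(Phi, Theta, d, A_mean, Sigma_old):
    """Regularized covariance update (Eqs. 12-16).

    A_mean is A_pi for the CREPS-CMA_Curr variant (new mean, Eq. 13)
    or A_q for the CREPS-CMA_Old variant (old mean, Eq. 16).
    """
    n = Theta.shape[1]
    diff = Theta - Phi @ A_mean                    # theta^[k] - A^T phi(s^[k])
    Z = (d.sum() ** 2 - np.sum(d ** 2)) / d.sum()  # normalizer from Eq. 12
    S = (diff * d[:, None]).T @ diff / Z           # weighted sample covariance
    phi_eff = 1.0 / np.sum(d ** 2)                 # effective number of samples
    lam = min(1.0, phi_eff / n ** 2)               # rank-mu style factor, Eq. 14
    return (1.0 - lam) * Sigma_old + lam * S       # convex combination, Eq. 13/16
```

For CREPS-CMA_Old one passes the coefficient matrix of the previous iteration, so the outer products measure the steps taken from the old mean rather than the residuals around the freshly fitted mean.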
Given samples {s^[k], θ^[k], d^[k]}_{k=1...N} and a linear multivariate normal distribution, i.e., π(θ|s) = N(θ | m(s), Σ), we maximize the objective J to obtain a new covariance matrix Σ. As in this objective we are only interested in the covariance matrix, we set the mean function of the objective to the old one, i.e., m(s) = m_old(s). As discussed in the previous sections, with this simple choice we find a covariance matrix Σ that optimizes the likelihood of the weighted steps, which considerably reduces the risk of premature convergence, as will be shown in the experiments. First, we calculate the gradient of J with respect to Σ^{-1}. Note that there are two terms in the objective, so we take the gradient of each term (J1, J2) separately, i.e., ∇_{Σ^{-1}} J = ∇_{Σ^{-1}} J1 − ∇_{Σ^{-1}} J2. For the first term we have

J1 = Σ_{k=1}^N d^[k] log π(θ^[k]|s^[k]) = const − (1/2) log det Σ − (1/2) tr( Σ^{-1} Σ_{k=1}^N d^[k] (θ^[k] − m_old(s^[k]))(θ^[k] − m_old(s^[k]))^T ),

where const collects the terms that do not depend on Σ, and det(Σ) and tr(Σ) denote the determinant and the trace of the matrix Σ, respectively. The gradient is

∇_{Σ^{-1}} J1 = (1/2) Σ − (1/2) Σ_{k=1}^N d^[k] (θ^[k] − m_old(s^[k]))(θ^[k] − m_old(s^[k]))^T.

For the second term,

J2 = γ Σ_{k=1}^N KL(π_old(θ|s^[k]) || π(θ|s^[k])) = γ Σ_{k=1}^N [ const + (1/2) tr(Σ^{-1} Σ_old) + (1/2)(m_old(s^[k]) − m_old(s^[k]))^T Σ^{-1} (m_old(s^[k]) − m_old(s^[k])) + (1/2) ln( det Σ / det Σ_old ) ],

where const is the constant term of the KL divergence between two Gaussians; the quadratic term vanishes because both distributions share the mean function m_old(s). The gradient of J2 is

∇_{Σ^{-1}} J2 = (γ/2) Σ_old − (γ/2) Σ.

Please note that the covariance matrix in our setup is context independent, which is why the gradient does not depend on the context distribution. To find the optimal Σ, we set the derivative of the KL-regularized objective to zero, i.e., ∇_{Σ^{-1}} J1 − ∇_{Σ^{-1}} J2 = 0, which gives

(1/2) Σ − (1/2) Σ_{k=1}^N d^[k] (θ^[k] − m_old(s^[k]))(θ^[k] − m_old(s^[k]))^T − (γ/2) Σ_old + (γ/2) Σ = 0.

Rearranging the terms, we obtain

Σ = γ/(1+γ) Σ_old + 1/(1+γ) Σ_{k=1}^N d^[k] (θ^[k] − m_old(s^[k]))(θ^[k] − m_old(s^[k]))^T.

The two coefficients in this update rule sum to 1, i.e., γ/(1+γ) + 1/(1+γ) = 1. Defining λ = 1/(1+γ) and 1 − λ = γ/(1+γ), and rewriting the equation for Σ, we get

Σ = (1 − λ) Σ_old + λ Σ_{k=1}^N d^[k] (θ^[k] − m_old(s^[k]))(θ^[k] − m_old(s^[k]))^T.

This is exactly the regularized covariance matrix update rule discussed in the previous section. Please note that, as γ ≥ 0, we can immediately infer that λ ≤ 1.

5 Experiments

We now demonstrate and compare the performance of our algorithms, i.e., CREPS-CMA_Curr and CREPS-CMA_Old, against state-of-the-art methods. We use three different environments to evaluate our algorithms exhaustively: a set of mathematical functions, chosen for their complexity and non-linear landscapes; a robotic arm motion control problem, applicable in real-world contexts; and a high-dimensional simulated humanoid kick, which was later integrated into the FCPortugal3D team, participating in the worldwide RoboCup 3D Simulation League. The first environment consists of a set of standard optimization test functions [13, 20, 26, 30], including Schwefel's Problem and the Rosenbrock function. The functions were extended to the contextual paradigm, and the optimization target is the optimal 15-dimensional parameter vector θ for a 1-dimensional context s.
The results can be seen in Figs. 1 and 2, where CREPS-CMA successfully learned the contextual tasks, while standard Contextual REPS suffered from premature convergence. The second environment consists of a robotic arm with five joints that must reach a certain target, dependent on the task's context. A complex but comprehensive way to represent the arm's movements is through the use of dynamic movement primitives (DMPs) [15], using five basis functions for each joint and totaling a 25-dimensional parameter vector. The task's context, in other words the point which must be reached by the robotic arm, is a 2-dimensional position. Figure 4 shows the setup of the robot and the optimization results. The third environment is a simulated humanoid ball kick, which was split into two different tasks. One of the tasks focuses on precision, and the context defines a 2-dimensional point where the ball should stop. The remaining task focuses on flexibility, where the agent must kick the ball as far as possible, but the ball's initial position with respect to the robot varies and is defined by a 2-dimensional context. A linear interpolation model is used to define the robot's motion, by defining the initial and final robot positions, as well as the time taken to perform the movement. Figure 5 shows an example of the humanoid's movements and the performance results for several contexts. Figure 6 shows examples of possible ball positions relative to the robot, the range of possible positions, and the results for several contexts.

Fig. 1 Performance comparison of CREPS (red), CREPS-CMA_Old (blue) and CREPS-CMA_Curr (green). The y-axis is the error (in logarithmic scale) and the x-axis is the number of iterations elapsed. Results are shown for the optimization of the contextual version of the standard functions a Rosenbrock, b Sphere, c Shifted Sphere, d Shifted Schwefel's, e CigTab, f Tablet, g Elliptic, and h an Elliptic variant. The results show that Contextual REPS suffers from premature convergence; CREPS-CMA_Old solves the problem, despite being slower than CREPS-CMA_Curr.

Fig. 2 Performance comparison of CREPS (red), CREPS-CMA_Old (blue) and CREPS-CMA_Curr (green). The y-axis is the error (in logarithmic scale) and the x-axis is the number of iterations elapsed. Results are shown for the optimization of the contextual version of the standard functions a Different Powers, b Plane, c Two Axes, d Cigar, e Rastrigin's, f Parabolic Ridge, and g Sharp Ridge. The results show that Contextual REPS suffers from premature convergence and CREPS-CMA_Old solves the problem, despite being slower than CREPS-CMA_Curr.

Figures 1 and 2 show the average, as well as the standard deviation, of the optimization results for the first series of tasks. The results are shown in a logarithmic or linear scale, over 5 trials for each experiment. Figure 4 shows the results for the planar reaching task.
Figure 5c shows the average and two times the standard deviation of the results over 10 trials for the first humanoid kick task. Figure 6c shows the average kick distance over 10 trials for the second humanoid kick task. In the next sections we analyze the results for each environment in detail. In general, we will show that our newly proposed algorithms, i.e., CREPS-CMA_Curr and CREPS-CMA_Old, achieve state-of-the-art results and successfully avoid premature convergence.

5.1 Standard Optimization Test Functions

In this section, we measured the performance of our proposed algorithms on fifteen popular and challenging optimization functions [13, 20, 26, 30], as given in Table 1. These functions are originally non-contextual. We therefore contextualized them by choosing x = θ + As, where A is a constant matrix chosen randomly. Since our context s is 3-dimensional, A is a p × 3 matrix. This definition of x means that the optimal θ for these functions is linearly dependent on the given context s. The initial search area of θ for all experiments is restricted to the hypercube −5 ≤ θ_i ≤ 5, i = 1, ..., p, and contexts are uniformly sampled from the interval 0 ≤ s_i ≤ 3, i = 1, ..., z, where z is the dimension of the context space s. In our experiments, the mean of the initial distribution was chosen randomly in the defined search area. In each iteration, we generated 50 new samples and compared both versions of CREPS-CMA, CREPS-CMA_Curr and CREPS-CMA_Old, against the original Contextual REPS. The results are shown in Figs. 1 and 2, where CREPS-CMA converged to the solutions, as opposed to Contextual REPS, which converged prematurely to a poor solution. We can also see that CREPS-CMA_Curr speeds up the convergence process in some cases, but leads to premature convergence in others. CREPS-CMA_Old, on the other hand, does not converge prematurely, despite sometimes not being as fast. While CREPS-CMA_Old is sometimes slower than CREPS-CMA_Curr, it robustly solves all tasks, whereas CREPS-CMA_Curr can also suffer from premature convergence. Both algorithms, however, outperform the original CREPS by orders of magnitude. We also demonstrate the applicability and performance of CREPS-CMA_Old for several combinations of context and problem dimension. Figure 3 shows the number of samples needed to learn four functions, namely the Rosenbrock, Sphere, Shifted Schwefel, and Elliptic functions. The results show, as expected, that if the problem or context dimensionality increases, the learning process becomes more complex and a larger number of samples is needed. The results also show the applicability of CREPS-CMA_Old to high-dimensional problems (up to 64-dimensional problems and 5-dimensional contexts) (Fig. 4).

5.2 Planar Reaching

This environment consists of a 5-joint robotic arm, controlled with DMPs, that has to reach a certain point in space. Each segment of the robot's arm has a length of 1 meter.
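The contextualization scheme x = θ + As from Section 5.1 is easy to reproduce. The sketch below builds a contextual version of the Sphere function as an illustration; the dimensions, the range of the random matrix A, and the random seed are our own choices, and the objective is returned negated so that it fits the maximization convention used throughout the paper.

```python
import numpy as np

def make_contextual_sphere(p=15, z=3, seed=0):
    """Contextualize the Sphere benchmark via x = theta + A s (Section 5.1)."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(p, z))      # fixed random p x z matrix

    def objective(theta, s):
        x = theta + A @ s                         # optimal theta is -A s, linear in s
        return -np.sum(x ** 2)                    # negate: higher reward is better
    return objective

# Example usage: evaluate a random parameter vector for a random context.
objective = make_contextual_sphere()
theta = np.random.uniform(-5, 5, size=15)
s = np.random.uniform(0, 3, size=3)
print(objective(theta, s))
```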
The arm's first target is a waypoint v_50, which must be reached by the arm's end effector within 50 time steps, and the second target is a point v_100 = [5, 0], to be reached by time step 100. The first point's coordinates are the task's context, ranging between the points [0, 0] and [2, 2]. We modeled the task's reward with a quadratic cost term for the distance to the two target points, as well as quadratic costs for high accelerations to punish jerky movements and energy consumption. We used 5 basis functions per joint for the DMPs, while the goal attractor for reaching the final state was assumed to be known, totaling 25 dimensions in our parameter vector. We generated 100 new samples per iteration.

Table 1 The 15 optimization functions used to compare the performance of the algorithms, where x = θ + As
  Name | Function | Stop
  Rosenbrock | Σ_{i=1}^{p−1} [100(x_{i+1} − x_i²)² + (1 − x_i)²] | 10^{-5}
  Sphere | Σ_{i=1}^{p} x_i² | 10^{-5}
  Shifted sphere | Σ_{i=1}^{p} x_i² | 10^{-5}
  Shifted Schwefel | Σ_{i=1}^{p} (Σ_{j=1}^{i} x_j)² | 10^{-5}
  CigTab | x_1² + 10^8 x_p² + 10^4 Σ_{i=2}^{p−1} x_i² | 10^{-5}
  Tablet | (1000 x_1)² + Σ_{i=2}^{p} x_i² | 10^{-5}
  Shifted rotated high conditioned elliptic 6 | Σ_{i=1}^{p} (10^6)^{(i−1)/(p−1)} x_i² | 10^{-5}
  Shifted rotated high conditioned elliptic 4 | Σ_{i=1}^{p} (10^4)^{(i−1)/(p−1)} x_i² | 10^{-5}
  Different powers | Σ_{i=1}^{p} |x_i|^{2+10(i−1)/(p−1)} | 10^{-5}
  Plane | x_1 | −10000
  Two Axes | Σ_{i=1}^{⌊p/2⌋} 10^6 x_i² + Σ_{i=⌊p/2⌋+1}^{p} x_i² | 10^{-5}
  Cigar | x_1² + Σ_{i=2}^{p} (1000 x_i)² | 10^{-5}
  Rastrigin's multimodal | 10p + Σ_{i=1}^{p} [x_i² − 10 cos(2πx_i)] | 10^{-5}
  Sharp ridge | x_1 + 100 √(Σ_{i=2}^{p} x_i²) | −10000
  Parabolic ridge | x_1 + 100 Σ_{i=2}^{p} x_i² | −10000

Fig. 3 The number of samples needed by CREPS-CMA_Old to converge to the target values shown in Figs. 1 and 2 for the functions a Rosenbrock, b Sphere, c Schwefel, and d Elliptic. The x-axis represents the context dimension, from 1 to 5, and the y-axis represents the problem dimension, from 2 to 64.

Fig. 4 a Algorithmic comparison for a planar reaching task (5 joints, 25 parameters). In this task, CREPS-CMA_Old converges faster and learns the task well. Contextual REPS suffers from premature convergence and cannot learn the task. b The planar reaching task used for our comparisons. A 5-link planar robot has to reach a waypoint v_50 = [1, 1] in task space. The waypoint position is the 2-dimensional context vector and is given. The waypoint is indicated by the red cross. The postures of the resulting motion are shown as an overlay, where darker postures indicate a posture which is closer in time to the waypoint.

Fig. 5 a The initial position of an exemplary humanoid kick. b The final position of an exemplary humanoid kick. c The performance of the learned linear (blue) and non-linear (red) policies. The y-axis represents the distance of the ball from the intended target, in meters, while the x-axis represents the distance from which the ball was kicked, also in meters.

Fig. 6 a A possible ball position, close to the agent. b A possible ball position, far from the agent. c The range of possible ball positions, relative to the agent.
The results show that CREPS-CMA_Old successfully learns the task without premature convergence and significantly outperforms the original Contextual REPS.

Fig. 7 The median kick distances of the policies learned by CREPS and CREPS-CMA_Old over 10 trials. The x-axis represents the ball position xB and the y-axis the ball position yB, both in relation to the agent's center. Warmer colors represent longer distances traveled by the ball, and it is clear that CREPS-CMA_Old outperforms CREPS over the possible range of ball positions. a Kick distances of the CREPS policy. b Kick distances of the CREPS-CMA policy.

Fig. 8 Two exemplary movement sequences for a kick where a, c the ball is close to the agent and b, d it is far from the agent. a The first part of a movement sequence to kick a ball close to the agent. b The first part of a movement sequence to kick a ball far from the agent. c The last part of a movement sequence to kick a ball close to the agent. d The last part of a movement sequence to kick a ball far from the agent.

5.3 Humanoid Kick

The third and final environment consists of two separate tasks, modeled in a simulated soccer environment for humanoid robots. The first task focuses on precision. The ball is in a fixed initial location with respect to the agent, and the agent is given a 1-dimensional context describing the distance the ball should travel, which lies within the [3, 12] meter interval. The motion controller is a linear interpolator between an initial position (an l-dimensional vector of joint angles), a final position (an l-dimensional vector of joint angles), and the time t between positions. The agent has 6 joints per leg, and the remaining joints are ignored, giving 12 joint dimensions for each position and a final 25-dimensional parameter vector. Figure 5a and b show the initial and final positions of an exemplary kick. The reward function was modeled as

R(θ, s) = −(x − s)² − y²,

where x and y are the distances traveled by the ball along the X and Y axes. This function penalizes deviations from the target distance s. Based on previous work [4], we used radial basis features to generalize over non-linear contexts, and generated 20 samples per iteration. Using CREPS-CMA_Old, we achieved high accuracy, as shown in Fig. 5c,³ after 1000 iterations, with an average error distance of 0.34 ± 0.11 meters.

The second task focuses on flexibility. The ball is in a context-dependent initial location with respect to the agent, bound within the box shown in Fig. 6c, where xB ranges from 0.15 to 0.3 meters and yB from −0.15 to 0.15 meters. These values were chosen based on the agent's architecture and movement capabilities: smaller values would cause the agent's body to overlap with the ball, while larger values would put the ball out of the agent's reach. The goal of this task is to kick the ball as far as possible. To speed up training, we train only one of the agent's legs and mirror the joints for the other leg. Figure 6a and b show exemplary ball positions in relation to the agent. The reward function was modeled as

R(θ, s) = x² − y²,

where x and y are the distances traveled by the ball along the X and Y axes. This function rewards distance traveled and penalizes sideways deviation.

³ A demonstration video is available on-line at https://www.dropbox.com/s/bl27w9uqe7qh1sd/ICARSC16kick.mp4.
We generated 100 new samples per iteration, and compared Contextual REPS and CREPS-CMA_Old after 700 iterations, with the results shown in Fig. 7.⁴ The original algorithm achieved an average kick distance of 2.67 ± 2.69 meters, while our proposal achieved 6.50 ± 2.95 meters. We show two distinct kick motions in Fig. 8. Figure 8a and c show the sequence for a kick where the context is xB = 0.15 m and yB = 0.05 m, while Fig. 8b and d show a kick with the context xB = 0.25 m and yB = 0.02 m.

⁴ A demonstration video of CREPS-CMA_Old is available on-line at https://www.dropbox.com/s/uoyetyxt1slonhh/vidKicks2.wmv.

6 Conclusion

Many optimization algorithms have been proposed by the community. However, most of these algorithms optimize a fixed task with a single context, such as optimizing for the lowest energy consumption, the ideal gait for the highest speed, or both. Stochastic search methods, and in particular CMA-ES, have shown many successes on such single-context black-box optimization problems. Although stochastic search algorithms are well studied, in this paper we study contextual stochastic search algorithms, where we optimize for a context-dependent function instead of a point estimate. We built on a previous state-of-the-art algorithm for contextual stochastic search, i.e., CREPS, and a non-contextual stochastic search method, i.e., CMA-ES. While CREPS offers contextual learning, it systematically suffers from premature convergence. On the other hand, CMA-ES does not have contextual learning capabilities, but effectively avoids premature convergence. We therefore introduced a contextual stochastic search algorithm that has the best of both worlds, i.e., contextual learning and avoidance of premature convergence. In this paper, inspired by CMA-ES, we alleviated the premature convergence problem of contextual REPS, which resulted in two variants, the CREPS-CMA_Old and CREPS-CMA_Curr algorithms; one variant uses the old mean and the other uses the new mean. We performed an exhaustive evaluation using three different environments, ranging from standard functions to complex simulated robotics tasks. The results show that both algorithms perform favorably and outperform the original CREPS by orders of magnitude. Additionally, we showed that CREPS-CMA_Old solves the premature convergence issue effectively and robustly solves all the tasks. We also showed the applicability of the algorithm in practical situations, such as a humanoid robot kick task and a planar reaching task. In the future, we will investigate different ways to incorporate CMA-ES's step-size control feature into CREPS-CMA for faster convergence and less sensitivity to hyper-parameters.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Abdolmaleki, A., Lau, N., Reis, L.P., Peters, J., Neumann, G.: Contextual policy search for linear and nonlinear generalization of a humanoid walking controller. J. Intell. Robot. Syst. 10, 1-16 (2016)
2. Abdolmaleki, A., Lioutikov, R., Peters, J., Lua, N., Reis, L., Neumann, G.: Regularized Covariance Estimation for Weighted Maximum Likelihood Policy Search Methods. In: Advances in Neural Information Processing Systems (NIPS). MIT Press (2015)
3. Abdolmaleki, A., Lua, N., Reis, L., Neumann, G.: Regularized covariance estimation for weighted maximum likelihood policy search methods. In: Proceedings of the International Conference on Humanoid Robots (HUMANOIDS) (2015)
4.
Abdolmaleki, A., Lua, N., Reis, L., Peters, J., Neumann, G.: Contextual Policy Search for Generalizing a Parameterized Biped Walking Controller. In: IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC) (2015) 5. Abdolmaleki, A., Simoes, D., Lau, N., Reis, L.P., Neumann, G.: Contextual Relative Entropy Policy Search with Covariance Matrix Adaptation. In: 2016 IEEE International Conference On Autonomous Robot Systems and Competitions (ICARSC), pp. 94–99. IEEE (2016) 6. Boyd, S., Vandenberghe, L.: Convex optimization. University Press, Cambridge (2004) 7. Broomhead, D.S., Lowe, D.: Radial Basis Functions, MultiVariable Functional Interpolation and Adaptive Networks. Tech. rep., DTIC Document (1988) 8. Da Silva, B., Konidaris, G., Barto, A.: Learning parameterized skills. International Conference on Machine Learning (ICML) (2012) 9. Daniel, C., Neumann, G., Peters, J.: Hierarchical Relative Entropy Policy Search. In: International Conference on Artificial Intelligence and Statistics (AISTATS) (2012) 10. Deisenroth, M.P., Englert, P., Peters, J., Fox, D.: Multi-task Policy Search for Robotics. In: IEEE International Conference on Robotics and Automation (ICRA) (2014) 11. Ha, S., Liu, C.: Evolutionary optimization for parameterized whole-body dynamic motor skills. In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA) (2016) 12. Hansen, N., Muller, S., Koumoutsakos, P.: Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation (2003) 13. Hansen, N., Ostermeier, A.: Completely derandomized selfadaptation in evolution strategies. Evol. Comput. 9(2), 159–195 (2001) 14. Igel, C., Suttorp, T., Hansen, N.: A computational efficient covariance matrix update and a (1+ 1)-CMA for evolution strategies. In: Proceedings of the 8th annual conference on Genetic and evolutionary computation (2006) 15. Ijspeert, A., Schaal, S.: Learning Attractor Landscapes for Learning Motor Primitives. In: Advances in Neural Information Processing Systems 15(NIPS) (2003) J Intell Robot Syst 16. Kober, J., Oztop, E., Peters, J.: Reinforcement Learning to adjust Robot Movements to New Situations. In: Proceedings of the Robotics: Science and Systems Conference (RSS) (2010) 17. Kober, J., Peters, J.: Policy Search for Motor Primitives in Robotics. Mach. Learn. 8, 1–33 (2010) 18. Kupcsik, A., Deisenroth, M.P., Peters, J., Neumann, G.: DataEfficient contextual policy search for robot movement skills. In: Proceedings of the National Conference on Artificial Intelligence (AAAI) (2013) 19. Mannor, S., Rubinstein, R., Gat, Y.: The Cross Entropy method for Fast Policy Search. In: Proceedings of the 20th International Conference on Machine Learning (ICML) (2003) 20. Molga, M., Smutnicki, C.: Test Functions for Optimization Needs. In: http://www.zsd.ict.pwr.wroc.pl/files/docs/functions.pdf (2005) 21. Niehaus, C., Röfer, T., Laue, T.: Gait optimization on a humanoid robot using particle swarm optimization. In: Proceedings of the Second Workshop on Humanoid Soccer Robots in conjunction with the, pp. 1–7 (2007) 22. Peters, J., Mülling, K., Altun, Y.: Relative Entropy Policy Search. In: Proceedings of the 24th National Conference on Artificial Intelligence (AAAI). AAAI Press (2010) 23. Rückstieß, T., Felder, M., Schmidhuber, J.: State-dependent Exploration for Policy Gradient Methods. In: Proceedings of the European Conference on Machine Learning (ECML) (2008) 24. 
Stulp, F., Raiola, G., Hoarau, A., Ivaldi, S., Sigaud, O.: Learning Compact Parameterized Skills with a Single Regression. In: IEEERAS International Conference on Humanoid Robots (Humanoids) (2013) 25. Stulp, F., Sigaud, O.: Path Integral Policy Improvement with Covariance Matrix Adaptation. In: International Conference on Machine Learning (ICML) (2012) 26. Suganthan, P.N., Hansen, N., Liang, J.J., Deb, K., Chen, Y.P., Auger, A., Tiwari, S.: Problem Definitions and Evaluation Criteria for the CEC 2005 Special Session on Real-Parameter Optimization. Tech. rep., Nanyang Technological University, Singapore (2005) 27. Sun, Y., Wierstra, D., Schaul, T., Schmidhuber, J.: Efficient Natural Evolution Strategies. In: Proceedings of the 11th Annual conference on Genetic and evolutionary computation(GECCO). https://doi.org/10.1145/1569901.1569976 (2009) 28. Theodorou, E., Buchli, J., Schaal, S.: A Generalized Path Integral Control Approach to Reinforcement Learning. The Journal of Machine Learning Research (2010) 29. Wang, J.M., Fleet, D.J., Hertzmann, A.: Optimizing walking controllers. ACM Trans. Graph. (TOG) 28(5), 168 (2009) 30. Wierstra, D., Schaul, T., Peters, J., Schmidhuber, J.: Fitness Expectation Maximization. In: International Conference on Parallel Problem Solving from Nature, pp. 337–346. Springer (2008) Abbas Abdolmaleki obtained B.Sc. (2009) and M.Sc. (2011) in Computer Engineering field of Artificial Intelligence from the University of Isfahan (Iran). He is currently a research scientist at Google DeepMind and Ph.D. student in a joint PhD program at the University of Minho, Aveiro and Porto (Portugal). His thesis topic is on information theoretic stochastic search. He has worked on simulated rescue robots and simulated humanoid robot and achieved different ranks in Robocup competitions including 2 world championships. His main research interests include stochastic search for black box optimization, policy search for robotics and multi agent systems. David Simões obtained a M.Sc. (2015) in Computer and Telematics Engineering from the University of Aveiro, Portugal, and is current a Ph.D. student in a joint PhD program at the Universities of Minho, Aveiro and Porto (Portugal). His thesis topic is on learning coordination in multi-agent systems. He has worked on simulated humanoid robots and achieved different ranks in Robocup competitions including 3 world championships, and has worked in robotic and simulated maze-solving competitions, winning several national Micro-Rato competitions. His main research interests include multi-agent systems, deep learning, and game theory. Nuno Lau is Assistant Profess or at Aveiro University, Portugal and Researcher at the Institute of Electronics and Informatics Engineering of Aveiro (IEETA), where he leads the Intelligent Robotics and Systems group (IRIS). He got his Electrical Engineering Degree from Oporto University in 1993, a DEA degree in Biomedical Engineering from Claude Bernard University, France, in 1994 and the PhD from Aveiro University in 2003. His research interests are focused on Intelligent Robotics, Artificial Intelligence, Multi-Agent Systems and Simulation. Nuno Lau participated in more than 15 international and national research projects, having the tasks of general or local coordinator in about half of them. Nuno Lau won more than 50 scientific awards in robotic competitions, conferences (best papers) and education. 
He has lectured courses at Phd and MSc levels on Intelligent Robotics, Distributed Artificial Intelligence, Computer Architecture, Programming, etc. Nuno Lau is the author of more than 160 publications in international conferences and journals. He was President of the Portuguese Robotics Society from 2015 to 2017, and is currently the Vice-President of this Society. J Intell Robot Syst Luı́s Paulo Reis is an Associate Professor at the Faculty of Engineering of the University of Porto in Portugal and Director of LIACC Artificial Intelligence and Computer Science Laboratory at the same University. He is an IEEE Senior Member and he was president of the Portuguese Society for Robotics and is vice-president of the Portuguese Association for Artificial Intelligence. During the last 25 years, he has lectured courses on Artificial Intelligence, Intelligent Robotics, Multi-Agent Systems, Simulation and Modelling, Games and Interaction, Educational/Serious Games and Computer Programming. He was the principal investigator of more than 10 research projects in those areas. He won more than 50 scientific awards including wining more than 15 RoboCup international competitions and best papers at conferences such as ICEIS, Robotica, IEEE ICARSC and ICAART. He supervised 20 PhD and 102 MSc theses to completion and is supervising 8 PhD theses. He organized more than 50 international scientific events and belonged to the Program Committee of more than 250 scientific events. He is the author of more than 300 publications in international conferences and journals (indexed at SCOPUS or ISI Web of Knowledge). Gerhard Neumann is a Professor of Robotics & Autonomous Systems in College of Science at the University of Lincoln. Before coming to Lincoln, he has been an Assistant Professor at the TU Darmstadt from September 2014 to October 2016 and head of the Computational Learning for Autonomous Systems (CLAS) group. Before that, he was Post-Doc and Group Leader at the Intelligent Autonomous Systems Group (IAS) also in Darmstadt under the guidance of Prof. Jan Peters. Gerhard obtained his Ph.D. under the supervision of Prof. Wolfgang Mass at the Graz University of Technology. Gerhard already authored 50+ peer reviewed papers, many of them in top ranked machine learning and robotics journals or conferences such as NIPS, ICML, ICRA, IROS, JMLR, Machine Learning and AURO. He is principle investigator for the National Center for Nuclear Robotics (NCNR) in Lincoln which is an EPSRC RAI Hub and also leading 1 Innovate UK project on Tomato Picking. In Darmstadt, he is principle investigator of the EU H2020 project Romans and acquired DFG funding. He organized several workshops and is area chair for conferences such as NIPS and CoRL. Affiliations Abbas Abdolmaleki1 · David Simões1 · Nuno Lau1 · Luı́s Paulo Reis2 · Gerhard Neumann3,4 David Simões david.simoes@ua.pt Nuno Lau nunolau@ua.pt Luı́s Paulo Reis lpreis@fe.up.pt Gerhard Neumann neumann@ias.tu-darmstadt.de 1 IEETA - Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Aveiro, Portugal 2 LIACC - Artificial Intelligence and Computer Science Laboratory, University of Porto, Porto, Portugal 3 CLAS - Computational Learning for Autonomous Systems, Technische Universität Darmstadt, Darmstadt, Germany University of Lincoln, Lincoln, UK 4