1 Introduction

Policy search methods, which are widely adopted in high-dimensional robotic problems with continuous states and actions, learn a policy directly from a reward function as an alternative to value function-based reinforcement learning [1]. Classic policy gradient algorithms such as REINFORCE [2] and GPOMDP [3] suffer from high variance in the gradient estimates because a stochastic policy injects noise at every time step.

Sehnke et al. proposed Policy Gradients with Parameter-based Exploration (PGPE) [4] to solve this problem by evaluating deterministic policies whose parameters are sampled from a prior distribution described by policy hyper parameters. Using a deterministic policy reduces the variance of the performance gradient with respect to the hyper parameters, and the deterministic policy does not need to be differentiable. PGPE outperformed classical policy gradient algorithms, but it still requires learning rate tuning for gradient ascent.

Peters and Schaal developed an EM-based Reinforcement Learning framework [5] that maximizes a lower bound of the objective function. For suitable choices of the policy's probability distribution, the maximization has a closed-form solution.

Here, we propose a novel policy search algorithm called EM-based Policy Hyper Parameter Exploration (EPHE) that combines PGPE and an EM-based strategy to search for optimal hyper parameters of deterministic policies without the need for gradient computation or learning rate tuning.

This method was designed in our smartphone robot project [6–8] for practical and efficient policy parameter learning. The project aims to develop a low-cost platform for multi-agent research. In the first stage, we developed a two-wheeled balancer as a single agent to evaluate various control and learning algorithms. In a real robot system, a deterministic policy is preferred over a stochastic policy to avoid unexpected behaviors, and because Parameter-based Exploration does not require differentiability, it allows a wider choice of control architectures. Because learning without a huge number of experiences is also critical in hardware experiments, efficient updating without learning rate tuning is a highly favored feature of EM-based learning. We compare the EPHE method with PGPE [4] and the Finite Difference method [3] in the benchmark tasks of pendulum swing-up with limited torque and cart-pole balancing, and in our two-wheeled smartphone robot simulator. The results show that EPHE outperforms these previous policy search methods.

2 Learning method

2.1 REINFORCE, PGPE and EM-based algorithms

We assume a standard discrete-time Markov Decision Process (MDP) setting. At each time step t, an agent takes an action \(u_{t}\) based on a state \(x_{t}\) according to the policy \(\pi (u_{t} |x_{t} ,\theta )\) parameterized by a vector \(\theta\). The environment makes a transition to a next state \(x_{t + 1}\) according to \(p(x_{t + 1} |x_{t} ,u_{t} )\) and gives a scalar reward \(r_{t}\) to the agent. We denote a state-action-reward sequence as \(h = [x_{1} ,u_{1} ,r_{1} , \ldots ,x_{T} ,u_{T} ,r_{T} ,x_{T + 1} ]\). The goal of reinforcement learning is to find the parameter \(\theta\) that maximizes an objective function defined as the agent’s expected reward

$$J\left( \theta \right) = \int {p(h|\theta )R(h){\text{d}}h}$$
(1)

where \(R(h)\) is the cumulative reward of the sequence \(h\), and \(p(h|\theta )\) is the probability of observing \(h\). Under the Markovian environmental assumption, \(p(h|\theta )\) is given by:

$$p\left( {h |\theta } \right) = p(x_{1} )\mathop \prod \limits_{t = 1}^{T} p(x_{t + 1} |x_{t} ,u_{t} )\pi (u_{t} |x_{t} ,\theta ).$$
(2)

To maximize \(J(\theta )\), one way is to estimate the gradient \(\nabla J(\theta )\) to perform gradient ascent. REINFORCE [2] obtains the gradient by estimating \(\nabla_{\theta } \log p(h|\theta )\) directly, which yields

$$\nabla_{\theta } J\left( \theta \right) = \mathop \int \limits_{H}^{{}} p\left( {h |\theta } \right)\nabla_{\theta } \log p(h|\theta )R(h){\text{d}}h.$$
(3)

Substituting Eq. (2) into Eq. (3), and noting that only the policy terms depend on \(\theta\), we have

$$\nabla_{\theta } J\left( \theta \right) = \mathop \int \limits_{H}^{{}} p(h|\theta )\mathop \sum \limits_{t = 1}^{T} \nabla_{\theta } \log \pi (u_{t} |x_{t} ,\theta )R(h){\text{d}}h.$$

Although it is not practical to integrate over the entire space of histories, we can use sampling to obtain the estimate

$$\nabla_{\theta } J(\theta ) \approx \frac{1}{N}\mathop \sum \limits_{n = 1}^{N} \mathop \sum \limits_{t = 1}^{T} \nabla_{\theta } \log \pi (u_{t}^{n} |x_{t}^{n} ,\theta )R(h^{n} )$$

where \(N\) denotes the number of histories. To reduce the variance of the gradient, \(R(h)\) can be replaced with \(R\left( h \right) - b\), where \(b\) is a reward baseline [5].
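As an illustration, the sampled estimate with a baseline can be sketched as follows. This is a minimal sketch that assumes a linear-Gaussian policy \(u \sim N(\theta^{T} x, s^{2})\) with a fixed standard deviation \(s\); the policy form, the function name `reinforce_gradient` and the default average-return baseline are assumptions made for this example and are not specified in the text.

```python
import numpy as np

def reinforce_gradient(states, actions, returns, theta, policy_std=1.0, baseline=None):
    """Sampled REINFORCE gradient for an assumed linear-Gaussian policy.

    states:  (N, T, D) states x_t^n
    actions: (N, T)    actions u_t^n
    returns: (N,)      cumulative rewards R(h^n)
    theta:   (D,)      policy parameters
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    returns = np.asarray(returns, dtype=float)
    theta = np.asarray(theta, dtype=float)
    if baseline is None:
        baseline = returns.mean()  # assumed choice: average return as baseline b
    # For u ~ N(theta^T x, s^2): grad_theta log pi = (u - theta^T x) x / s^2
    residual = actions - states @ theta                                  # (N, T)
    score = (residual[..., None] * states).sum(axis=1) / policy_std**2   # (N, D)
    # Average over the N histories of [sum_t grad log pi] * (R(h^n) - b)
    return ((returns - baseline)[:, None] * score).mean(axis=0)
```

Because every time step contributes a noisy score term, the variance of this estimate grows with the episode length, which is the issue that parameter-based exploration is meant to address.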

REINFORCE has two problems: the policy has to be differentiable with respect to the policy parameters, and evaluating a stochastic policy, which draws a new random action at every time step, can lead to high variance over entire histories.

PGPE [4] addresses these problems by considering a distribution of deterministic policies with the policy parameters \(\theta\) sampled from a prior distribution defined by the hyper parameters \(\rho\), which are typically the mean and the variance of \(\theta\). The objective function is given by

$$J\left( \rho \right) = \mathop \int \limits_{\varTheta }^{{}} \mathop \int \limits_{\rm H}^{{}} p\left( {h |\theta } \right)p\left( {\theta |\rho } \right)R\left( h \right){\text{d}}h{\text{d}}\theta .$$
(4)

Note that the variance of \(p\left( {h |\theta } \right)\) in PGPE can be kept small because a deterministic policy is adopted. After the hyper parameter vector \(\rho\) has been updated, the final deterministic policy \(\pi (u_{t} |x_{t} ,\theta )\) is obtained by setting \(\theta\) to the expectation of the prior distribution.

Differentiating Eq. (4) with respect to \(\rho\) gives us

$$\nabla_{\rho } J\left( \rho \right) = \mathop \int \limits_{\varTheta }^{{}} \mathop \int \limits_{\rm H}^{{}} p\left( {h |\theta } \right)p\left( {\theta |\rho } \right)\nabla_{\rho } \log p\left( {\theta |\rho } \right)R\left( h \right){\text{d}}h{\text{d}}\theta .$$

Again by sampling, we have the gradient estimate

$$\nabla_{\rho } J(\rho ) \approx \frac{1}{N}\mathop \sum \limits_{n = 1}^{N} \nabla_{\rho } \log p(\theta^{n} |\rho )R(h^{n} )$$

For PGPE to perform gradient ascent, a learning rate parameter has to be tuned.
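A minimal sketch of one such gradient-ascent step is shown below, assuming an independent Gaussian prior \(N(\eta_{i}, \sigma_{i}^{2})\) per parameter and using the Gaussian log-derivatives that the text gives later as Eqs. (7) and (8). The function name `pgpe_step`, the optional baseline and the two learning-rate arguments are illustrative assumptions rather than the exact procedure of [4].

```python
import numpy as np

def pgpe_step(thetas, returns, eta, sigma, lr_eta, lr_sigma, baseline=0.0):
    """One plain gradient-ascent step on the hyper parameters (eta, sigma).

    thetas:  (N, D) policy parameters theta^n sampled from N(eta, sigma^2)
    returns: (N,)   returns R(h^n) of the corresponding deterministic rollouts
    """
    thetas = np.asarray(thetas, dtype=float)
    eta = np.asarray(eta, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    w = (np.asarray(returns, dtype=float) - baseline)[:, None]
    diff = thetas - eta
    # Sample averages of grad log p(theta^n | rho) * R(h^n)
    grad_eta = (w * diff / sigma**2).mean(axis=0)
    grad_sigma = (w * (diff**2 - sigma**2) / sigma**3).mean(axis=0)
    # Both step sizes must be chosen by hand
    return eta + lr_eta * grad_eta, sigma + lr_sigma * grad_sigma
```

The two arguments `lr_eta` and `lr_sigma` make the tuning burden explicit: the size of each update depends directly on them and on the scale of the returns.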

EM-based Policy Search [9] estimates a lower bound of the expected return from histories and iteratively updates the policy parameters using an analytic solution for the maximum of the lower bound. As a result, an EM-based method has no learning rate parameter. The details of the EM-based method are given below in the context of our hyper parameter learning.

2.2 Proposed method

Here we describe our proposed method, EM-based Policy Hyper Parameter Exploration (EPHE), which integrates the features of PGPE [4] and EM-based Policy Search [9]. To establish the lower bound, we consider a new parameter distribution defined by a new hyper parameter vector \(\rho^{{\prime }}\). Under the assumption that \(R(h)\) is strictly positive, Jensen’s inequality bounds the log ratio of the two objective functions

$$\begin{aligned} { \log }\frac{{J(\rho^{{\prime }} )}}{J(\rho )} = { \log }\mathop \int \limits_{\varTheta }^{{}} \mathop \int \limits_{\rm H}^{{}} \frac{R(h)p(h|\theta )p(\theta |\rho )}{J(\rho )}\frac{{p(\theta |\rho^{{\prime }} )}}{p(\theta |\rho )}{\text{d}}h{\text{d}}\theta \hfill \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\; \ge \mathop \int \limits_{\varTheta }^{{}} \mathop \int \limits_{\rm H}^{{}} \frac{R\left( h \right)p(h|\theta )p(\theta |\rho )}{J(\rho )}{ \log }\frac{{p(\theta |\rho^{{\prime }} )}}{p(\theta |\rho )}{\text{d}}h{\text{d}}\theta \hfill \\ \end{aligned}$$

Hence, the lower bound is defined by

$$\begin{aligned} \log J_{L} (\rho^{{\prime }} ) = \log J\left( \rho \right) + \mathop \int \limits_{\varTheta }^{{}} \mathop \int \limits_{\rm H}^{{}} \frac{R(h)p(h|\theta )p(\theta |\rho )}{J(\rho )}{ \log }\frac{{p(\theta |\rho^{{\prime }} )}}{p(\theta |\rho )}{\text{d}}h{\text{d}}\theta \hfill \\ \end{aligned} .$$
(5)
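The inequality can be checked numerically on samples. The sketch below assumes a one-dimensional Gaussian prior and a toy return that depends deterministically on \(\theta\) (so the integral over \(h\) collapses); the return function, the hyper parameter values and all names are hypothetical choices for illustration only.

```python
import numpy as np

def gaussian_logpdf(theta, eta, sigma):
    """Log density of N(eta, sigma^2) evaluated at theta."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (theta - eta) ** 2 / (2.0 * sigma**2)

def log_ratio_and_bound(thetas, returns, rho, rho_new):
    """Sample-based log J(rho')/J(rho) and its Jensen lower bound.

    thetas are assumed to be drawn from p(theta | rho); rho = (eta, sigma).
    """
    # Normalized returns play the role of R(h) p(h|theta) p(theta|rho) / J(rho)
    weights = returns / returns.sum()
    log_lik_ratio = gaussian_logpdf(thetas, *rho_new) - gaussian_logpdf(thetas, *rho)
    log_ratio = np.log(np.sum(weights * np.exp(log_lik_ratio)))
    lower_bound = np.sum(weights * log_lik_ratio)
    return log_ratio, lower_bound

rng = np.random.default_rng(0)
rho, rho_new = (0.0, 1.0), (0.5, 0.8)
thetas = rng.normal(rho[0], rho[1], size=1000)
returns = np.exp(-(thetas - 1.0) ** 2)  # toy, strictly positive return
log_ratio, lower_bound = log_ratio_and_bound(thetas, returns, rho, rho_new)
```

For any choice of `rho_new` the second value never exceeds the first, and both are zero when `rho_new` equals `rho`, so increasing the bound above zero guarantees an increase of the sampled objective.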

To maximize this lower bound, the derivative of (5) with respect to \(\rho^{{\prime }}\) must equal zero

$$\nabla_{{\rho^{{\prime }} }} \log J_{L} (\rho^{{\prime }} ) = \mathop \int \limits_{\varTheta }^{{}} \mathop \int \limits_{\rm H}^{{}} \frac{R(h)p(h|\theta )p(\theta |\rho )}{J(\rho )}\nabla_{{\rho^{{\prime }} }} { \log }\frac{{p(\theta |\rho^{{\prime }} )}}{p(\theta |\rho )}{\text{d}}h{\text{d}}\theta = 0.$$

Since \(J(\rho )\) and \(p(\theta |\rho )\) do not depend on \(\rho^{{\prime }}\), this equation simplifies to

$$\mathop \int \limits_{\varTheta }^{{}} \mathop \int \limits_{\rm H}^{{}} R\left( h \right)p\left( {h |\theta } \right)p\left( {\theta |\rho } \right)\nabla_{{\rho^{{\prime }} }} { \log }p\left( {\theta |\rho^{{\prime }} } \right){\text{d}}h{\text{d}}\theta = 0.$$

Applying the sampling trick again, we have

$$\frac{1}{N}\mathop \sum \limits_{n = 1}^{N} \nabla_{{\rho^{{\prime }} }} \log p\left( {\theta^{n} |\rho^{{\prime }} } \right)R\left( {h^{n} } \right) = 0.$$
(6)

If \(p(\theta |\rho^{{\prime }} )\) belongs to an exponential family, the update rule has a closed form. In particular, we assume that \(p(\theta |\rho^{{\prime }} )\) is a product of independent Gaussian distributions \(N(\theta_{i} |\eta_{i}^{\prime} ,\sigma_{i}^{'2} )\), one for each parameter \(\theta_{i}\) in \(\theta\). The derivatives of \(\log p(\theta |\rho^{{\prime }} )\) with respect to \(\eta_{i}^{\prime}\) and \(\sigma_{i}^{\prime}\) are

$$\nabla_{{\eta_{i}^{\prime}}} \log p\left( {\theta |\rho^{\prime} } \right) = \frac{{\theta_{i} - \eta_{i}^{\prime} }}{{\sigma_{i}^{'2} }}$$
(7)
$$\nabla_{{\sigma_{i}^{\prime} }} \log p\left( {\theta |\rho^{\prime} } \right) = \frac{{(\theta_{i} - \eta_{i}^{\prime} )^{2} - \sigma_{i}^{'2} }}{{\sigma_{i}^{'3} }}$$
(8)

Substituting Eqs. (7) and (8) into Eq. (6) yields

$$\eta_{i}^{\prime} = \frac{{\mathop \sum \nolimits_{n = 1}^{N} [R\left( {h^{n} } \right)\theta_{i}^{n} ]}}{{\mathop \sum \nolimits_{n = 1}^{N} R(h^{n} )}}$$
$$\sigma_{i}^{\prime} = \sqrt {\frac{{\mathop \sum \nolimits_{n = 1}^{N} [R\left( {h^{n} } \right)\left( {\theta_{i}^{n} - \eta_{i}^{\prime} } \right)^{2} ]}}{{\mathop \sum \nolimits_{n = 1}^{N} R(h^{n} )}}}$$
(9)
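
The intermediate step can be spelled out as follows, using only Eqs. (6), (7) and (8). Inserting Eq. (7) into Eq. (6) and multiplying by \(N\sigma_{i}^{'2}\) gives

$$\mathop \sum \limits_{n = 1}^{N} R(h^{n} )\left( {\theta_{i}^{n} - \eta_{i}^{\prime} } \right) = 0\quad \Rightarrow \quad \eta_{i}^{\prime} \mathop \sum \limits_{n = 1}^{N} R(h^{n} ) = \mathop \sum \limits_{n = 1}^{N} R(h^{n} )\theta_{i}^{n} ,$$

and inserting Eq. (8) into Eq. (6) and multiplying by \(N\sigma_{i}^{'3}\) gives

$$\mathop \sum \limits_{n = 1}^{N} R(h^{n} )\left[ {\left( {\theta_{i}^{n} - \eta_{i}^{\prime} } \right)^{2} - \sigma_{i}^{'2} } \right] = 0\quad \Rightarrow \quad \sigma_{i}^{'2} \mathop \sum \limits_{n = 1}^{N} R(h^{n} ) = \mathop \sum \limits_{n = 1}^{N} R(h^{n} )\left( {\theta_{i}^{n} - \eta_{i}^{\prime} } \right)^{2} .$$

Dividing both results by \(\sum\nolimits_{n = 1}^{N} R(h^{n} )\) and taking the positive square root in the second one gives the return-weighted mean and standard deviation.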

Note that the denominator is positive because \(R(h)\) is assumed to be strictly positive, so the normalized returns act as an (improper) probability distribution that weights the parameters. To improve sampling performance, we update using only the parameters that produced the best \(K\) returns among the \(N\) trajectories.
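The update of Eq. (9) with \(K\)-elite selection can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `ephe_update`, the toy return function and the loop settings are assumptions, and a real experiment would replace `toy_return` with a rollout of the deterministic policy.

```python
import numpy as np

def ephe_update(thetas, returns, k_elite):
    """Return-weighted mean and standard deviation over the best K samples.

    thetas:  (N, D) policy parameters theta^n sampled from N(eta, sigma^2)
    returns: (N,)   strictly positive returns R(h^n)
    """
    thetas = np.asarray(thetas, dtype=float)
    returns = np.asarray(returns, dtype=float)
    if np.any(returns <= 0):
        raise ValueError("returns must be strictly positive")
    elite = np.argsort(returns)[-k_elite:]       # indices of the K largest returns
    w = returns[elite] / returns[elite].sum()    # normalized return weights
    eta_new = w @ thetas[elite]
    sigma_new = np.sqrt(w @ (thetas[elite] - eta_new) ** 2)
    return eta_new, sigma_new

def toy_return(theta):
    # Hypothetical stand-in for a rollout: strictly positive, peaked at [1, -2]
    return np.exp(-np.sum((theta - np.array([1.0, -2.0])) ** 2))

rng = np.random.default_rng(0)
eta, sigma = np.zeros(2), np.ones(2)
for _ in range(30):
    thetas = eta + sigma * rng.standard_normal((20, 2))    # N = 20 samples
    returns = np.array([toy_return(th) for th in thetas])
    eta, sigma = ephe_update(thetas, returns, k_elite=10)  # K = 10 elites
```

No gradient and no learning rate appear in the loop; the only settings are \(N\), \(K\) and the initial hyper parameters.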

3 Experiments

In this section, we compare our method EPHE with PGPE [4] and the classic policy gradient method Finite Difference (FD) [3]. For each method we use \(N = 20\) trajectories per parameter update, and our method selects the \(K = 10\) elite parameter sets for updating. The results are averaged over 20 independent runs. We plot learning curves of the mean and the standard error of the cumulative return against the number of parameter updates.
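For reference, the plotted statistics can be computed as in the sketch below. It assumes the returns are stored as one row per run and that the standard error uses the sample standard deviation (`ddof=1`); the text does not state which convention was used, and the function name is hypothetical.

```python
import numpy as np

def learning_curve_stats(returns_per_run):
    """Mean and standard error across runs for each iteration.

    returns_per_run: (num_runs, num_iterations) cumulative returns.
    """
    r = np.asarray(returns_per_run, dtype=float)
    mean = r.mean(axis=0)
    # Standard error of the mean; ddof=1 (sample standard deviation) is an assumption
    sem = r.std(axis=0, ddof=1) / np.sqrt(r.shape[0])
    return mean, sem
```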

3.1 Pendulum swing-up with limited torque

The goal of this non-linear control task is to swing the pendulum up to the upright position and keep it there as long as possible [10]. We use \(16 \times 16\) radial basis functions to represent the two-dimensional state, which consists of the angle and the angular velocity of the pendulum: \(\varvec{x} = \{ \varphi ,\dot{\varphi }\}\). The action is the torque applied to the pendulum, \(u = 5\,{\text{tanh}}\left( {\varvec{\theta}^{T} \varPhi \left( \varvec{x} \right)} \right)\), with a maximum torque of 5 [N·m], where \(\varvec{\theta}\) is the policy parameter and \(\varPhi \left( \varvec{x} \right)\) is the basis function vector. The system starts from an initial state \(\varvec{x}_{0} = \{ \varphi_{0} ,0\}\), where \(\varphi_{0}\) is randomly selected from \([ - \pi ,\pi ]\) [rad], and terminates when \(\left| {\dot{\varphi }} \right| \ge 4{{\pi }}\) [rad/s]. The sampling period is 0.02 [s] per time step, and an episode lasts at most 1000 time steps (= 20 [s]). The strictly positive reward for one history is given by

$$R\left( h \right) = \mathop \sum \limits_{t = 1}^{T} { \exp }( - x_{t}^{T} Qx_{t} - u_{t}^{T} Ru_{t} )$$
(10)

where \(Q\) and \(R\) are quadratic penalty matrices chosen by the user. For Finite Difference, the policy parameters are initialized to \(\varvec{\theta}_{0} = 0\), and the perturbations for the policy parameter update are drawn as \(\delta\varvec{\theta}\sim U\left( { - 3.46,3.46} \right)\), a uniform distribution with variance 1. For PGPE and our method EPHE, the initial hyper parameters are \(\varvec{\eta}_{0} = 0\), \(\varvec{\sigma}_{0} = 1\).
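The reward of Eq. (10) can be computed as in the following sketch, which assumes the states and actions of one history are stored as two-dimensional arrays with one row per time step. The function name and array layout are assumptions made for this example.

```python
import numpy as np

def cumulative_reward(states, actions, Q, R):
    """R(h) = sum_t exp(-x_t^T Q x_t - u_t^T R u_t), as in Eq. (10).

    states:  (T, n_x) array of states x_t
    actions: (T, n_u) array of actions u_t
    Q, R:    (n_x, n_x) and (n_u, n_u) penalty matrices
    """
    x = np.asarray(states, dtype=float)
    u = np.asarray(actions, dtype=float)
    state_cost = np.einsum("ti,ij,tj->t", x, Q, x)    # x_t^T Q x_t for every t
    action_cost = np.einsum("ti,ij,tj->t", u, R, u)   # u_t^T R u_t for every t
    return float(np.sum(np.exp(-state_cost - action_cost)))
```

If \(Q\) and \(R\) are positive semidefinite, each term lies in \((0, 1]\), so the return is strictly positive and at most \(T\), which satisfies the positivity assumption used to derive the lower bound.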

Figure 1 shows the performance of FD, PGPE and our method. We hand-tuned the learning rates for each method and found that PGPE requires separate learning rates for each hyper parameter. The best learning rates were \(\alpha = 0.1\) for FD, and \(\alpha_{\eta } = 0.001\), \(\alpha_{\sigma } = 0.0001\) for PGPE. We also show the performance of PGPE with \(\alpha_{\eta } = 0.0001\), \(\alpha_{\sigma } = 0.0001\) to illustrate its dependence on the learning rate. The proposed method learned faster and achieved better performance after 30 iterations without the need to tune a learning rate.

Fig. 1 Learning curves of the swing-up task

3.2 Cart-pole balancing

In this task, the agent aims to keep a pole balanced upright on a movable cart at the center of a track for as long as possible [11]. The state variables are the position and the velocity of the cart on the track, and the angle and the angular velocity of the pole: \(\varvec{x} = \left\{ {x,\dot{x}, \theta , \dot{\theta }} \right\}\). The action is the force applied to the cart, given by a linear parameterized policy \(u =\varvec{\theta}^{T} \varvec{x}\). We add Gaussian white noise with standard deviations of 0.001 [rad/s] and 0.01 [m/s] to the dynamics. Each episode starts from a random pole angle in \(\left[ { - 0.2, + 0.2} \right]\) [rad] and a random cart position in \(\left[ { - 0.5, + 0.5} \right]\) [m]. The target region is \(\left[ { - 0.05, + 0.05} \right]\) [rad] and \(\left[ { - 0.05, + 0.05} \right]\) [m], and the episode terminates when \(\left| x \right| \ge 2.4\) [m] or \(\left| \theta \right| \ge 0.7\) [rad]. The sampling period is 0.02 [s] per time step, and an episode lasts at most 1000 time steps (= 20 [s]). The strictly positive reward is the same as Eq. (10). The policy parameters for FD and the hyper parameters for PGPE and EPHE are initialized from reasonable prior knowledge, at some distance from the optima. The parameter perturbation for the FD controller is \(\delta\varvec{\theta}\sim U\left( { - 3.9,3.9} \right)\), a uniform distribution with variance 5. For PGPE and EPHE, \(\varvec{\eta}_{0}\) uses the same initialization as FD, with \(\varvec{\sigma}_{0} = 5\) for PGPE and \(\varvec{\sigma}_{0} = 35\) for EPHE.

Figure 2 shows the performance of FD, PGPE and our method. The best learning rates were \(\alpha = 0.01\) for FD, and \(\alpha_{\eta } = 0.001\), \(\alpha_{\sigma } = 0.0001\) for PGPE. Our method EPHE achieved faster learning without learning rate tuning.

Fig. 2 Learning curves of the cart-pole task

3.3 Two-wheeled smartphone robot

The goal of the smartphone robot project is to construct an affordable, high-performance multi-agent platform for research on robot social behaviors [6]. Even as a single agent, the robot can realize a variety of behaviors for testing and developing motor control algorithms in both control theory and Reinforcement Learning. In our previous work [7, 8], we developed a two-wheeled balancer and successfully realized standing-up and balancing behaviors with a switching control architecture that combines an optimal linear controller and a hand-tuned non-linear controller. With our new method EPHE, the robot is expected to optimize the policy parameters automatically in a more practical and efficient fashion. In this section, we compare our method with PGPE and FD in our two-wheeled smartphone robot simulator.

The state variables are the tilting angle and angular velocity of the body, and the rotation angle and angular velocity of the wheel: \(\varvec{x} = \{ \varphi ,\dot{\varphi }, \vartheta , \dot{\vartheta }\}\). The control input \(u\) is the motor torque applied to the left and right wheels. We adopt a switching framework in which a linear feedback stabilizer is selected to achieve balancing if the tilting angle of the robot body is within the range \([ - \varphi_{s} ,\varphi_{s} ]\); otherwise a destabilizer based on a central pattern generator (CPG) is applied. The policy parameters are the four-dimensional control gain vector of the linear stabilizer, the switching threshold, and two parameters of the oscillator: \(\varvec{\theta}= \{ k_{1} ,k_{2} ,k_{3} ,k_{4} ,\varphi_{s} ,\omega ,\beta \}\). Figure 3 shows the architecture. We also added Gaussian white observation noise with a standard deviation of 0.01 to the system.
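The switching rule can be sketched as below to show why the resulting policy is deterministic but not differentiable in \(\varphi_{s}\). The text does not give the oscillator equations, so the sinusoid used here for the destabilizer, the interpretation of \(\omega\) and \(\beta\) as its frequency and amplitude, and the sign convention of the feedback law are placeholders assumed only for illustration.

```python
import numpy as np

def switching_control(x, t, theta):
    """Select the stabilizer or the destabilizer from the body tilt angle.

    x     = (phi, phi_dot, vartheta, vartheta_dot)
    theta = (k1, k2, k3, k4, phi_s, omega, beta)
    """
    k = np.asarray(theta[:4], dtype=float)
    phi_s, omega, beta = theta[4], theta[5], theta[6]
    if abs(x[0]) <= phi_s:
        # Linear state feedback (sign convention assumed)
        return float(-k @ np.asarray(x, dtype=float))
    # Placeholder oscillator output; NOT the CPG model used in the paper
    return float(beta * np.sin(omega * t))
```

Because the output jumps at \(\left| \varphi \right| = \varphi_{s}\), gradient-based methods that differentiate through the policy cannot be applied directly, whereas parameter-based exploration only needs the return of each sampled parameter vector.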

Fig. 3 Switching control architecture of smartphone robot

The agent is required to start moving from the resting angle of 60°, bounce against the bumper to stand up, and finally achieve balancing. The simulation runs with a sampling period of 0.02 [s] per step, and each episode lasts at most 1000 steps (= 20 [s]). The cumulative reward is the same as Eq. (10). We initialize the parameters \(\varvec{\theta}_{0}\) and the step size \(\delta\varvec{\theta}\) for FD, and the hyper parameters \(\varvec{\eta}_{0}\) for PGPE and our method, from uniform distributions, and fix \(\varvec{\sigma}_{0}\) based on the prior knowledge we obtained in [7].

Figure 4 shows the learning performance. The best learning rates were \(\alpha_{k} = 0.0001\), \(\alpha_{{\varphi_{s} }} = 0.00001\), \(\alpha_{\omega } = 0.01\), \(\alpha_{\beta } = 0.01\) for FD, and \(\alpha_{\eta } = 0.001\), \(\alpha_{\sigma } = 0.001\) for PGPE. The success rates of each method are listed in Table 1. Our method outperformed the others after 10 iterations and achieved a more reliable performance after 20 iterations.

Fig. 4 Learning curves of the smartphone robot simulator

Table 1 Success rate of each method

We also plot the distributions of the final optimized parameters from the 20 runs in Fig. 5. FD has the most concentrated distribution of final optimized parameters because it represents the policy parameters directly, whereas PGPE and EPHE represent a distribution over the policy parameters. We pick one set of optimized hyper parameters \(\varvec{\eta}= \{ 0.0021, 0.0982, 0.2953, 0.0651, 47^{\circ} , 11.7169, 18.7890\}\) obtained by our method and sample 10 sets of policy parameters from the Gaussian prior distribution to illustrate the bouncing and stabilizing behaviors in Fig. 6. The control signals are synchronized with the angular velocity, and the switcher successfully coordinates the two controllers to achieve the expected behaviors.

Fig. 5 Distributions of the optimized parameters in the case of the smartphone robot experiment

Fig. 6 Trajectories realized by the policy of which the parameters are sampled from the optimized prior distribution

4 Discussion

Our method EPHE compares favorably with the other two methods in all the tasks, with faster convergence and a steadily higher return (Figs. 1, 2 and 4). In the smartphone robot simulator, the Finite Difference method learns better in the beginning because it searches the policy parameter space directly using one-dimensional uniform distributions, but it is easily trapped in local optima. This is illustrated in Fig. 5.

We also found that the different initial variances reflect a difference between PGPE and our method. PGPE optimizes the hyper parameters by computing gradients, in which case a smaller variance leads to a more precise approximation, whereas our method computes a return-weighted average of the sampled points, so a larger initial variance allows more exploration. The learning behavior is much improved by the K-elite selection mechanism. Because the parameters are far from the optima at the beginning of learning, frequent failures slow down the learning process. Also, updating with all the parameters (K = N) makes it easy to become trapped in local optima. Figures 7, 8 and 9 show the sensitivity to the choice of K relative to N in the three tasks. Agents with smaller K learn relatively faster but reach no better performance in the end. There is no significant difference among the settings with K < N, but agents without the selection mechanism perform much worse. Another interesting finding is that, in the Android robot case, the switching threshold does not seem to be a crucial parameter to tune. This removes the burden of making different arms for the robot body in real hardware.

Fig. 7 Sensitivity of K to N in the swing-up task

Fig. 8 Sensitivity of K to N in the cart-pole task

Fig. 9 Sensitivity of K to N in the smartphone robot task

5 Conclusion and future work

In this paper, we developed a new policy search algorithm, EM-based Policy Hyper Parameter Exploration (EPHE). We tested it on two benchmark tasks and on our two-wheeled Android phone robot simulator with a non-linear, non-differentiable controller. Our method integrates PGPE with an EM-based update that maximizes a lower bound of the expected return in each iteration of hyper parameter updating. Simulation results showed that our method outperforms policy gradient methods such as Finite Difference and PGPE even after their learning rates have been fine-tuned. The advantages of our method are: (1) the controller is deterministic, (2) it does not require the controller to be differentiable, and (3) it avoids hand-tuning of the learning rate. These features are highly favored in practical robot systems.

For future work, we will improve our method by taking into account the correlation between the parameters and by developing more sophisticated sampling methods. We note that our algorithm with a Gaussian prior is similar to the CMA-ES optimization method [12], and further comparisons with recent policy search methods such as the path integral framework [13] are also required. We will also test our method on the real robot system to realize efficient, tuning-free learning.