Combining RL and MPC for Biped Walking

Aastha Mishra1, Ishita Ganjoo1, Varad Vaidya1, Prakrut Kotecha1, Shishir Kothalaya1

1 Robert Bosch Centre of Cyber-Physical Systems, Indian Institute of Science (IISc), Bengaluru. {aasthamishra, ishitaganjoo, varadmandar, prakrutpk, shishirk}@iisc.ac.in

Abstract— The fusion of Reinforcement Learning (RL) with Model Predictive Control (MPC) is a promising approach that leverages the strengths of model-based and model-free paradigms for learning complex control policies for highly non-linear, dynamic, high-dimensional systems. It provides improved sample efficiency, faster learning, and the flexibility to incorporate constraints. In this work, we study two approaches to H-step lookahead policies, viz. LOOP and TD-MPC, which combine trajectory optimization over a learned dynamics model over a horizon H with a terminal value function that accounts for future rewards. We have tested both methods on benchmark tasks and also implemented them for a custom biped. The performance of the two algorithms on Walker 2D has been compared, and the challenges in applying them to train more complex custom environments have been brought out. Another contribution of this work is a theoretical proof of an optimality bound for the value function in TD-MPC. A video of the results can be found at video.

Index Terms— Legged locomotion, Reinforcement learning, MPC

I. INTRODUCTION

RL agents need to learn from the environment while interacting with it, but large state and action spaces make it challenging to explore everything and find optimal actions. Planning with a dynamics model, which predicts future outcomes based on current states and actions, can improve efficiency in these situations. Model Predictive Control (MPC) is a planning approach that uses a dynamics model to find a sequence of controls leading to a good trajectory over a short horizon. However, these solutions are locally optimal, and obtaining globally optimal solutions requires an accurate value function (predicting future rewards) in addition to the dynamics model, which can be difficult to obtain.
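Concretely, the H-step lookahead policies studied in this work choose actions by solving, at every timestep, a problem of roughly the following form (a schematic statement only; the exact objectives used by LOOP and TD-MPC differ in their details):

    a^{*}_{t:t+H-1} = \arg\max_{a_{t:t+H-1}} \; \mathbb{E}\left[ \sum_{i=0}^{H-1} \gamma^{i}\, r(s_{t+i}, a_{t+i}) + \gamma^{H}\, V(s_{t+H}) \right]

where the learned dynamics model predicts s_{t+i+1} from (s_{t+i}, a_{t+i}) and V is the terminal value function that accounts for rewards beyond the planning horizon.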
A. Related Work

Plan Online Learn Offline (POLO) [1] combines RL and MPC by proposing a framework wherein a ground-truth dynamics model is used for local trajectory optimization while, simultaneously, an approximate global value function is used to estimate returns beyond the fixed horizon. A theoretical proof that this combination generates optimal actions has been provided. In addition, trajectory optimization is used to enable efficient exploration, which in turn is useful for gathering better global information and escaping local minima.

1) LOOP: Off-policy reinforcement learning algorithms have been widely used in many robotic applications due to their sample efficiency and ability to incorporate data from different sources [2]. As in POLO [1], LOOP also uses a terminal value function to exploit the tighter optimality bound that the H-step lookahead method can achieve. Additionally, the authors treat the model dynamics as unknown and employ a neural network to learn them via supervised learning on samples from the replay buffer. They have tested the method on multiple systems and claim significant improvements over other methods.

2) TD-MPC: Although data-driven MPC offers better performance and improved sample efficiency, it also requires an accurate model of the environment. Hence, this work proposes a task-oriented latent dynamics model that can learn the dynamics of complex systems such as the Dog and Humanoid tasks [3].

In this work, we have tested these methods on bipedal systems and summarized the contributions, limitations, advantages, and disadvantages pertaining to legged systems.

B. Contributions

In this work, we have tested the methods mentioned above (Sections I-A.1 and I-A.2) on different bipedal environments and summarized our results, observations, and analyses. We have compared the performance of both algorithms on Walker 2D and also put forth and analyzed the problems faced in applying these algorithms to more complex environments. We have also proposed a theoretical proof of an optimality bound for the value function in TD-MPC, which is presented in this report.

II. METHOD

A. LOOP

1) Training: The algorithm works by first filling a replay buffer with samples of state, action, reward, and next state. Then, it uses a model-free RL algorithm (Soft Actor-Critic in this case) to optimize the policy and learn the Q function from the replay buffer. Once the model is trained, it is used for trajectory optimization. Finally, new actions are sampled and the process repeats.
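The following is a minimal sketch of this training loop; the interfaces (env, buffer, sac, model, planner and their methods) are placeholder assumptions used for illustration, not the authors' implementation.

    def loop_training(env, buffer, sac, model, planner, total_steps, warmup_steps, batch_size):
        # Hypothetical sketch of the LOOP training cycle described above.
        obs = env.reset()
        for step in range(total_steps):
            # Collect experience: plan with the learned model after warmup,
            # otherwise fall back to the SAC policy for initial exploration.
            action = planner.plan(obs) if step > warmup_steps else sac.act(obs)
            next_obs, reward, done, _ = env.step(action)
            buffer.add(obs, action, reward, next_obs, done)
            obs = env.reset() if done else next_obs

            # Off-policy update: SAC improves the policy and Q-function from replay data.
            sac.update(buffer.sample(batch_size))

            # Supervised update of the dynamics model on (state, action) -> next-state pairs.
            model.update(buffer.sample(batch_size))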
2) Planning: The algorithm first trains a set of dynamics models by feeding them data from the replay buffer. Then, it uses a separate policy for learning the value function and another policy for collecting data. However, this mismatch between the two policies can cause problems during learning. To overcome this issue, the algorithm introduces a routine called ARC that utilizes the trained models and the RL policy together for trajectory optimization. This ensures that the data used for optimization comes from the same policy that will be used in practice, avoiding actor divergence issues.

The key idea of ARC is to constrain the action distribution of the trajectory optimization routine to be close to that of the parametrized actor, for which we set

    p_{\mathrm{prior}}^{\tau} = \beta\, \pi_\theta + (1 - \beta)\, \mathcal{N}(\mu_{t-1}, \sigma)        (1)

Therefore, the prior is a linear combination of π_θ and the action-sequence distribution N(μ_{t-1}, σ) carried over from the previous environment timestep.
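As a rough illustration of Eq. (1), the sketch below samples candidate action sequences from such a mixture prior; the function name and signature are hypothetical and not taken from the LOOP code base.

    import numpy as np

    def sample_arc_prior(sample_actor_sequence, mu_prev, sigma, beta, n_samples, horizon, act_dim):
        # With probability beta a candidate comes from the parametrized actor pi_theta,
        # otherwise from a Gaussian centred on the previous solution mu_{t-1}.
        candidates = np.empty((n_samples, horizon, act_dim))
        use_actor = np.random.rand(n_samples) < beta
        for i in range(n_samples):
            if use_actor[i]:
                candidates[i] = sample_actor_sequence()  # pi_theta branch of the mixture
            else:
                candidates[i] = mu_prev + sigma * np.random.randn(horizon, act_dim)  # N(mu_{t-1}, sigma)
        return candidates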
B. TD-MPC

1) Training: A key technical contribution of TD-MPC is its approach to learning the latent representation of the dynamics model purely from rewards, which yields better sample efficiency than state or image prediction. This also makes it agnostic to the observation modality.

The gradients are backpropagated from the reward and TD objectives over multiple rollout steps. The prediction loss in latent space enforces temporal consistency in the learned representation without explicit state or image prediction.
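Schematically, this multi-step objective can be pictured as in the sketch below: a latent rollout is unrolled with the learned dynamics, and reward-prediction, TD, and latent-consistency terms are accumulated with a decaying weight. The loss terms, coefficients, and interfaces here are assumptions for illustration and are not the exact definitions used in the paper.

    def tdmpc_model_loss(encoder, dynamics, reward_head, q_fn, q_target, policy,
                         obs_seq, act_seq, rew_seq, lam, gamma, c_rew, c_val, c_cons):
        # Unroll the latent dynamics from the first observation and accumulate the
        # reward, TD, and consistency terms over the rollout, weighted by lam**i.
        z = encoder(obs_seq[0])
        loss = 0.0
        for i in range(len(act_seq)):
            a, r = act_seq[i], rew_seq[i]
            z_next_target = encoder(obs_seq[i + 1])  # target latent from the next observation
            td_target = r + gamma * q_target(z_next_target, policy(z_next_target))
            loss += (lam ** i) * (
                c_rew * (reward_head(z, a) - r) ** 2                      # reward prediction
                + c_val * (q_fn(z, a) - td_target) ** 2                   # TD objective
                + c_cons * ((dynamics(z, a) - z_next_target) ** 2).sum()  # latent consistency
            )
            z = dynamics(z, a)  # continue the rollout in latent space
        return loss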
The deterministic RL policy is parametrized by minimizing the following loss,

    J_\pi(\theta; \Gamma) = -\sum_{i=t}^{t+H} \lambda^{i-t}\, Q_\theta(z_i, \pi_\theta(z_i))        (2)

This is similar to the policy objective commonly used in model-free actor-critic methods to derive the RL policy π_θ, and it is found to be sufficiently expressive for efficient value learning. It should be noted that, for training, the samples are generated through the MPC policy Π_Θ.
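A direct transcription of the objective in Eq. (2) is sketched below; q_fn, policy, and latent_rollout are assumed callables and sequences rather than the paper's actual interfaces.

    def tdmpc_policy_loss(q_fn, policy, latent_rollout, lam):
        # Negated, discounted sum of Q-values along the latent rollout z_t, ..., z_{t+H};
        # minimizing this loss maximizes the Q-values of the actions chosen by pi_theta.
        loss = 0.0
        for i, z in enumerate(latent_rollout):  # i = 0 corresponds to timestep t
            loss -= (lam ** i) * q_fn(z, policy(z))
        return loss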
2) Planning: The planning algorithm leverages Model Predictive Path Integral (MPPI) control, which is a stochastic optimization method. The trajectories used by MPPI are generated both from the MPC policy Π_Θ and from the RL policy π_θ.
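One MPPI-style planning step in this spirit might look like the sketch below; the sampling split between the two sources, the scoring function, and the temperature handling are illustrative assumptions rather than the exact TD-MPC procedure.

    import numpy as np

    def mppi_step(rollout_return, sample_policy_sequence, mu, sigma,
                  n_samples, n_policy, horizon, act_dim, temperature=0.5):
        # Candidates come partly from the learned policy and partly from a Gaussian
        # around the previous plan mu; each candidate is scored by rollout_return
        # (model rollout reward plus a terminal value), and the plan is updated by
        # an exponentially weighted average of the candidates.
        gaussian = mu + sigma * np.random.randn(n_samples - n_policy, horizon, act_dim)
        from_policy = np.stack([sample_policy_sequence() for _ in range(n_policy)])
        candidates = np.concatenate([gaussian, from_policy], axis=0)

        returns = np.array([rollout_return(seq) for seq in candidates])
        weights = np.exp(temperature * (returns - returns.max()))
        weights /= weights.sum()

        new_mu = np.einsum("n,nha->ha", weights, candidates)  # weighted mean action plan
        return new_mu[0], new_mu  # execute the first action; reuse new_mu to warm-start the next step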
III. DISCUSSION

Lessons: While the results we produced do not demonstrate the generalization of these algorithms, they show the direction in which RL research is moving. Researchers are trying to build algorithms that can generalize over all tasks and require less hyperparameter tuning for continuous control tasks. Having said that, as we have already shown, these algorithms still require a lot of work to reach that level of generalization. Scaling up these algorithms is non-trivial, and parallelizing these off-policy methods is crucial to achieving meaningful generalization.

Apart from these, the specific issues with TD-MPC concern actor divergence and the absence of a shrinking latent space. Using a latent space decreases interpretability but enabled success in more complex environments. However, not shrinking the latent dimension leads to more memory consumption and higher computational requirements.

Opportunities: As summarized above, a lot of work remains in these areas to reach the goal of complete generalization of algorithms. This brings us to some recent works that have attempted to address some of these problems through various methods. The newer version of TD-MPC [3], TD-MPC2 [4], has demonstrated many more tasks than TD-MPC and has also attempted to solve the scaling issues with the environment that TD-MPC faces. Although [4] claims to have solved the problem of generalization over continuous control tasks, discrete-action tasks remain an open problem for TD-MPC-style methods.

Parallelizing off-policy algorithms can have a huge impact on their performance. [5] have attempted to scale off-policy algorithms successfully, and integrating that with these algorithms could be a future direction.

Given the specific observations from TD-MPC regarding the expanding latent dimension, the suggestion given by Prof. Shishir, to use single rigid body dynamics with the augmented latent dimensions used only for encoding residual dynamics or extra information, could be a future direction.

IV. CONCLUSIONS

In this project, we have analyzed and compared two H-step lookahead methods under a learned model and value function. We tested them on different bipedal systems and derived some significant inferences that would be beneficial for future work along these lines. We have also proposed a theorem to find the optimality bound corresponding to TD-MPC. More details about the same are available in the slides.

ACKNOWLEDGMENTS

We would like to thank Professor Shishir, Manan Tayal, Aditya Shirwatkar, Naman Saxena, and GVS Mothish for their valuable insights during this project.

REFERENCES

[1] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch, “Plan online, learn offline: Efficient learning and exploration via model-based control,” in International Conference on Learning Representations, 2018.
[2] H. Sikchi, W. Zhou, and D. Held, “Learning off-policy with online planning,” in Conference on Robot Learning. PMLR, 2022, pp. 1622–1633.
[3] N. Hansen, X. Wang, and H. Su, “Temporal difference learning for model predictive control,” arXiv preprint arXiv:2203.04955, 2022.
[4] N. Hansen, H. Su, and X. Wang, “TD-MPC2: Scalable, robust world models for continuous control,” arXiv preprint arXiv:2310.16828, 2023.
[5] Z. Li, T. Chen, Z.-W. Hong, A. Ajay, and P. Agrawal, “Parallel Q-learning: Scaling off-policy reinforcement learning under massively parallel simulation,” in International Conference on Machine Learning. PMLR, 2023, pp. 19440–19459.
