Devesh Jha

We propose a trust region method for policy optimization that employs a Quasi-Newton approximation for the Hessian, called Quasi-Newton Trust Region Policy Optimization (QNTRPO). Gradient descent is the de facto algorithm for reinforcement learning tasks with continuous controls. The algorithm has achieved state-of-the-art performance when used in reinforcement learning across a wide range of tasks. However, the algorithm suffers from a number of drawbacks, including the lack of a step-size selection criterion and slow convergence. We investigate the use of a trust region method with a dogleg step and a Quasi-Newton approximation of the Hessian for policy optimization. We demonstrate through numerical experiments over a wide range of challenging continuous control tasks that our particular choice is efficient in terms of the number of samples and improves performance.
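A minimal sketch of the dogleg trust-region step the abstract refers to, assuming a NumPy setting with an explicit gradient g and a positive-definite Quasi-Newton Hessian approximation B; how QNTRPO embeds this step in policy optimization (surrogate objective, KL-based trust region, Hessian updates) is not reproduced here.

import numpy as np

def dogleg_step(g, B, delta):
    # Approximately minimize g.T @ p + 0.5 * p.T @ B @ p subject to ||p|| <= delta.
    p_newton = -np.linalg.solve(B, g)               # full quasi-Newton step
    if np.linalg.norm(p_newton) <= delta:
        return p_newton
    p_cauchy = -(g @ g) / (g @ B @ g) * g           # unconstrained steepest-descent minimizer
    if np.linalg.norm(p_cauchy) >= delta:
        return -delta * g / np.linalg.norm(g)       # gradient step clipped to the boundary
    # Walk along the dogleg path p_cauchy + tau * (p_newton - p_cauchy) until it hits the boundary.
    d = p_newton - p_cauchy
    a, b, c = d @ d, 2 * p_cauchy @ d, p_cauchy @ p_cauchy - delta**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p_cauchy + tau * d

The step interpolates between a scaled gradient step (small trust radius) and the full Quasi-Newton step (large trust radius), which is what removes the need for a separate step-size selection rule.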
Time-varying network topology plays a key role in mobile sensor networks for the detection of an event of interest and the subsequent propagation of awareness within a monitoring and surveillance framework. While physical-space parameters such as communication range and mobility characteristics directly drive the network structure, feedback from the information space can be used to improve the network topology and facilitate efficient information management. In this context, the paper proposes a feedback control scheme for tuning key network topology parameters, such as the average degree and the degree distribution, under the recently proposed generalized gossip framework for distributed belief/awareness propagation in mobile sensor networks. The crux of this decentralized control policy is to modify the timelines of the asynchronous belief update protocol depending on the node-level belief/awareness. Using a proximity network representation for a mobile sensor network, the paper presents both analytical …
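As an illustration of the feedback idea only, the sketch below closes a proportional loop from a measured topology statistic (the average degree of a disc proximity network) back to a tunable parameter. The communication-radius knob and the gain are assumptions; the paper's actual control variable is the timing of the generalized-gossip belief updates, and that protocol is not reproduced here.

import numpy as np

def average_degree(positions, radius):
    # Average degree of the proximity (disc) network induced by a communication radius.
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    adjacency = (dist < radius) & ~np.eye(len(positions), dtype=bool)
    return adjacency.sum() / len(positions)

def tune_radius(positions, radius, target_degree, gain=0.05, steps=50):
    # Proportional feedback on the radius so the network tracks a target average degree.
    for _ in range(steps):
        error = target_degree - average_degree(positions, radius)
        radius = max(radius + gain * error, 0.0)    # expand when under-connected, shrink otherwise
    return radius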
We propose a trust region method for policy optimization that employs a Quasi-Newton approximation for the Hessian, called Quasi-Newton Trust Region Policy Optimization (QNTRPO). Gradient descent is the de facto algorithm for reinforcement learning tasks with continuous controls. The algorithm has achieved state-of-the-art performance when used in reinforcement learning across a wide range of tasks. However, the algorithm suffers from a number of drawbacks, including the lack of a step-size selection criterion and slow convergence. We investigate the use of a trust region method with a dogleg step and a Quasi-Newton approximation of the Hessian for policy optimization. We demonstrate through numerical experiments over a wide range of challenging continuous control tasks that our particular choice is efficient in terms of the number of samples and improves performance. Optimization Foundations for Reinforcement Learning Workshop at NeurIPS.
One of the main challenges in peg-in-a-hole (PiH) insertion tasks is handling the uncertainty in the location of the target hole. To address it, high-dimensional sensor inputs from modalities such as vision, force/torque sensing, and proprioception can be combined to learn control policies that are robust to this uncertainty in the target pose. Whereas deep learning has shown success in recognizing objects and making decisions with high-dimensional inputs, the learning procedure might damage the robot when trial-and-error algorithms are applied directly on the real system. At the same time, Learning from Demonstration (LfD) methods have been shown to achieve compelling performance on real robotic systems by leveraging demonstration data provided by experts. In this paper, we investigate the merits of multiple sensor modalities, such as vision, force/torque sensors, and proprioception, when combined to learn a controller for real-world assembly operation tasks using LfD …
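A minimal sketch of the kind of multimodal policy the abstract describes, trained by behavior cloning on demonstration tuples; the network sizes, the fusion-by-concatenation scheme, and the input dimensions below are assumptions, not the controller actually used in the paper.

import torch
import torch.nn as nn

class MultimodalPolicy(nn.Module):
    # Fuses image, force/torque (wrench), and proprioceptive (joint) inputs into one action.
    def __init__(self, action_dim=6):
        super().__init__()
        self.vision = nn.Sequential(nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                                    nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fusion = nn.Sequential(nn.Linear(32 + 6 + 7, 128), nn.ReLU(),
                                    nn.Linear(128, action_dim))

    def forward(self, image, wrench, joint_pos):
        return self.fusion(torch.cat([self.vision(image), wrench, joint_pos], dim=-1))

policy = MultimodalPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_step(image, wrench, joint_pos, expert_action):
    # One behavior-cloning update on a batch of (sensor inputs, expert action) demonstrations.
    loss = nn.functional.mse_loss(policy(image, wrench, joint_pos), expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()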
Humans quickly solve tasks in novel systems with complex dynamics, without requiring much interaction. While deep reinforcement learning algorithms have achieved tremendous success in many complex tasks, these algorithms need a large number of samples to learn meaningful policies. In this paper, we present a task of navigating a marble to the center of a circular maze. While this system is very intuitive and easy for humans to solve, it can be very difficult and inefficient for standard reinforcement learning algorithms to learn meaningful policies. We present a model that learns to move a marble in the complex environment within minutes of interacting with the real system. Learning consists of initializing a physics engine with parameters estimated using data from the real system. The error in the physics engine is then corrected using Gaussian process regression, which is used to model the residual between real observations and physics engine simulations. The physics engine equipped …
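A sketch of the residual-learning idea using a Gaussian process, assuming a placeholder simulator step and scikit-learn's GP regressor; the paper's actual maze dynamics, physics-engine parameters, and GP details are not reproduced here.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def sim_step(x, u):
    # Placeholder physics-engine prediction (x = [pos, vel], u = action), not the real engine.
    return x + 0.05 * np.concatenate([x[2:], u])

def fit_residual_model(states, actions, next_states):
    # Fit a GP on the residual between observed next states and simulator predictions.
    inputs = np.hstack([states, actions])
    residuals = next_states - np.array([sim_step(x, u) for x, u in zip(states, actions)])
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    return gp.fit(inputs, residuals)

def predict_next(gp, x, u):
    # Corrected model: simulator prediction plus the learned residual.
    return sim_step(x, u) + gp.predict(np.hstack([x, u])[None, :])[0]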
In this paper, we present algorithms for synthesizing controllers to distribute a group (possibly a swarm) of homogeneous robots (agents) over heterogeneous tasks that are operated in parallel. We present algorithms as well as analysis for global and local-feedback-based controllers for the swarm. Using the ergodicity property of irreducible Markov chains, we design a controller for global swarm control. Furthermore, to provide some degree of autonomy to the agents, we augment this global controller with a local feedback-based controller using language measure theory. We provide analysis of the proposed algorithms to show their correctness. Numerical experiments are presented to illustrate the performance of the proposed algorithms.
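One standard way to realize such a global controller is to let every agent hop between tasks according to a fixed Markov chain whose stationary distribution equals the desired swarm distribution. The Metropolis-Hastings construction below is a common illustration and an assumption, not necessarily the paper's construction; the local language-measure feedback is not shown.

import numpy as np

def metropolis_hastings_matrix(adjacency, target):
    # Transition matrix over the task graph whose stationary distribution is `target`.
    n = len(target)
    neighbors = [np.flatnonzero((adjacency[i] > 0) & (np.arange(n) != i)) for i in range(n)]
    P = np.zeros((n, n))
    for i in range(n):
        d_i = len(neighbors[i])
        for j in neighbors[i]:
            d_j = len(neighbors[j])
            P[i, j] = min(1.0, (target[j] * d_i) / (target[i] * d_j)) / d_i
        P[i, i] = 1.0 - P[i].sum()               # self-loop absorbs rejected proposals
    return P

adjacency = np.ones((4, 4))                       # fully connected task graph (illustrative)
target = np.array([0.4, 0.3, 0.2, 0.1])           # desired fraction of the swarm at each task
P = metropolis_hastings_matrix(adjacency, target)
rng = np.random.default_rng(0)
tasks = np.zeros(1000, dtype=int)                 # 1000 agents, all starting at task 0
for _ in range(200):
    tasks = np.array([rng.choice(4, p=P[t]) for t in tasks])
print(np.bincount(tasks, minlength=4) / len(tasks))   # empirical distribution approaches `target`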
Learning tasks from simulated data using reinforcement learning has proven effective. A major advantage of using simulation data for training is that it reduces the burden of acquiring real data. Specifically, when robots are involved, it is important to limit the amount of time a robot is occupied with learning so that it can instead be used for its intended (manufacturing) task. A policy learned on simulation data can be transferred to and refined on real data. In this paper, we propose to learn a robustified policy during reinforcement learning using simulation data. A robustified policy is learned by exploiting the ability to change the simulation parameters (appearance and dynamics) for successive training episodes. We demonstrate that the amount of transfer learning needed for a robustified policy is reduced when transferring from a simulated to a real task. We focus on tasks that involve real-time non-linear dynamics, since non-linear dynamics can only be approximately modeled in physics engines …
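A sketch of the randomization loop that produces such a robustified policy: the simulation parameters are re-sampled every episode so the policy cannot overfit to any single appearance/dynamics configuration. The parameter names, their ranges, and the make_env/agent interfaces are hypothetical and only stand in for the paper's setup.

import numpy as np

def sample_sim_params(rng):
    # Randomize appearance and dynamics parameters each episode (ranges are assumptions).
    return {
        "friction":   rng.uniform(0.5, 1.5),
        "mass_scale": rng.uniform(0.8, 1.2),
        "light_dir":  rng.uniform(-1.0, 1.0, size=3),
    }

def train_robustified_policy(make_env, agent, episodes=1000, seed=0):
    # Generic loop: a freshly randomized simulator per episode reduces the sim-to-real gap.
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        env = make_env(sample_sim_params(rng))    # hypothetical simulator constructor
        obs, done = env.reset(), False
        while not done:
            action = agent.act(obs)               # hypothetical RL agent interface
            obs, reward, done = env.step(action)
            agent.update(obs, action, reward, done)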
Deep reinforcement learning (RL) algorithms have recently achieved remarkable successes in various sequential decision-making tasks, leveraging advances in methods for training large deep networks. However, these methods usually require large amounts of training data, which is often a big problem for real-world applications. One natural question to ask is whether learning good representations of states and using larger networks helps in learning better policies. In this paper, we study whether increasing input dimensionality helps improve the performance and sample efficiency of model-free deep RL algorithms. To do so, we propose an online feature extractor network (OFENet) that uses neural nets to produce good representations to be used as inputs to deep RL algorithms. Even though high input dimensionality is usually supposed to make learning of RL agents more difficult, we show that the RL agents in fact learn more efficiently with the high-dimensional representation than with …
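A minimal sketch of the online-feature-extractor idea, assuming a plain MLP body, concatenation of the raw observation with the learned features, and a next-observation prediction loss as the online training signal; the published OFENet architecture and auxiliary-task details may differ.

import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    # Lifts the observation to a higher-dimensional representation for the RL agent.
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.predict_next = nn.Linear(hidden + act_dim, obs_dim)   # auxiliary prediction head

    def features(self, obs):
        # Concatenate raw observation with learned features -> higher-dimensional RL input.
        return torch.cat([obs, self.body(obs)], dim=-1)

    def aux_loss(self, obs, action, next_obs):
        # Train online by predicting the next observation from (features, action).
        pred = self.predict_next(torch.cat([self.body(obs), action], dim=-1))
        return nn.functional.mse_loss(pred, next_obs)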
This paper addresses the problem of learning dynamic models of hybrid systems from demonstrations and then the problem of imitating those demonstrations by using Bayesian filtering. A linear programming-based approach is used to develop a nonparametric kernel-based conditional density estimation technique that infers accurate and concise dynamic models of system evolution from data. The training data for these models were acquired from demonstrations by teleoperation. The trained data-driven models for mode-dependent state evolution and state-dependent mode evolution are then used online for imitation of demonstrated tasks via particle filtering. Results of simulation and experimental validation with a hexapod robot are reported to establish the generalization of the proposed learning and control algorithms.
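A sketch of the filtering side of this pipeline: each particle carries a discrete mode and a continuous state, the mode is sampled from a state-dependent transition model, the state is propagated with the mode-dependent dynamics, and particles are re-weighted by an observation likelihood. The three toy models below are assumptions standing in for the learned kernel density estimates.

import numpy as np

rng = np.random.default_rng(0)

def mode_transition_probs(state, n_modes=3):
    # Toy state-dependent mode distribution (stand-in for the learned model).
    logits = np.array([-(state[0] - m) ** 2 for m in range(n_modes)])
    return np.exp(logits) / np.exp(logits).sum()

def state_dynamics(state, mode):
    # Toy mode-dependent state evolution with process noise.
    return state + 0.1 * (mode - 1) + rng.normal(0, 0.05, size=state.shape)

def observation_likelihood(obs, state):
    return np.exp(-0.5 * np.sum((obs - state) ** 2) / 0.1 ** 2)

def particle_filter_step(states, modes, obs):
    # One predict-update-resample step over hybrid (mode, state) particles.
    for i in range(len(states)):
        probs = mode_transition_probs(states[i])
        modes[i] = rng.choice(len(probs), p=probs)
        states[i] = state_dynamics(states[i], modes[i])
    weights = np.array([observation_likelihood(obs, s) for s in states])
    weights /= weights.sum()
    idx = rng.choice(len(states), size=len(states), p=weights)   # resample
    return states[idx].copy(), modes[idx].copy()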
This paper addresses the problem of target detection in dynamic environments in a semi-supervised, data-driven setting with low-cost passive sensors. A key challenge here is to simultaneously achieve high probabilities of correct detection and low probabilities of false alarm under the constraints of limited computation and communication resources. In general, changes in a dynamic environment may significantly affect the performance of target detection due to limited training scenarios and the assumptions made on signal behavior under a static environment. To this end, an algorithm for binary hypothesis testing is proposed based on clustering of features extracted from multiple sensors that may observe the target. First, features are extracted individually from the time-series signals of different sensors by using a recently reported feature extraction tool, called symbolic dynamic filtering. Then, these features are grouped as clusters in the feature space to evaluate homogeneity …
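A rough illustration of the cluster-then-decide idea only: per-sensor features are extracted from time-series signals and clustered, and a decision is made from the cluster geometry. The symbol-histogram feature is a crude stand-in for symbolic dynamic filtering, and the separation threshold is an assumption, not the paper's hypothesis test.

import numpy as np
from sklearn.cluster import KMeans

def symbol_histogram(signal, n_symbols=8):
    # Partition the signal range into symbols and use the normalized histogram as the feature.
    edges = np.linspace(signal.min(), signal.max(), n_symbols + 1)
    symbols = np.clip(np.digitize(signal, edges) - 1, 0, n_symbols - 1)
    return np.bincount(symbols, minlength=n_symbols) / len(symbols)

def detect_target(sensor_signals, threshold=0.1):
    # Cluster per-sensor features into two groups; declare a detection when they separate.
    features = np.array([symbol_histogram(s) for s in sensor_signals])
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
    separation = np.linalg.norm(km.cluster_centers_[0] - km.cluster_centers_[1])
    return separation > threshold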
We propose a trust region method for policy optimization that employs a Quasi-Newton approximation for the Hessian, called Quasi-Newton Trust Region Policy Optimization (QNTRPO). Gradient descent is the de facto algorithm for reinforcement learning tasks with continuous controls. The algorithm has achieved state-of-the-art performance when used in reinforcement learning across a wide range of tasks. However, the algorithm suffers from a number of drawbacks, including the lack of a step-size selection criterion and slow convergence. We investigate the use of a trust region method with a dogleg step and a Quasi-Newton approximation of the Hessian for policy optimization. We demonstrate through numerical experiments over a wide range of challenging continuous control tasks that our particular choice is efficient in terms of the number of samples and improves performance. Conference on Robot Learning (CoRL).
Robots need to learn skills that can not only generalize across similar problems but also be directed to a specific goal. Previous methods either train a new skill for every different goal or do not infer the specific target in the presence of multiple goals from visual data. We introduce an end-to-end method that represents targetable visuomotor skills as a goal-parameterized neural network policy. By training on an informative subset of available goals with the associated target parameters, we are able to learn a policy that can zero-shot generalize to previously unseen goals. We evaluate our method in a representative 2D simulation of a button grid and on both button-pressing and peg-insertion tasks on two different physical arms. We demonstrate that our model trained on 33% of the possible goals is able to generalize to more than 90% of the targets in the scene for both simulation and robot experiments. We also successfully learn a mapping from target pixel coordinates to a robot …
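A sketch of what a goal-parameterized policy looks like in code: a single network receives both the observation and a goal parameter, so the same weights can be directed to different targets, including goals held out during training. The encoder, the two-dimensional goal parameter, and fusion by concatenation are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class GoalParameterizedPolicy(nn.Module):
    # Goal-conditioned visuomotor policy: action = f(image, goal parameter).
    def __init__(self, goal_dim=2, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                                     nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(32 + goal_dim, 128), nn.ReLU(),
                                  nn.Linear(128, action_dim))

    def forward(self, image, goal):
        return self.head(torch.cat([self.encoder(image), goal], dim=-1))

# Train on a subset of goals; at test time, pass an unseen goal parameter to direct the skill.
policy = GoalParameterizedPolicy()
image, goal = torch.zeros(1, 3, 64, 64), torch.tensor([[0.25, 0.75]])
action = policy(image, goal)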