-
Understanding Physical Dynamics with Counterfactual World Modeling
Authors:
Rahul Venkatesh,
Honglin Chen,
Kevin Feigelis,
Daniel M. Bear,
Khaled Jedoui,
Klemen Kotar,
Felix Binder,
Wanhee Lee,
Sherry Liu,
Kevin A. Smith,
Judith E. Fan,
Daniel L. K. Yamins
Abstract:
The ability to understand physical dynamics is critical for agents to act in the world. Here, we use Counterfactual World Modeling (CWM) to extract vision structures for dynamics understanding. CWM uses a temporally-factored masking policy for masked prediction of video data without annotations. This policy enables highly effective "counterfactual prompting" of the predictor, allowing a spectrum o…
▽ More
The ability to understand physical dynamics is critical for agents to act in the world. Here, we use Counterfactual World Modeling (CWM) to extract vision structures for dynamics understanding. CWM uses a temporally-factored masking policy for masked prediction of video data without annotations. This policy enables highly effective "counterfactual prompting" of the predictor, allowing a spectrum of visual structures to be extracted from a single pre-trained predictor without finetuning on annotated datasets. We demonstrate that these structures are useful for physical dynamics understanding, allowing CWM to achieve the state-of-the-art performance on the Physion benchmark.
△ Less
Submitted 22 July, 2024; v1 submitted 10 December, 2023;
originally announced December 2023.
-
Unifying (Machine) Vision via Counterfactual World Modeling
Authors:
Daniel M. Bear,
Kevin Feigelis,
Honglin Chen,
Wanhee Lee,
Rahul Venkatesh,
Klemen Kotar,
Alex Durango,
Daniel L. K. Yamins
Abstract:
Leading approaches in machine vision employ different architectures for different tasks, trained on costly task-specific labeled datasets. This complexity has held back progress in areas, such as robotics, where robust task-general perception remains a bottleneck. In contrast, "foundation models" of natural language have shown how large pre-trained neural networks can provide zero-shot solutions t…
▽ More
Leading approaches in machine vision employ different architectures for different tasks, trained on costly task-specific labeled datasets. This complexity has held back progress in areas, such as robotics, where robust task-general perception remains a bottleneck. In contrast, "foundation models" of natural language have shown how large pre-trained neural networks can provide zero-shot solutions to a broad spectrum of apparently distinct tasks. Here we introduce Counterfactual World Modeling (CWM), a framework for constructing a visual foundation model: a unified, unsupervised network that can be prompted to perform a wide variety of visual computations. CWM has two key components, which resolve the core issues that have hindered application of the foundation model concept to vision. The first is structured masking, a generalization of masked prediction methods that encourages a prediction model to capture the low-dimensional structure in visual data. The model thereby factors the key physical components of a scene and exposes an interface to them via small sets of visual tokens. This in turn enables CWM's second main idea -- counterfactual prompting -- the observation that many apparently distinct visual representations can be computed, in a zero-shot manner, by comparing the prediction model's output on real inputs versus slightly modified ("counterfactual") inputs. We show that CWM generates high-quality readouts on real-world images and videos for a diversity of tasks, including estimation of keypoints, optical flow, occlusions, object segments, and relative depth. Taken together, our results show that CWM is a promising path to unifying the manifold strands of machine vision in a conceptually simple foundation.
△ Less
Submitted 2 June, 2023;
originally announced June 2023.
-
ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation
Authors:
Chuang Gan,
Jeremy Schwartz,
Seth Alter,
Damian Mrowca,
Martin Schrimpf,
James Traer,
Julian De Freitas,
Jonas Kubilius,
Abhishek Bhandwaldar,
Nick Haber,
Megumi Sano,
Kuno Kim,
Elias Wang,
Michael Lingelbach,
Aidan Curtis,
Kevin Feigelis,
Daniel M. Bear,
Dan Gutfreund,
David Cox,
Antonio Torralba,
James J. DiCarlo,
Joshua B. Tenenbaum,
Josh H. McDermott,
Daniel L. K. Yamins
Abstract:
We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation. TDW enables simulation of high-fidelity sensory data and physical interactions between mobile agents and objects in rich 3D environments. Unique properties include: real-time near-photo-realistic image rendering; a library of objects and environments, and routines for their customization; generative procedu…
▽ More
We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation. TDW enables simulation of high-fidelity sensory data and physical interactions between mobile agents and objects in rich 3D environments. Unique properties include: real-time near-photo-realistic image rendering; a library of objects and environments, and routines for their customization; generative procedures for efficiently building classes of new environments; high-fidelity audio rendering; realistic physical interactions for a variety of material types, including cloths, liquid, and deformable objects; customizable agents that embody AI agents; and support for human interactions with VR devices. TDW's API enables multiple agents to interact within a simulation and returns a range of sensor and physics data representing the state of the world. We present initial experiments enabled by TDW in emerging research directions in computer vision, machine learning, and cognitive science, including multi-modal physical scene understanding, physical dynamics predictions, multi-agent interactions, models that learn like a child, and attention studies in humans and neural networks.
△ Less
Submitted 28 December, 2021; v1 submitted 9 July, 2020;
originally announced July 2020.
-
Flexible and Efficient Long-Range Planning Through Curious Exploration
Authors:
Aidan Curtis,
Minjian Xin,
Dilip Arumugam,
Kevin Feigelis,
Daniel Yamins
Abstract:
Identifying algorithms that flexibly and efficiently discover temporally-extended multi-phase plans is an essential step for the advancement of robotics and model-based reinforcement learning. The core problem of long-range planning is finding an efficient way to search through the tree of possible action sequences. Existing non-learned planning solutions from the Task and Motion Planning (TAMP) l…
▽ More
Identifying algorithms that flexibly and efficiently discover temporally-extended multi-phase plans is an essential step for the advancement of robotics and model-based reinforcement learning. The core problem of long-range planning is finding an efficient way to search through the tree of possible action sequences. Existing non-learned planning solutions from the Task and Motion Planning (TAMP) literature rely on the existence of logical descriptions for the effects and preconditions for actions. This constraint allows TAMP methods to efficiently reduce the tree search problem but limits their ability to generalize to unseen and complex physical environments. In contrast, deep reinforcement learning (DRL) methods use flexible neural-network-based function approximators to discover policies that generalize naturally to unseen circumstances. However, DRL methods struggle to handle the very sparse reward landscapes inherent to long-range multi-step planning situations. Here, we propose the Curious Sample Planner (CSP), which fuses elements of TAMP and DRL by combining a curiosity-guided sampling strategy with imitation learning to accelerate planning. We show that CSP can efficiently discover interesting and complex temporally-extended plans for solving a wide range of physically realistic 3D tasks. In contrast, standard planning and learning methods often fail to solve these tasks at all or do so only with a huge and highly variable number of training samples. We explore the use of a variety of curiosity metrics with CSP and analyze the types of solutions that CSP discovers. Finally, we show that CSP supports task transfer so that the exploration policies learned during experience with one task can help improve efficiency on related tasks.
△ Less
Submitted 8 July, 2020; v1 submitted 22 April, 2020;
originally announced April 2020.
-
Modular Continual Learning in a Unified Visual Environment
Authors:
Kevin T. Feigelis,
Blue Sheffer,
Daniel L. K. Yamins
Abstract:
A core aspect of human intelligence is the ability to learn new tasks quickly and switch between them flexibly. Here, we describe a modular continual reinforcement learning paradigm inspired by these abilities. We first introduce a visual interaction environment that allows many types of tasks to be unified in a single framework. We then describe a reward map prediction scheme that learns new task…
▽ More
A core aspect of human intelligence is the ability to learn new tasks quickly and switch between them flexibly. Here, we describe a modular continual reinforcement learning paradigm inspired by these abilities. We first introduce a visual interaction environment that allows many types of tasks to be unified in a single framework. We then describe a reward map prediction scheme that learns new tasks robustly in the very large state and action spaces required by such an environment. We investigate how properties of module architecture influence efficiency of task learning, showing that a module motif incorporating specific design principles (e.g. early bottlenecks, low-order polynomial nonlinearities, and symmetry) significantly outperforms more standard neural network motifs, needing fewer training examples and fewer neurons to achieve high levels of performance. Finally, we present a meta-controller architecture for task switching based on a dynamic neural voting scheme, which allows new modules to use information learned from previously-seen tasks to substantially improve their own learning efficiency.
△ Less
Submitted 11 December, 2017; v1 submitted 20 November, 2017;
originally announced November 2017.
-
A Useful Motif for Flexible Task Learning in an Embodied Two-Dimensional Visual Environment
Authors:
Kevin T. Feigelis,
Daniel L. K. Yamins
Abstract:
Animals (especially humans) have an amazing ability to learn new tasks quickly, and switch between them flexibly. How brains support this ability is largely unknown, both neuroscientifically and algorithmically. One reasonable supposition is that modules drawing on an underlying general-purpose sensory representation are dynamically allocated on a per-task basis. Recent results from neuroscience a…
▽ More
Animals (especially humans) have an amazing ability to learn new tasks quickly, and switch between them flexibly. How brains support this ability is largely unknown, both neuroscientifically and algorithmically. One reasonable supposition is that modules drawing on an underlying general-purpose sensory representation are dynamically allocated on a per-task basis. Recent results from neuroscience and artificial intelligence suggest the role of the general purpose visual representation may be played by a deep convolutional neural network, and give some clues how task modules based on such a representation might be discovered and constructed. In this work, we investigate module architectures in an embodied two-dimensional touchscreen environment, in which an agent's learning must occur via interactions with an environment that emits images and rewards, and accepts touches as input. This environment is designed to capture the physical structure of the task environments that are commonly deployed in visual neuroscience and psychophysics. We show that in this context, very simple changes in the nonlinear activations used by such a module can significantly influence how fast it is at learning visual tasks and how suitable it is for switching to new tasks.
△ Less
Submitted 21 June, 2017;
originally announced June 2017.