Unsupervised Discovery of Parts, Structure, and Dynamics

Figure 1: By observing a human moving, people can perceive disentangled object parts, understand their hierarchical structure, and capture the corresponding motion fields, all without any annotations.

Figure 2: Our PSD model has seven components: (a) motion encoder; (b) kernel decoder; (c) image encoder; (d) cross convolution; (e) motion decoder; (f) structural descriptor; and (g) image decoder.
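Of the components listed in Figure 2, the cross convolution (d) is the easiest to illustrate in isolation: kernels predicted from the motion code are applied to feature maps computed from the image. The following is a minimal NumPy sketch, not the authors' implementation. It assumes one kernel per feature channel (applied depthwise) and zero padding; the function name `cross_convolution` and the array shapes are illustrative choices, not taken from the paper or its code release.

```python
import numpy as np


def cross_convolution(features, kernels):
    """Apply one kernel per channel to the matching feature map.

    features: array of shape (C, H, W), e.g. produced by an image encoder.
    kernels:  array of shape (C, k, k) with odd k, e.g. produced by a
              kernel decoder from a motion code (assumed layout).
    Returns an array of shape (C, H, W).
    """
    C, H, W = features.shape
    Ck, k, k2 = kernels.shape
    assert C == Ck and k == k2 and k % 2 == 1

    pad = k // 2
    # Zero-pad the spatial dimensions so the output keeps the input size.
    padded = np.pad(features, ((0, 0), (pad, pad), (pad, pad)))

    out = np.zeros((C, H, W), dtype=float)
    # Accumulate one shifted, weighted copy of the input per kernel tap
    # (cross-correlation, the usual deep-learning convention).
    for dy in range(k):
        for dx in range(k):
            weight = kernels[:, dy, dx, None, None]  # shape (C, 1, 1)
            out += weight * padded[:, dy:dy + H, dx:dx + W]
    return out


# Toy example: a one-hot kernel simply translates the feature map.
feature = np.zeros((1, 5, 5))
feature[0, 2, 2] = 1.0            # single active location
shift_right = np.zeros((1, 3, 3))
shift_right[0, 1, 0] = 1.0        # tap on the left of the kernel centre
moved = cross_convolution(feature, shift_right)
```

In this toy case the active location moves one pixel to the right, which gives some intuition for why sample-specific kernels are a convenient way to express motion: different kernels applied to the same feature maps yield differently displaced outputs.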

Supplementary video for our setup, model, and results

Abstract

Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future. In this paper, we propose a novel formulation that simultaneously learns a hierarchical, disentangled object representation and a dynamics model for object parts from unlabeled videos. Our Parts, Structure, and Dynamics (PSD) model learns to, first, recognize the object parts via a layered image representation; second, predict hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure; and third, model the system dynamics by predicting the future. Experiments on multiple real and synthetic datasets demonstrate that our PSD model works well on all three tasks: segmenting object parts, building their hierarchical structure, and capturing their motion distributions.

Publication

Unsupervised Discovery of Parts, Structure, and Dynamics

Zhenjia Xu*, Zhijian Liu*, Chen Sun, Kevin Murphy, William T. Freeman, Joshua B. Tenenbaum, and Jiajun Wu

ICLR 2019 Paper BibTeX (* indicates equal contributions)

Downloads

Code, pretrained models, and data: GitHub repo

Related Publications

Visual Dynamics: Stochastic Future Generation via Layered Cross Convolutional Networks

Tianfan Xue*, Jiajun Wu*, Katherine L. Bouman, and William T. Freeman

IEEE TPAMI 2018, NIPS 2016 Paper (conference) Paper (journal) Project Page (* indicates equal contributions)