
3D Interpreter Networks for Viewer-Centered Wireframe Modeling

Authors:

Joshua B. Tenenbaum,

Antonio Torralba,

William T. Freeman

International Journal of Computer Vision, Volume 126, Issue 9

Pages 1009 - 1026

https://doi.org/10.1007/s11263-018-1074-6

Published: 01 September 2018

Abstract

Understanding 3D object structure from a single image is an important but challenging task in computer vision, mostly due to the lack of 3D object annotations for real images. Previous research tackled this problem either by searching for a 3D shape that best explains 2D annotations, or by training purely on synthetic data with ground-truth 3D information. In this work, we propose 3D INterpreter Networks (3D-INN), an end-to-end trainable framework that sequentially estimates 2D keypoint heatmaps and 3D object skeletons and poses. Our system learns from both 2D-annotated real images and synthetic 3D data. This is made possible mainly by two technical innovations. First, heatmaps of 2D keypoints serve as an intermediate representation that connects real and synthetic data. 3D-INN is trained on real images to estimate 2D keypoint heatmaps from an input image; it then predicts 3D object structure from the heatmaps using knowledge learned from synthetic 3D shapes. By doing so, 3D-INN benefits from the variation and abundance of synthetic 3D objects without suffering from the domain difference between real and synthesized images, which is often due to imperfect rendering. Second, we propose a Projection Layer that maps the estimated 3D structure back to 2D. During training, it ensures that 3D-INN predicts 3D structure whose projection is consistent with the 2D annotations for real images. Experiments show that the proposed system performs well on both 2D keypoint estimation and 3D structure recovery. We also demonstrate that the recovered 3D information has wide applications in vision, such as image retrieval.
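
To make the role of the Projection Layer concrete, here is a minimal NumPy sketch of projecting 3D keypoints to 2D and scoring them against 2D annotations. The abstract does not specify the camera model or the loss, so the pinhole perspective camera, the Euler-angle rotation, the squared-error loss, and the names `rotation_from_euler`, `project` and `reprojection_loss` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def rotation_from_euler(rx, ry, rz):
    """Build a 3x3 rotation matrix from Euler angles in radians (Rz @ Ry @ Rx)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx


def project(points_3d, R, t, f):
    """Assumed pinhole projection of (N, 3) keypoints to (N, 2) image coordinates."""
    cam = points_3d @ R.T + t            # rotate and translate into the camera frame
    return f * cam[:, :2] / cam[:, 2:3]  # perspective divide by depth


def reprojection_loss(points_3d, R, t, f, keypoints_2d):
    """Mean squared distance between projected keypoints and 2D annotations."""
    diff = project(points_3d, R, t, f) - keypoints_2d
    return float(np.mean(np.sum(diff ** 2, axis=1)))


# Hypothetical four-keypoint skeleton, placed 5 units in front of the camera.
skeleton = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
R = rotation_from_euler(0.1, -0.4, 0.2)
t = np.array([0.0, 0.0, 5.0])
annotations = project(skeleton, R, t, f=2.0)  # stand-in for 2D labels on a real image

loss_consistent = reprojection_loss(skeleton, R, t, 2.0, annotations)
loss_perturbed = reprojection_loss(skeleton + 0.1, R, t, 2.0, annotations)
```

In this sketch a 3D estimate whose projection matches the annotations incurs zero loss, while a perturbed estimate is penalized. A loss of this kind needs only 2D labels, which is why it can supervise 3D prediction on real images that have no 3D annotations.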



Index Terms

3D Interpreter Networks for Viewer-Centered Wireframe Modeling
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Image and video acquisition
        3D imaging
  2. Computer graphics
    1. Shape modeling

Index terms have been assigned to the content through auto-classification.


Published In


International Journal of Computer Vision Volume 126, Issue 9

September 2018

162 pages

ISSN:0920-5691


Copyright © 2018 Springer Science+Business Media, LLC, part of Springer Nature.

Publisher

Kluwer Academic Publishers

United States
