
3D Interpreter Networks for Viewer-Centered Wireframe Modeling

Published: 01 September 2018

Abstract

Understanding 3D object structure from a single image is an important but challenging task in computer vision, mostly due to the lack of 3D object annotations for real images. Previous research tackled this problem by either searching for a 3D shape that best explains 2D annotations, or training purely on synthetic data with ground truth 3D information. In this work, we propose 3D INterpreter Networks (3D-INN), an end-to-end trainable framework that sequentially estimates 2D keypoint heatmaps and 3D object skeletons and poses. Our system learns from both 2D-annotated real images and synthetic 3D data. This is made possible mainly by two technical innovations. First, heatmaps of 2D keypoints serve as an intermediate representation to connect real and synthetic data. 3D-INN is trained on real images to estimate 2D keypoint heatmaps from an input image; it then predicts 3D object structure from the heatmaps using knowledge learned from synthetic 3D shapes. By doing so, 3D-INN benefits from the variation and abundance of synthetic 3D objects without suffering from the domain difference between real and synthesized images, which is often caused by imperfect rendering. Second, we propose a Projection Layer that maps the estimated 3D structure back to 2D. During training, it ensures that 3D-INN predicts 3D structures whose projections are consistent with the 2D annotations of real images. Experiments show that the proposed system performs well on both 2D keypoint estimation and 3D structure recovery. We also demonstrate that the recovered 3D information has wide applications in vision, such as image retrieval.
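To make the role of the Projection Layer more concrete, the sketch below projects a 3D skeleton to 2D and scores it against 2D keypoint annotations. It is only an illustration under stated assumptions: the abstract does not specify the camera model, so a simple pinhole camera with a rotation, a translation, and a focal length is assumed here, and the names `rotation_y`, `project`, and `reprojection_loss` are hypothetical rather than taken from the paper.

```python
import math

def rotation_y(theta):
    """3x3 rotation about the vertical (y) axis, angle in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def project(points_3d, rotation, translation, focal):
    """Rotate, translate, then perspective-project 3D keypoints to 2D.

    Assumes a pinhole camera looking down the z axis (an assumption of
    this sketch, not a detail stated in the abstract).
    """
    projected = []
    for p in points_3d:
        # Camera-frame coordinates: R @ p + T
        x, y, z = (sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
                   for i in range(3))
        projected.append((focal * x / z, focal * y / z))
    return projected

def reprojection_loss(points_3d, rotation, translation, focal, keypoints_2d):
    """Mean squared distance between projected and annotated 2D keypoints."""
    pred = project(points_3d, rotation, translation, focal)
    total = sum((u - a) ** 2 + (v - b) ** 2
                for (u, v), (a, b) in zip(pred, keypoints_2d))
    return total / len(keypoints_2d)

# Toy skeleton: four corners of a square, placed 5 units in front of the camera.
skeleton = [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (1.0, 1.0, 0.0), (-1.0, 1.0, 0.0)]
R = rotation_y(math.radians(30))
T = (0.0, 0.0, 5.0)

# Stand-in for 2D annotations: the projection under the "true" pose.
annotations = project(skeleton, R, T, focal=1.0)

loss_true = reprojection_loss(skeleton, R, T, 1.0, annotations)
loss_wrong = reprojection_loss(skeleton, rotation_y(0.0), T, 1.0, annotations)
print(loss_true, loss_wrong)
```

In this toy setup the loss vanishes for the pose that generated the annotations and is positive for a wrong pose, which is the kind of signal a projection-consistency term can provide when only 2D annotations are available. In a trainable network the same computation would be written with differentiable tensor operations so that gradients reach the 3D structure and pose estimates.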


Published In

International Journal of Computer Vision, Volume 126, Issue 9
September 2018
162 pages

Publisher

Kluwer Academic Publishers
United States

Author Tags

1. 3D skeleton
2. Keypoint estimation
3. Neural network
4. Single image 3D reconstruction
5. Synthetic data
