3-D Depth Reconstruction from a Single Still Image

Saxena, Ashutosh; Chung, Sung H.; Ng, Andrew Y.

doi:10.1007/s11263-007-0071-y

Open access
Published: 16 August 2007

Volume 76, pages 53–69 (2008)

International Journal of Computer Vision

Abstract

We consider the task of 3-D depth estimation from a single still image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured indoor and outdoor environments that include forests, sidewalks, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the value of the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a hierarchical, multiscale Markov Random Field (MRF) that incorporates multiscale local and global image features, and models the depths and the relation between depths at different points in the image. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. We further propose a model that incorporates both monocular cues and stereo (triangulation) cues, to obtain significantly more accurate depth estimates than is possible using either monocular or stereo cues alone.
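
The idea of modelling both the depths and the relation between depths at neighbouring points can be illustrated with a small sketch. This is not the authors' model: it assumes a single-scale Gaussian MRF on a 4-connected grid, a form the abstract does not specify, and it replaces the learned image features with synthetic noisy per-point predictions. The names `map_depth`, `sigma_u`, `sigma_p`, and `local_pred` are hypothetical. Under these assumptions the energy is quadratic, so the most probable depthmap is the solution of a linear system.

```python
import numpy as np


def map_depth(unary, sigma_u, sigma_p):
    """MAP depths for a toy Gaussian MRF on an H x W 4-connected grid.

    Assumed energy (illustrative only, not taken from the paper):
        sum_i (d_i - unary_i)^2 / (2 * sigma_u^2)
      + sum_{neighbours (i, j)} (d_i - d_j)^2 / (2 * sigma_p^2)
    Setting the gradient to zero gives the linear system A d = b.
    """
    h, w = unary.shape
    n = h * w
    # Unary (data) term: pulls each depth towards its local prediction.
    A = np.eye(n) / sigma_u**2
    b = unary.ravel() / sigma_u**2
    idx = np.arange(n).reshape(h, w)
    # Horizontal and vertical neighbour pairs on the grid.
    pairs = np.concatenate([
        np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1),
        np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1),
    ])
    # Pairwise (smoothness) term: adds a graph Laplacian scaled by 1/sigma_p^2.
    for i, j in pairs:
        A[i, i] += 1.0 / sigma_p**2
        A[j, j] += 1.0 / sigma_p**2
        A[i, j] -= 1.0 / sigma_p**2
        A[j, i] -= 1.0 / sigma_p**2
    return np.linalg.solve(A, b).reshape(h, w)


# Synthetic stand-in for per-point predictions from a learned regressor.
rng = np.random.default_rng(0)
true_depth = np.tile(np.linspace(1.0, 5.0, 6)[:, None], (1, 6))
local_pred = true_depth + rng.normal(scale=0.5, size=true_depth.shape)
smoothed = map_depth(local_pred, sigma_u=1.0, sigma_p=0.5)
print("mean abs error, local predictions:", np.abs(local_pred - true_depth).mean())
print("mean abs error, after MRF smoothing:", np.abs(smoothed - true_depth).mean())
```

A smaller `sigma_p` relative to `sigma_u` trusts the neighbourhood relation more and yields a smoother depthmap; a larger one leaves the local predictions nearly unchanged. A second data term with its own variance could be added in the same way to represent a stereo estimate where one is available, which is again an assumption about form rather than a description of the paper's combined model.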

Author information

Authors and Affiliations

Computer Science Department, Stanford University, Stanford, CA, 94305, USA
Ashutosh Saxena, Sung H. Chung & Andrew Y. Ng

Corresponding author

Correspondence to Ashutosh Saxena.

Rights and permissions

Open Access. This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Saxena, A., Chung, S.H. & Ng, A.Y. 3-D Depth Reconstruction from a Single Still Image. Int J Comput Vis 76, 53–69 (2008). https://doi.org/10.1007/s11263-007-0071-y

Received: 01 November 2006
Accepted: 06 June 2007
Published: 16 August 2007
Issue Date: January 2008
DOI: https://doi.org/10.1007/s11263-007-0071-y
