Abstract
We consider the task of 3-d depth estimation from a single still image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured indoor and outdoor environments which include forests, sidewalks, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the value of the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a hierarchical, multiscale Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models the depths and the relation between depths at different points in the image. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. We further propose a model that incorporates both monocular cues and stereo (triangulation) cues, to obtain significantly more accurate depth estimates than is possible using either monocular or stereo cues alone.
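To make the role of the MRF concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Gaussian form for the model (the abstract does not specify one), a single scale, and a 1-D row of image patches rather than a full multiscale grid. Each patch gets a local depth prediction that is linear in its features, and a pairwise term pulls neighbouring depths together. The names `local_predictions`, `map_depths`, `theta`, `sigma_data`, and `sigma_smooth` are hypothetical and chosen only for illustration.

```python
def local_predictions(features, theta):
    """Per-patch depth guess from local features alone (assumed linear model)."""
    return [sum(t * x for t, x in zip(theta, row)) for row in features]


def map_depths(unary, sigma_data=1.0, sigma_smooth=1.0):
    """MAP depths for an assumed Gaussian chain MRF.

    Minimizes  sum_i (d_i - u_i)^2 / (2 * sigma_data^2)
             + sum_i (d_i - d_{i+1})^2 / (2 * sigma_smooth^2),
    where u_i is the local prediction for patch i. Setting the gradient
    to zero gives a tridiagonal linear system, solved here with the
    Thomas algorithm.
    """
    n = len(unary)
    if n == 0:
        return []
    a = 1.0 / sigma_data ** 2      # weight of the local (data) term
    s = 1.0 / sigma_smooth ** 2    # weight of the neighbour (smoothness) term

    # Row i of the system: (a + s * num_neighbours) d_i - s d_{i-1} - s d_{i+1} = a u_i
    diag = [a + s * ((i > 0) + (i < n - 1)) for i in range(n)]
    off = -s
    rhs = [a * u for u in unary]

    # Forward elimination.
    for i in range(1, n):
        w = off / diag[i - 1]
        diag[i] -= w * off
        rhs[i] -= w * rhs[i - 1]

    # Back substitution.
    depths = [0.0] * n
    depths[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        depths[i] = (rhs[i] - off * depths[i + 1]) / diag[i]
    return depths


if __name__ == "__main__":
    # Three patches with two made-up features each; theta is an arbitrary example.
    feats = [[1.0, 0.0], [1.0, 4.0], [1.0, 0.0]]
    unary = local_predictions(feats, theta=[1.0, 1.0])
    print(unary)
    print(map_depths(unary))
```

In this sketch the middle patch's outlying local estimate is pulled toward its neighbours, which illustrates why modelling the relation between depths at different points can help when local features alone are ambiguous. The weights `sigma_data` and `sigma_smooth` control that trade-off and would have to be learned or tuned in practice.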
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Saxena, A., Chung, S.H. & Ng, A.Y. 3-D Depth Reconstruction from a Single Still Image. Int J Comput Vis 76, 53–69 (2008). https://doi.org/10.1007/s11263-007-0071-y
DOI: https://doi.org/10.1007/s11263-007-0071-y