3-D Depth Reconstruction from a Single Still Image

Published: 01 January 2008

Abstract

    We consider the task of 3-d depth estimation from a single still image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured indoor and outdoor environments which include forests, sidewalks, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the value of the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a hierarchical, multiscale Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models the depths and the relation between depths at different points in the image. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. We further propose a model that incorporates both monocular cues and stereo (triangulation) cues, to obtain significantly more accurate depth estimates than is possible using either monocular or stereo cues alone.
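
    The abstract gives no equations, so the following is only a minimal sketch of the general idea of an MRF that combines per-point depth predictions with relations between neighboring depths. It assumes a single-scale Gaussian MRF on a 4-connected grid, which is a simplification and not the hierarchical, multiscale model the article describes. The function name `map_depths` and the parameters `unary_pred`, `sigma_data`, and `sigma_smooth` are hypothetical; `unary_pred` stands in for depths predicted from image features by some learned regressor.

```python
import numpy as np

def map_depths(unary_pred, sigma_data=1.0, sigma_smooth=0.5):
    """MAP depths for a Gaussian MRF on a 4-connected grid (illustrative only).

    Minimizes
        sum_i (d_i - u_i)^2 / (2 * sigma_data^2)
      + sum_{neighbors (i, j)} (d_i - d_j)^2 / (2 * sigma_smooth^2)
    where u_i is the per-pixel prediction in `unary_pred`.
    Setting the gradient to zero gives a linear system A d = b.
    """
    h, w = unary_pred.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)

    # Data term: pulls each depth toward its feature-based prediction.
    A = np.eye(n) / sigma_data**2
    b = unary_pred.ravel() / sigma_data**2

    # Smoothness term: one quadratic penalty per horizontal/vertical neighbor pair.
    pairs = np.concatenate([
        np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1),
        np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1),
    ])
    lam = 1.0 / sigma_smooth**2
    for i, j in pairs:
        A[i, i] += lam
        A[j, j] += lam
        A[i, j] -= lam
        A[j, i] -= lam

    # A is symmetric positive definite, so the minimizer is unique.
    return np.linalg.solve(A, b).reshape(h, w)


def roughness(d):
    """Sum of squared differences between neighboring depths."""
    return (np.diff(d, axis=0) ** 2).sum() + (np.diff(d, axis=1) ** 2).sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = 5.0 + rng.normal(size=(4, 5))   # stand-in for noisy local predictions
    smoothed = map_depths(noisy)
    print(roughness(noisy), roughness(smoothed))
```

    In this sketch a smaller `sigma_smooth` enforces stronger agreement between neighboring depths, while a smaller `sigma_data` keeps the result closer to the local predictions. A dense matrix is used for brevity; a real image would need a sparse solver.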

    Published In

    International Journal of Computer Vision, Volume 76, Issue 1
    January 2008
    101 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Author Tags

    1. 3D reconstruction
    2. Dense reconstruction
    3. Depth estimation
    4. Hand-held camera
    5. Learning depth
    6. Markov random field
    7. Monocular depth
    8. Monocular vision
    9. Stereo vision
    10. Visual modeling

    Qualifiers

    • Article

    Cited By

    • (2024) Self-Supervised Monocular Depth Estimation via Binocular Geometric Correlation Learning. ACM Transactions on Multimedia Computing, Communications, and Applications, 20:8 (1-19). DOI: 10.1145/3663570. Online publication date: 13-Jun-2024
    • (2024) Monocular Depth Estimation: A Thorough Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46:4 (2396-2414). DOI: 10.1109/TPAMI.2023.3330944. Online publication date: 1-Apr-2024
    • (2024) Large-scale Monocular Depth Estimation in the Wild. Engineering Applications of Artificial Intelligence, 127:PA. DOI: 10.1016/j.engappai.2023.107189. Online publication date: 1-Feb-2024
    • (2024) Indoor Obstacle Discovery on Reflective Ground via Monocular Camera. International Journal of Computer Vision, 132:3 (987-1007). DOI: 10.1007/s11263-023-01925-4. Online publication date: 1-Mar-2024
    • (2023) Does it work outside this benchmark? Introducing the rigid depth constructor tool. Multimedia Tools and Applications, 82:27 (41641-41667). DOI: 10.1007/s11042-023-14743-0. Online publication date: 4-Apr-2023
    • (2023) A semantic-aware monocular projection model for accurate pose measurement. Pattern Analysis & Applications, 26:4 (1703-1714). DOI: 10.1007/s10044-023-01197-1. Online publication date: 1-Nov-2023
    • (2022) MonoSDF. Proceedings of the 36th International Conference on Neural Information Processing Systems (25018-25032). DOI: 10.5555/3600270.3602084. Online publication date: 28-Nov-2022
    • (2022) Image-Based OA-Style Paper Pop-Up Design via Mixed-Integer Programming. IEEE Transactions on Visualization and Computer Graphics, 29:10 (4269-4283). DOI: 10.1109/TVCG.2022.3189569. Online publication date: 8-Jul-2022
    • (2022) Unsupervised Monocular Depth Estimation Using Attention and Multi-Warp Reconstruction. IEEE Transactions on Multimedia, 24 (2938-2949). DOI: 10.1109/TMM.2021.3091308. Online publication date: 1-Jan-2022
    • (2022) Near-Field Perception for Low-Speed Vehicle Automation Using Surround-View Fisheye Cameras. IEEE Transactions on Intelligent Transportation Systems, 23:9 (13976-13993). DOI: 10.1109/TITS.2021.3127646. Online publication date: 1-Sep-2022