3D attention-driven depth acquisition for object identification

Published: 05 December 2016

Abstract

We address the problem of autonomously exploring unknown objects in a scene through consecutive depth acquisitions. The goal is to reconstruct the scene while identifying the objects online from among a large collection of 3D shapes. Fine-grained shape identification demands a meticulous series of observations attending to varying views and parts of the object of interest. Inspired by the recent success of attention-based models for 2D recognition, we develop a 3D Attention Model that selects the best views to scan from, as well as the most informative regions in each view to focus on, to achieve efficient object recognition. The region-level attention leads to focus-driven features that are robust against object occlusion. The attention model, trained with the 3D shape collection, encodes the temporal dependencies among consecutive views with deep recurrent networks. This facilitates order-aware view planning that accounts for robot movement cost. To achieve instance identification, the shape collection is organized into a hierarchy associated with pre-trained hierarchical classifiers. The effectiveness of our method is demonstrated on an autonomous robot (PR) that explores a scene and identifies the objects to construct a 3D scene model.
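
The view planning described in the abstract weighs how informative a view is against the cost of moving the robot there. The sketch below is a rough illustration only and is not the authors' recurrent attention model: it swaps in a hand-written greedy rule that scores each candidate view by expected entropy reduction over the candidate shapes minus a weighted travel distance. The function names, the per-view observation tables, the view positions, and the `cost_weight` parameter are all hypothetical and chosen for the example.

```python
import math

def entropy(belief):
    """Shannon entropy (in nats) of a discrete belief over candidate shapes."""
    return -sum(p * math.log(p) for p in belief if p > 0.0)

def update_belief(belief, likelihoods):
    """Bayes update: posterior is proportional to prior times likelihood."""
    posterior = [p * l for p, l in zip(belief, likelihoods)]
    total = sum(posterior)
    return [p / total for p in posterior]

def expected_entropy_after(belief, view_model):
    """Expected posterior entropy after scanning from one view.

    view_model[s][o] is an assumed probability of seeing outcome o
    from this view when the true shape is s (hypothetical table).
    """
    n_outcomes = len(view_model[0])
    expected = 0.0
    for o in range(n_outcomes):
        likelihoods = [row[o] for row in view_model]
        p_o = sum(p * l for p, l in zip(belief, likelihoods))
        if p_o > 0.0:  # skip outcomes that cannot occur under the belief
            expected += p_o * entropy(update_belief(belief, likelihoods))
    return expected

def next_best_view(belief, view_models, positions, current, cost_weight=0.1):
    """Greedy choice: expected information gain minus weighted travel cost."""
    h = entropy(belief)
    best, best_score = None, -math.inf
    for name, model in view_models.items():
        gain = h - expected_entropy_after(belief, model)
        cost = math.dist(positions[current], positions[name])
        score = gain - cost_weight * cost
        if score > best_score:
            best, best_score = name, score
    return best

# Toy setup: two candidate shapes, three views with made-up statistics.
belief = [0.5, 0.5]
view_models = {
    "front": [[0.5, 0.5], [0.5, 0.5]],  # uninformative view
    "side": [[0.9, 0.1], [0.1, 0.9]],   # informative and nearby
    "back": [[1.0, 0.0], [0.0, 1.0]],   # fully informative but far away
}
positions = {"front": (0.0, 0.0), "side": (1.0, 0.0), "back": (10.0, 0.0)}

print(next_best_view(belief, view_models, positions, "front"))
print(next_best_view(belief, view_models, positions, "front", cost_weight=0.0))
```

With the movement penalty active, the nearby "side" view wins; with `cost_weight=0.0`, the distant but fully disambiguating "back" view is chosen instead. A learned model such as the one in the abstract would replace both the hand-specified observation tables and the greedy one-step rule.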

    Published In

    ACM Transactions on Graphics, Volume 35, Issue 6
    November 2016
    1045 pages
    ISSN: 0730-0301
    EISSN: 1557-7368
    DOI: 10.1145/2980179
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. 3D acquisition
    2. attention-based model
    3. depth camera
    4. next-best-view
    5. object identification
    6. shape classification

    Qualifiers

    • Research-article

    Cited By

    • (2024) GAMMA: Graspability-Aware Mobile MAnipulation Policy Learning based on Online Grasping Pose Fusion. 2024 IEEE International Conference on Robotics and Automation (ICRA) (1399-1405). DOI: 10.1109/ICRA57147.2024.10610125. Online publication date: 13-May-2024
    • (2023) Geometric Primitive-Guided UAV Path Planning for High-Quality Image-Based Reconstruction. Remote Sensing 15:10 (2632). DOI: 10.3390/rs15102632. Online publication date: 18-May-2023
    • (2023) Online Scene CAD Recomposition via Autonomous Scanning. ACM Transactions on Graphics 42:6 (1-16). DOI: 10.1145/3618339. Online publication date: 5-Dec-2023
    • (2023) ScanBot: Autonomous Reconstruction via Deep Reinforcement Learning. ACM Transactions on Graphics 42:4 (1-16). DOI: 10.1145/3592113. Online publication date: 26-Jul-2023
    • (2023) Point Cloud Scene Completion With Joint Color and Semantic Estimation From Single RGB-D Image. IEEE Transactions on Pattern Analysis and Machine Intelligence 45:9 (11079-11095). DOI: 10.1109/TPAMI.2023.3264449. Online publication date: 1-Sep-2023
    • (2023) Using synthesized facial views for active face recognition. Machine Vision and Applications 34:4. DOI: 10.1007/s00138-023-01412-3. Online publication date: 29-Jun-2023
    • (2022) Asynchronous Collaborative Autoscanning with Mode Switching for Multi-Robot Scene Reconstruction. ACM Transactions on Graphics 41:6 (1-13). DOI: 10.1145/3550454.3555483. Online publication date: 30-Nov-2022
    • (2022) Autonomous Outdoor Scanning via Online Topological and Geometric Path Optimization. IEEE Transactions on Intelligent Transportation Systems 23:4 (3682-3695). DOI: 10.1109/TITS.2020.3039557. Online publication date: Apr-2022
    • (2021) A Physics-Aware Neural Network Approach for Flow Data Reconstruction From Satellite Observations. Frontiers in Climate 3. DOI: 10.3389/fclim.2021.656505. Online publication date: 9-Apr-2021
    • (2021) Supervoxel Convolution for Online 3D Semantic Segmentation. ACM Transactions on Graphics 40:3 (1-15). DOI: 10.1145/3453485. Online publication date: 1-Aug-2021
