DOI: 10.1007/978-3-031-72943-0_17
Article

Beyond Viewpoint: Robust 3D Object Recognition Under Arbitrary Views Through Joint Multi-part Representation

Published: 29 November 2024

Abstract

Existing view-based methods excel at recognizing 3D objects from predefined viewpoints, but their ability to recognize objects under arbitrary views remains largely unexplored. This is a challenging and realistic setting, since each object may be captured from viewpoints that differ in position and number, and object poses are not aligned. Most view-based methods, which aggregate multiple view features into a single global representation, struggle with 3D object recognition under arbitrary views: because inputs from arbitrary views are unaligned, robust feature aggregation is difficult and performance degrades. In this paper, we introduce a novel Part-aware Network (PANet), a part-based representation that addresses these issues. The part-based representation localizes and characterizes distinct parts of 3D objects, such as airplane wings and tails. Its viewpoint invariance and rotation robustness give it an advantage for 3D object recognition under arbitrary views. Results on benchmark datasets clearly demonstrate that our proposed method outperforms existing view-based aggregation baselines for 3D object recognition under arbitrary views, even surpassing most fixed-viewpoint methods.
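The contrast the abstract draws between global view pooling and a part-based representation can be illustrated with a minimal NumPy sketch. This is not the paper's PANet architecture; it is a hypothetical toy in which a set of "part query" vectors (here random stand-ins for learned parameters) each attends over an arbitrary, unordered set of view features, producing one descriptor per part. Both aggregations are invariant to view order and view count, which is the property the abstract highlights for the arbitrary-view setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one object captured from a varying number of
# unaligned views, each already encoded as a D-dimensional feature.
D, P = 16, 4  # feature dimension, number of part slots (both assumed)

def view_pool(view_feats):
    """Conventional view-based aggregation: element-wise max over views,
    collapsing everything into a single global vector."""
    return view_feats.max(axis=0)

def part_aggregate(view_feats, part_queries):
    """Part-based aggregation sketch: each of the P part queries computes
    softmax attention weights over the views, yielding one descriptor per
    part. The per-part normalization over views makes the result
    independent of how many views were captured and of their ordering."""
    attn = view_feats @ part_queries.T            # (V, P) affinities
    w = np.exp(attn - attn.max(axis=0))
    w = w / w.sum(axis=0)                         # softmax over the V views
    return (w.T @ view_feats).reshape(-1)         # concatenated part descriptors

part_queries = rng.normal(size=(P, D))            # stand-in for learned queries
views_a = rng.normal(size=(7, D))                 # 7 arbitrary views
views_b = views_a[rng.permutation(7)]             # same views, shuffled order

# Both aggregations are permutation-invariant over the view set:
assert np.allclose(view_pool(views_a), view_pool(views_b))
assert np.allclose(part_aggregate(views_a, part_queries),
                   part_aggregate(views_b, part_queries))
```

The toy also shows why a part-based descriptor can be more robust than a single pooled vector: each part slot can specialize on the views in which that part is visible, rather than every view competing for one global representation.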


            Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LII
Sep 2024, 577 pages
ISBN: 978-3-031-72942-3
DOI: 10.1007/978-3-031-72943-0
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag, Berlin, Heidelberg


Author Tags

1. 3D object recognition
2. weakly-supervised learning
