DOI: 10.1145/3503161.3548140
Research article

HMTN: Hierarchical Multi-scale Transformer Network for 3D Shape Recognition

Published: 10 October 2022

Abstract

As an important field of multimedia, 3D shape recognition has attracted much research attention in recent years. Various approaches have been proposed, among which multiview-based methods show promising performance. In general, an effective 3D shape recognition algorithm should take both the multiview local and global visual information into consideration, and explore the inherent properties of the generated 3D descriptors to guarantee effective feature alignment in the common space. To tackle these issues, we propose a novel Hierarchical Multi-scale Transformer Network (HMTN) for the 3D shape recognition task. In HMTN, we propose a multi-level regional transformer (MLRT) module for shape descriptor generation. MLRT consists of two branches that extract intra-view local characteristics by modeling region-wise dependencies and supervise the multiview global information at different granularities. Specifically, MLRT comprehensively considers the relations among different regions and focuses on the discriminative parts, which improves the effectiveness of the learned descriptors. Finally, we adopt a cross-granularity contrastive learning (CCL) mechanism for shape descriptor alignment in the common space. CCL explores and utilizes the cross-granularity semantic correlation to guide the descriptor extraction process while performing instance alignment based on category information. We evaluate the proposed network on several public benchmarks, where HMTN achieves competitive performance compared with state-of-the-art (SOTA) methods.
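The cross-granularity contrastive alignment described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the InfoNCE-style formulation, the temperature value, and the toy descriptors are all assumptions for illustration. The general idea is that each fine-granularity descriptor should be pulled toward coarse-granularity descriptors of the same category and pushed away from the rest.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Project descriptors onto the unit sphere before comparison.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cross_granularity_contrastive(fine, coarse, labels, tau=0.1):
    """InfoNCE-style loss between fine- and coarse-granularity descriptors.

    Hypothetical sketch: for each fine descriptor (anchor), every coarse
    descriptor sharing its category label is a positive; all others are
    negatives. `tau` is a temperature hyperparameter.
    """
    f = l2_normalize(fine)
    c = l2_normalize(coarse)
    sim = f @ c.T / tau                              # (N, N) similarity logits
    logits = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos_mask = (labels[:, None] == labels[None, :]).astype(float)
    # Average log-probability over each anchor's positive set, then negate.
    loss = -(pos_mask * log_prob).sum(axis=1) / pos_mask.sum(axis=1)
    return loss.mean()
```

Under this formulation, the loss is low when descriptors from the two granularities cluster by category in the shared space, and high when they are misaligned, which is the behavior the CCL mechanism relies on to guide descriptor extraction.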

Supplementary Material

MP4 File (MM22-fp1653.mp4)
Presentation Video for HMTN: Hierarchical Multi-scale Transformer Network for 3D Shape Recognition


Cited By

  • (2024) Dynamic View Aggregation for Multi-View 3D Shape Recognition. IEEE Transactions on Multimedia 26 (2024), 9163-9174. DOI: 10.1109/TMM.2024.3387656. Online publication date: 7 May 2024.
  • (2024) Toward Real-World Multi-View Object Classification: Dataset, Benchmark, and Analysis. IEEE Transactions on Circuits and Systems for Video Technology 34, 7 (2024), 5653-5664. DOI: 10.1109/TCSVT.2024.3359681. Online publication date: July 2024.
  • (2023) Concept Parser With Multimodal Graph Learning for Video Captioning. IEEE Transactions on Circuits and Systems for Video Technology 33, 9 (2023), 4484-4495. DOI: 10.1109/TCSVT.2023.3277827. Online publication date: 1 September 2023.

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. 3D shape recognition
  2. hierarchical network
  3. transformer

Qualifiers

  • Research-article

Funding Sources

  • the National Natural Science Foundation of China
  • the National Key Research and Development Program of China

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months)51
  • Downloads (Last 6 weeks)2
Reflects downloads up to 25 Dec 2024

