Abstract
Recently, self-supervised pre-training has advanced Vision Transformers on various tasks across different data modalities, e.g., images and 3D point clouds. In this paper, we explore this learning paradigm for Transformer-based 3D mesh data analysis. Since applying Transformer architectures to a new modality is usually non-trivial, we first adapt the Vision Transformer to 3D mesh data processing, yielding a Mesh Transformer. Specifically, we divide a mesh into several non-overlapping local patches, each containing the same number of faces, and use the 3D position of each patch's center point to form positional embeddings. Inspired by MAE, we then explore how pre-training on 3D mesh data with this Transformer-based structure benefits downstream 3D mesh analysis tasks. We randomly mask some patches of the mesh and feed the corrupted mesh into the Mesh Transformer; by reconstructing the information of the masked patches, the network learns discriminative representations of mesh data. We therefore name our method MeshMAE. It yields state-of-the-art or comparable performance on mesh analysis tasks, i.e., classification and segmentation. In addition, we conduct comprehensive ablation studies to show the effectiveness of the key designs in our method.
Y. Liang—This work was done during Y. Liang’s internship at JD Explore Academy.
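To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of a MeshMAE-style pre-training step: each mesh is assumed to be already split into non-overlapping patches with an equal number of faces, patch centers provide positional embeddings, a fixed ratio of patches is randomly masked, and the network is trained to reconstruct the masked patches. All module names, feature dimensions (e.g., 64 faces per patch, 10-dim face features), and the mask ratio are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a MeshMAE-style masked-autoencoding step (illustrative only).
import torch
import torch.nn as nn


class MeshMAESketch(nn.Module):
    def __init__(self, faces_per_patch=64, face_feat_dim=10, embed_dim=256,
                 depth=4, num_heads=4, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        patch_dim = faces_per_patch * face_feat_dim
        # Linear patch embedding over the flattened per-face features of one patch.
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        # Positional embedding from the 3D position of each patch's center point.
        self.pos_embed = nn.Linear(3, embed_dim)
        enc_layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                               dim_feedforward=4 * embed_dim,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        dec_layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                               dim_feedforward=4 * embed_dim,
                                               batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, 1)
        # Reconstruction head predicts the raw features of the masked patches.
        self.head = nn.Linear(embed_dim, patch_dim)

    def forward(self, patch_feats, patch_centers):
        # patch_feats:   (B, P, faces_per_patch * face_feat_dim)
        # patch_centers: (B, P, 3) patch centers used for positional embeddings
        B, P, _ = patch_feats.shape
        pos = self.pos_embed(patch_centers)
        tokens = self.patch_embed(patch_feats) + pos

        # Randomly mask a fixed ratio of patches per mesh.
        num_keep = int(P * (1 - self.mask_ratio))
        noise = torch.rand(B, P, device=tokens.device)
        keep_idx = noise.argsort(dim=1)[:, :num_keep]   # visible patches
        mask_idx = noise.argsort(dim=1)[:, num_keep:]   # masked patches

        visible = torch.gather(
            tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        latent = self.encoder(visible)

        # Decoder sees encoded visible tokens plus mask tokens (with positions).
        mask_pos = torch.gather(
            pos, 1, mask_idx.unsqueeze(-1).expand(-1, -1, pos.size(-1)))
        mask_tokens = self.mask_token.expand(B, mask_idx.size(1), -1) + mask_pos
        decoded = self.decoder(torch.cat([latent, mask_tokens], dim=1))

        # Reconstruct only the masked patches and compare with their true features.
        pred = self.head(decoded[:, num_keep:])
        target = torch.gather(
            patch_feats, 1, mask_idx.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1)))
        return nn.functional.mse_loss(pred, target)


if __name__ == "__main__":
    model = MeshMAESketch()
    feats = torch.randn(2, 256, 64 * 10)    # 256 patches, 64 faces each, 10-dim face features
    centers = torch.randn(2, 256, 3)
    print(model(feats, centers).item())     # pre-training reconstruction loss
```

For downstream classification or segmentation, the decoder and reconstruction head would be discarded and the pre-trained encoder fine-tuned with a task-specific head, following the usual masked-autoencoder recipe.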
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant No. 62072348. Dr. Baosheng Yu and Dr. Jing Zhang are supported by ARC Project FL-170100117.