Abstract
Recently, self-supervised pre-training has advanced Vision Transformers on various tasks across different data modalities, e.g., images and 3D point clouds. In this paper, we explore this learning paradigm for Transformer-based 3D mesh data analysis. Since applying Transformer architectures to a new modality is usually non-trivial, we first adapt the Vision Transformer to 3D mesh data processing, yielding a Mesh Transformer. Specifically, we divide a mesh into several non-overlapping local patches, each containing the same number of faces, and use the 3D position of each patch's center point to form positional embeddings. Inspired by MAE, we then explore how pre-training on 3D mesh data with this Transformer-based structure benefits downstream 3D mesh analysis tasks. We randomly mask some patches of the mesh and feed the corrupted mesh into the Mesh Transformer; by reconstructing the information of the masked patches, the network learns discriminative representations of mesh data. We therefore name our method MeshMAE. It yields state-of-the-art or comparable performance on mesh analysis tasks, i.e., classification and segmentation. In addition, we conduct comprehensive ablation studies to show the effectiveness of the key designs in our method.
Y. Liang—This work was done during Y. Liang’s internship at JD Explore Academy.
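To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of a MeshMAE-style pre-training step: each mesh is assumed to be already split into non-overlapping patches with an equal number of faces, patch centers provide positional embeddings, a fixed ratio of patches is randomly masked, and the network is trained to reconstruct the masked patches. All module names, feature dimensions (e.g., 64 faces per patch, 10-dim face features), and the mask ratio are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a MeshMAE-style masked-autoencoding step (illustrative only).
import torch
import torch.nn as nn


class MeshMAESketch(nn.Module):
    def __init__(self, faces_per_patch=64, face_feat_dim=10, embed_dim=256,
                 depth=4, num_heads=4, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        patch_dim = faces_per_patch * face_feat_dim
        # Linear patch embedding over the flattened per-face features of one patch.
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        # Positional embedding from the 3D position of each patch's center point.
        self.pos_embed = nn.Linear(3, embed_dim)
        enc_layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                               dim_feedforward=4 * embed_dim,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        dec_layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                               dim_feedforward=4 * embed_dim,
                                               batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, 1)
        # Reconstruction head predicts the raw features of the masked patches.
        self.head = nn.Linear(embed_dim, patch_dim)

    def forward(self, patch_feats, patch_centers):
        # patch_feats:   (B, P, faces_per_patch * face_feat_dim)
        # patch_centers: (B, P, 3) patch centers used for positional embeddings
        B, P, _ = patch_feats.shape
        pos = self.pos_embed(patch_centers)
        tokens = self.patch_embed(patch_feats) + pos

        # Randomly mask a fixed ratio of patches per mesh.
        num_keep = int(P * (1 - self.mask_ratio))
        noise = torch.rand(B, P, device=tokens.device)
        keep_idx = noise.argsort(dim=1)[:, :num_keep]   # visible patches
        mask_idx = noise.argsort(dim=1)[:, num_keep:]   # masked patches

        visible = torch.gather(
            tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        latent = self.encoder(visible)

        # Decoder sees encoded visible tokens plus mask tokens (with positions).
        mask_pos = torch.gather(
            pos, 1, mask_idx.unsqueeze(-1).expand(-1, -1, pos.size(-1)))
        mask_tokens = self.mask_token.expand(B, mask_idx.size(1), -1) + mask_pos
        decoded = self.decoder(torch.cat([latent, mask_tokens], dim=1))

        # Reconstruct only the masked patches and compare with their true features.
        pred = self.head(decoded[:, num_keep:])
        target = torch.gather(
            patch_feats, 1, mask_idx.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1)))
        return nn.functional.mse_loss(pred, target)


if __name__ == "__main__":
    model = MeshMAESketch()
    feats = torch.randn(2, 256, 64 * 10)    # 256 patches, 64 faces each, 10-dim face features
    centers = torch.randn(2, 256, 3)
    print(model(feats, centers).item())     # pre-training reconstruction loss
```

For downstream classification or segmentation, the decoder and reconstruction head would be discarded and the pre-trained encoder fine-tuned with a task-specific head, following the usual masked-autoencoder recipe.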
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant No. 62072348. Dr. Baosheng Yu and Dr. Jing Zhang are supported by ARC Project FL-170100117.