DOI: 10.1145/3394171.3413621
Research article

Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion

Published: 12 October 2020
Abstract

We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network. The framework consists of two innovative fusion schemes. First, unlike existing multimodal methods that require individual encoders for different modalities, we verify that multimodal features can be learnt within a single shared network by merely maintaining modality-specific batch normalization layers in the encoder, which also enables implicit fusion via joint feature representation learning. Second, we propose a bidirectional multi-layer fusion scheme, in which multimodal features can be exploited progressively. To take advantage of this scheme, we introduce two asymmetric fusion operations, channel shuffle and pixel shift, which learn different fused features with respect to different fusion directions. Both operations are parameter-free; they strengthen multimodal feature interactions across channels and enhance spatial feature discrimination within channels. We conduct extensive experiments on semantic segmentation and image translation tasks, using three publicly available datasets covering diverse modalities. Results indicate that our proposed framework is general, compact, and superior to state-of-the-art fusion frameworks.
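
As a concrete illustration of the first scheme, here is a minimal sketch (our own, assuming PyTorch; class and parameter names are illustrative and not taken from the authors' code) of an encoder block whose convolution weights are shared across modalities while each modality keeps its own batch normalization:

```python
import torch
import torch.nn as nn

class ModalitySpecificBN(nn.Module):
    """One BatchNorm2d per modality; every other weight in the
    network can then be shared across modalities."""
    def __init__(self, num_features: int, num_modalities: int = 2):
        super().__init__()
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_modalities)]
        )

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        return self.bns[modality](x)

class SharedEncoderBlock(nn.Module):
    """Conv filters are shared by all modalities; only BN statistics
    and affine parameters are private to each modality."""
    def __init__(self, in_ch: int, out_ch: int, num_modalities: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = ModalitySpecificBN(out_ch, num_modalities)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        return self.relu(self.bn(self.conv(x), modality))

# Usage: the same block encodes RGB (modality 0) and depth (modality 1),
# so both modalities are filtered by identical convolution weights.
block = SharedEncoderBlock(3, 64)
rgb = torch.randn(4, 3, 32, 32)
depth = torch.randn(4, 3, 32, 32)
f_rgb, f_depth = block(rgb, modality=0), block(depth, modality=1)
```

Because only the BN layers are duplicated, the per-modality overhead is a few parameters per channel, which is what makes the shared-encoder design compact.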

    Supplementary Material

    ZIP File (mmfp1303aux.zip)
    supplementary-material.pdf contains supplementary details of our main paper.
    MP4 File (3394171.3413621.mp4)
We are glad to share our paper. This video introduces the paper, titled "Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion". The work proposes three interdependent yet parameter-free components: Parameter Sharing, Cross-Modality Channel Shuffle, and Modality-Specific Pixel Shift. These three components are carefully combined into two architectural designs for fusing multimodal features, aiming to promote feature representation learning while keeping the fusion model compact. We present results on two tasks, semantic segmentation and image translation, which demonstrate the effectiveness and generality of our work.
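
To make the two parameter-free operations named above concrete, here is a minimal, hedged sketch (again assuming PyTorch; the exact channel grouping and shift pattern used in the paper may differ) of a cross-modality channel shuffle that exchanges interleaved channels between two modality streams, and a pixel shift that translates four channel groups by one pixel in four directions:

```python
import torch

def cross_modality_channel_shuffle(a: torch.Tensor, b: torch.Tensor):
    """Exchange every second channel between two (N, C, H, W) feature
    maps, mixing information across modalities at zero parameter cost."""
    a_out, b_out = a.clone(), b.clone()
    a_out[:, 1::2] = b[:, 1::2]
    b_out[:, 1::2] = a[:, 1::2]
    return a_out, b_out

def pixel_shift(x: torch.Tensor) -> torch.Tensor:
    """Shift four equal channel groups of a (N, C, H, W) map by one
    pixel (up, down, left, right), zero-padding the vacated border."""
    out = torch.zeros_like(x)
    c = x.size(1) // 4
    out[:, :c, :-1, :] = x[:, :c, 1:, :]            # group 0: shift up
    out[:, c:2*c, 1:, :] = x[:, c:2*c, :-1, :]      # group 1: shift down
    out[:, 2*c:3*c, :, :-1] = x[:, 2*c:3*c, :, 1:]  # group 2: shift left
    out[:, 3*c:, :, 1:] = x[:, 3*c:, :, :-1]        # group 3: shift right
    return out

# Usage on two modality streams of matching shape.
f_rgb, f_depth = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
f_rgb, f_depth = cross_modality_channel_shuffle(f_rgb, f_depth)
f_rgb = pixel_shift(f_rgb)  # spatial re-arrangement, still parameter-free
```

Both functions add no learnable weights: the shuffle fuses information across modalities at the channel level, while the shift redistributes features spatially within channels.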





        Published In

        MM '20: Proceedings of the 28th ACM International Conference on Multimedia
        October 2020
        4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. asymmetric operations
        2. bidirectional fusion
        3. compact network design
        4. multimodal learning

        Qualifiers

        • Research-article

        Funding Sources

        • National Science Foundation of China (NSFC)
        • German Research Foundation (DFG)

        Conference

        MM '20

        Acceptance Rates

        Overall Acceptance Rate 995 of 4,171 submissions, 24%



        Article Metrics

• Downloads (last 12 months): 117
• Downloads (last 6 weeks): 18
Reflects downloads up to 26 Jul 2024

Cited By
• (2024) Quad-Biometrics for Few-Shot User Identification. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, pp. 560-564. DOI: 10.1145/3605098.3636027. Online publication date: 8-Apr-2024.
• (2024) Research on Semantic Description Algorithm for Dual-Feature Fusion Images Based on Transformer Networks. 2024 39th Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 1631-1636. DOI: 10.1109/YAC63405.2024.10598716. Online publication date: 7-Jun-2024.
• (2024) Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1009-1019. DOI: 10.1109/WACV57701.2024.00106. Online publication date: 3-Jan-2024.
• (2024) S³M-Net: Joint Learning of Semantic Segmentation and Stereo Matching for Autonomous Driving. IEEE Transactions on Intelligent Vehicles 9(2), pp. 3940-3951. DOI: 10.1109/TIV.2024.3357056. Online publication date: Mar-2024.
• (2024) Texture-Aware Causal Feature Extraction Network for Multimodal Remote Sensing Data Classification. IEEE Transactions on Geoscience and Remote Sensing 62, pp. 1-12. DOI: 10.1109/TGRS.2024.3368091. Online publication date: 2024.
• (2024) Indoor semantic segmentation based on Swin-Transformer. Journal of Visual Communication and Image Representation 98, article 103991. DOI: 10.1016/j.jvcir.2023.103991. Online publication date: Mar-2024.
• (2024) Triple-modality interaction for deepfake detection on zero-shot identity. Information Fusion, article 102424. DOI: 10.1016/j.inffus.2024.102424. Online publication date: May-2024.
• (2023) Automatic Network Architecture Search for RGB-D Semantic Segmentation. Proceedings of the 31st ACM International Conference on Multimedia, pp. 3777-3786. DOI: 10.1145/3581783.3612288. Online publication date: 26-Oct-2023.
• (2023) Asymmetric Feature Fusion Network for Hyperspectral and SAR Image Classification. IEEE Transactions on Neural Networks and Learning Systems 34(10), pp. 8057-8070. DOI: 10.1109/TNNLS.2022.3149394. Online publication date: Oct-2023.
• (2023) CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers. IEEE Transactions on Intelligent Transportation Systems 24(12), pp. 14679-14694. DOI: 10.1109/TITS.2023.3300537. Online publication date: Dec-2023.
