DOI: 10.1145/3394171.3413621
Research article

Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion

Published: 12 October 2020
Abstract

We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network. The framework consists of two innovative fusion schemes. First, unlike existing multimodal methods that require individual encoders for different modalities, we verify that multimodal features can be learnt within a single shared network by merely maintaining modality-specific batch normalization layers in the encoder, which also enables implicit fusion via joint feature representation learning. Second, we propose a bidirectional multi-layer fusion scheme, in which multimodal features can be exploited progressively. To take advantage of this scheme, we introduce two asymmetric fusion operations, channel shuffle and pixel shift, which learn different fused features with respect to different fusion directions. Both operations are parameter-free; they strengthen multimodal feature interactions across channels and enhance spatial feature discrimination within channels. We conduct extensive experiments on semantic segmentation and image translation tasks, using three publicly available datasets covering diverse modalities. Results indicate that our proposed framework is general, compact, and superior to state-of-the-art fusion frameworks.
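
As a concrete illustration of the first scheme, here is a minimal sketch (our own, assuming PyTorch; class and parameter names are illustrative and not taken from the authors' code) of an encoder block whose convolution weights are shared across modalities while each modality keeps its own batch normalization:

```python
import torch
import torch.nn as nn

class ModalitySpecificBN(nn.Module):
    """One BatchNorm2d per modality; every other weight in the
    network can then be shared across modalities."""
    def __init__(self, num_features: int, num_modalities: int = 2):
        super().__init__()
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_modalities)]
        )

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        return self.bns[modality](x)

class SharedEncoderBlock(nn.Module):
    """Conv filters are shared by all modalities; only BN statistics
    and affine parameters are private to each modality."""
    def __init__(self, in_ch: int, out_ch: int, num_modalities: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = ModalitySpecificBN(out_ch, num_modalities)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        return self.relu(self.bn(self.conv(x), modality))

# Usage: the same block encodes RGB (modality 0) and depth (modality 1),
# so both modalities are filtered by identical convolution weights.
block = SharedEncoderBlock(3, 64)
rgb = torch.randn(4, 3, 32, 32)
depth = torch.randn(4, 3, 32, 32)
f_rgb, f_depth = block(rgb, modality=0), block(depth, modality=1)
```

Because only the BN layers are duplicated, the per-modality overhead is a few parameters per channel, which is what makes the shared-encoder design compact.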

    Supplementary Material

    ZIP File (mmfp1303aux.zip)
    supplementary-material.pdf contains supplementary details of our main paper.
    MP4 File (3394171.3413621.mp4)
We are glad to share our paper. This video introduces the paper, titled "Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion". The work proposes three interdependent yet parameter-free components: Parameter Sharing, Cross-Modality Channel Shuffle, and Modality-Specific Pixel Shift. These three components are carefully combined into two architectural designs for fusing multimodal features, aiming to promote feature representation learning while keeping the fusion model compact. We present results on two tasks, semantic segmentation and image translation, which demonstrate the effectiveness and generality of our work.
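
To make the two parameter-free operations named above concrete, here is a minimal, hedged sketch (again assuming PyTorch; the exact channel grouping and shift pattern used in the paper may differ) of a cross-modality channel shuffle that exchanges interleaved channels between two modality streams, and a pixel shift that translates four channel groups by one pixel in four directions:

```python
import torch

def cross_modality_channel_shuffle(a: torch.Tensor, b: torch.Tensor):
    """Exchange every second channel between two (N, C, H, W) feature
    maps, mixing information across modalities at zero parameter cost."""
    a_out, b_out = a.clone(), b.clone()
    a_out[:, 1::2] = b[:, 1::2]
    b_out[:, 1::2] = a[:, 1::2]
    return a_out, b_out

def pixel_shift(x: torch.Tensor) -> torch.Tensor:
    """Shift four equal channel groups of a (N, C, H, W) map by one
    pixel (up, down, left, right), zero-padding the vacated border."""
    out = torch.zeros_like(x)
    c = x.size(1) // 4
    out[:, :c, :-1, :] = x[:, :c, 1:, :]            # group 0: shift up
    out[:, c:2*c, 1:, :] = x[:, c:2*c, :-1, :]      # group 1: shift down
    out[:, 2*c:3*c, :, :-1] = x[:, 2*c:3*c, :, 1:]  # group 2: shift left
    out[:, 3*c:, :, 1:] = x[:, 3*c:, :, :-1]        # group 3: shift right
    return out

# Usage on two modality streams of matching shape.
f_rgb, f_depth = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
f_rgb, f_depth = cross_modality_channel_shuffle(f_rgb, f_depth)
f_rgb = pixel_shift(f_rgb)  # spatial re-arrangement, still parameter-free
```

Both functions add no learnable weights: the shuffle fuses information across modalities at the channel level, while the shift redistributes features spatially within channels.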





        Published In

        MM '20: Proceedings of the 28th ACM International Conference on Multimedia
        October 2020
        4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. asymmetric operations
        2. bidirectional fusion
        3. compact network design
        4. multimodal learning

        Qualifiers

        • Research-article

        Funding Sources

        • National Science Foundation of China (NSFC)
        • German Research Foundation (DFG)

        Conference

        MM '20

        Acceptance Rates

        Overall Acceptance Rate 995 of 4,171 submissions, 24%



        Article Metrics

• Downloads (last 12 months): 117
• Downloads (last 6 weeks): 18
Reflects downloads up to 26 Jul 2024

Cited By
• (2024) Quad-Biometrics for Few-Shot User Identification. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, pp. 560-564. DOI: 10.1145/3605098.3636027. Online publication date: 8-Apr-2024.
• (2024) Research on Semantic Description Algorithm for Dual-Feature Fusion Images Based on Transformer Networks. 2024 39th Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 1631-1636. DOI: 10.1109/YAC63405.2024.10598716. Online publication date: 7-Jun-2024.
• (2024) Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1009-1019. DOI: 10.1109/WACV57701.2024.00106. Online publication date: 3-Jan-2024.
• (2024) S³M-Net: Joint Learning of Semantic Segmentation and Stereo Matching for Autonomous Driving. IEEE Transactions on Intelligent Vehicles 9(2), pp. 3940-3951. DOI: 10.1109/TIV.2024.3357056. Online publication date: Mar-2024.
• (2024) Texture-Aware Causal Feature Extraction Network for Multimodal Remote Sensing Data Classification. IEEE Transactions on Geoscience and Remote Sensing 62, pp. 1-12. DOI: 10.1109/TGRS.2024.3368091. Online publication date: 2024.
• (2024) Indoor semantic segmentation based on Swin-Transformer. Journal of Visual Communication and Image Representation 98, article 103991. DOI: 10.1016/j.jvcir.2023.103991. Online publication date: Mar-2024.
• (2024) Triple-modality interaction for deepfake detection on zero-shot identity. Information Fusion, article 102424. DOI: 10.1016/j.inffus.2024.102424. Online publication date: May-2024.
• (2023) Automatic Network Architecture Search for RGB-D Semantic Segmentation. Proceedings of the 31st ACM International Conference on Multimedia, pp. 3777-3786. DOI: 10.1145/3581783.3612288. Online publication date: 26-Oct-2023.
• (2023) Asymmetric Feature Fusion Network for Hyperspectral and SAR Image Classification. IEEE Transactions on Neural Networks and Learning Systems 34(10), pp. 8057-8070. DOI: 10.1109/TNNLS.2022.3149394. Online publication date: Oct-2023.
• (2023) CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers. IEEE Transactions on Intelligent Transportation Systems 24(12), pp. 14679-14694. DOI: 10.1109/TITS.2023.3300537. Online publication date: Dec-2023.
