DOI: 10.1145/3126686.3126723

Deep Cross-Modal Audio-Visual Generation

Published: 23 October 2017
Abstract

    Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and various studies have found strong correlations between human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem by leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and study two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we also compose two new datasets of paired images and sounds of musical performances on different instruments. Our experiments, using both classification and human evaluation, demonstrate that our model can generate one modality (audio or visual) from the other to a good extent. Our experiments on various design choices, along with the datasets, will facilitate future research in this new problem space.
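    To make the conditional-GAN setup the abstract describes concrete, here is a minimal sketch in the sound-to-image direction: a generator maps a noise vector plus an audio embedding to an image, and a discriminator scores (image, audio embedding) pairs. This is not the authors' released code; the framework (PyTorch), all layer sizes, the 128-dimensional audio embedding, and the 32x32 output resolution are illustrative assumptions.

```python
# Minimal conditional-GAN sketch for sound-to-image generation
# (illustrative only; all dimensions below are assumptions).
import torch
import torch.nn as nn

NOISE_DIM, AUDIO_DIM, IMG_CH = 100, 128, 3  # assumed sizes

class Generator(nn.Module):
    """Maps [noise, audio embedding] to a 3x32x32 image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # 1x1 input -> 4x4 feature map
            nn.ConvTranspose2d(NOISE_DIM + AUDIO_DIM, 256, 4, 1, 0),
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1),    # 4x4 -> 8x8
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),     # 8x8 -> 16x16
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, IMG_CH, 4, 2, 1),  # 16x16 -> 32x32
            nn.Tanh(),
        )

    def forward(self, z, audio_emb):
        # concatenate noise and conditioning vector, reshape to a 1x1 "image"
        x = torch.cat([z, audio_emb], dim=1)[:, :, None, None]
        return self.net(x)

class Discriminator(nn.Module):
    """Scores an (image, audio embedding) pair as real or fake."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(IMG_CH, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),  # 32 -> 16
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, True),     # 16 -> 8
        )
        self.fc = nn.Sequential(nn.Linear(128 * 8 * 8 + AUDIO_DIM, 1), nn.Sigmoid())

    def forward(self, img, audio_emb):
        h = self.conv(img).flatten(1)
        # condition on the audio embedding by concatenating it with image features
        return self.fc(torch.cat([h, audio_emb], dim=1))

# One adversarial step on a batch of (real image, audio embedding) pairs.
g, d, bce = Generator(), Discriminator(), nn.BCELoss()
real_img = torch.randn(8, IMG_CH, 32, 32)  # stand-in for a real image batch
emb = torch.randn(8, AUDIO_DIM)            # stand-in for audio encodings
fake_img = g(torch.randn(8, NOISE_DIM), emb)
d_loss = bce(d(real_img, emb), torch.ones(8, 1)) + \
         bce(d(fake_img.detach(), emb), torch.zeros(8, 1))
g_loss = bce(d(fake_img, emb), torch.ones(8, 1))
```

    The image-to-sound direction follows the same pattern with the roles of the two modalities swapped; conditioning the discriminator on the embedding, rather than on the image alone, is what ties each generated sample to its source in the other modality.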

    Published In

    Thematic Workshops '17: Proceedings of the on Thematic Workshops of ACM Multimedia 2017
    October 2017
    558 pages
    ISBN:9781450354165
    DOI:10.1145/3126686
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. audio-visual
    2. cross-modal generation
    3. generative adversarial networks

    Qualifiers

    • Research-article

    Conference

    MM '17: ACM Multimedia Conference
    October 23 - 27, 2017
    Mountain View, California, USA
