DOI: 10.1145/3126686.3126723

Deep Cross-Modal Audio-Visual Generation

Published: 23 October 2017
Abstract

    Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and various studies have found strong correlations between human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem by leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and study two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we also compose two new datasets of paired images and sounds of musical performances on different instruments. Our experiments, using both classification and human evaluation, demonstrate that our model can generate one modality (audio or visual) from the other to a good extent. Our experiments on various design choices, along with the datasets, will facilitate future research in this new problem space.
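    To make the conditional-GAN setup the abstract describes concrete, here is a minimal sketch in the sound-to-image direction: a generator maps a noise vector plus an audio embedding to an image, and a discriminator scores (image, audio embedding) pairs. This is not the authors' released code; the framework (PyTorch), all layer sizes, the 128-dimensional audio embedding, and the 32x32 output resolution are illustrative assumptions.

```python
# Minimal conditional-GAN sketch for sound-to-image generation
# (illustrative only; all dimensions below are assumptions).
import torch
import torch.nn as nn

NOISE_DIM, AUDIO_DIM, IMG_CH = 100, 128, 3  # assumed sizes

class Generator(nn.Module):
    """Maps [noise, audio embedding] to a 3x32x32 image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # 1x1 input -> 4x4 feature map
            nn.ConvTranspose2d(NOISE_DIM + AUDIO_DIM, 256, 4, 1, 0),
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1),    # 4x4 -> 8x8
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),     # 8x8 -> 16x16
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, IMG_CH, 4, 2, 1),  # 16x16 -> 32x32
            nn.Tanh(),
        )

    def forward(self, z, audio_emb):
        # concatenate noise and conditioning vector, reshape to a 1x1 "image"
        x = torch.cat([z, audio_emb], dim=1)[:, :, None, None]
        return self.net(x)

class Discriminator(nn.Module):
    """Scores an (image, audio embedding) pair as real or fake."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(IMG_CH, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),  # 32 -> 16
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, True),     # 16 -> 8
        )
        self.fc = nn.Sequential(nn.Linear(128 * 8 * 8 + AUDIO_DIM, 1), nn.Sigmoid())

    def forward(self, img, audio_emb):
        h = self.conv(img).flatten(1)
        # condition on the audio embedding by concatenating it with image features
        return self.fc(torch.cat([h, audio_emb], dim=1))

# One adversarial step on a batch of (real image, audio embedding) pairs.
g, d, bce = Generator(), Discriminator(), nn.BCELoss()
real_img = torch.randn(8, IMG_CH, 32, 32)  # stand-in for a real image batch
emb = torch.randn(8, AUDIO_DIM)            # stand-in for audio encodings
fake_img = g(torch.randn(8, NOISE_DIM), emb)
d_loss = bce(d(real_img, emb), torch.ones(8, 1)) + \
         bce(d(fake_img.detach(), emb), torch.zeros(8, 1))
g_loss = bce(d(fake_img, emb), torch.ones(8, 1))
```

    The image-to-sound direction follows the same pattern with the roles of the two modalities swapped; conditioning the discriminator on the embedding, rather than on the image alone, is what ties each generated sample to its source in the other modality.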

    Published In

    Thematic Workshops '17: Proceedings of the on Thematic Workshops of ACM Multimedia 2017
    October 2017
    558 pages
    ISBN:9781450354165
    DOI:10.1145/3126686
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. audio-visual
    2. cross-modal generation
    3. generative adversarial networks

    Qualifiers

    • Research-article

    Conference

    MM '17: ACM Multimedia Conference
    October 23 - 27, 2017
    Mountain View, California, USA
