Research Article
DOI: 10.1145/3485447.3512011

Contrastive Learning with Positive-Negative Frame Mask for Music Representation

Published: 25 April 2022

Abstract

Self-supervised learning, especially contrastive learning, has contributed substantially to the development of many deep learning research fields. Recently, researchers in acoustic signal processing have taken note of this success and leveraged contrastive learning to obtain better music representations. Typically, existing approaches maximize the similarity between two distorted audio segments sampled from the same piece of music; in other words, they enforce semantic agreement at the music level. However, these coarse-grained methods overlook inessential or noisy elements at the frame level, which may be detrimental to learning an effective representation of music. To address this, this paper proposes a novel Positive-nEgative frame mask for Music Representation built on the contrastive learning framework, abbreviated PEMR. Concretely, PEMR incorporates a Positive-Negative Mask Generation module that leverages transformer blocks to generate frame masks over the Log-Mel spectrogram. By masking the important or the inessential components of the input, we generate self-augmented negative and positive samples, respectively. We devise a novel contrastive learning objective that accommodates both the self-augmented positives and negatives sampled from the same music. We conduct experiments on four public datasets. The results on two music-related downstream tasks, music classification and cover song identification, demonstrate the generalization ability and transferability of the music representations learned by PEMR.
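As a rough sketch of the idea described above (not the authors' implementation), the following code masks frames of a log-mel spectrogram according to per-frame importance scores to form a positive view (least-important frames removed) and a negative view (most-important frames removed), then scores them with a simplified InfoNCE-style objective. The `frame_masks` and `contrastive_loss` functions and the `ratio` parameter are hypothetical; in the paper the importance scores come from transformer blocks and the objective handles multiple samples.

```python
import numpy as np

def frame_masks(spec, scores, ratio=0.3):
    """Create self-augmented positive/negative views of a log-mel spectrogram.

    spec   : (n_frames, n_mels) log-mel spectrogram
    scores : (n_frames,) per-frame importance (a placeholder here; PEMR
             derives these from transformer blocks)
    ratio  : fraction of frames to zero out in each view
    """
    n_mask = max(1, int(len(scores) * ratio))
    order = np.argsort(scores)           # frame indices, ascending importance
    positive = spec.copy()
    positive[order[:n_mask]] = 0.0       # positive: drop least-important frames
    negative = spec.copy()
    negative[order[-n_mask:]] = 0.0      # negative: drop most-important frames
    return positive, negative

def _cos(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def contrastive_loss(anchor, pos, neg, temperature=0.1):
    """InfoNCE-style loss with one self-augmented positive and one
    self-augmented negative embedding per anchor (a simplification of
    the paper's objective)."""
    s_pos = np.exp(_cos(anchor, pos) / temperature)
    s_neg = np.exp(_cos(anchor, neg) / temperature)
    return -np.log(s_pos / (s_pos + s_neg))
```

Minimizing this loss pulls the anchor embedding toward the view that kept the important frames while pushing it away from the view that lost them, which is the intuition behind using both mask polarities from the same clip.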




Published In

WWW '22: Proceedings of the ACM Web Conference 2022
April 2022
3764 pages
ISBN:9781450390965
DOI:10.1145/3485447
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Attention
  2. Contrastive Learning
  3. Music Representation
  4. Representation Learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Zhejiang Natural Science Foundation
  • National Natural Science Foundation of China

Conference

WWW '22: The ACM Web Conference 2022
April 25-29, 2022
Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


Article Metrics

  • Downloads (Last 12 months)40
  • Downloads (Last 6 weeks)2
Reflects downloads up to 25 Feb 2025

Cited By

  • (2024) AMG-Embedding: A Self-Supervised Embedding Approach for Audio Identification. Proceedings of the 32nd ACM International Conference on Multimedia, 9544-9553. https://doi.org/10.1145/3664647.3681647
  • (2024) Not All Embeddings are Created Equal: Towards Robust Cross-domain Recommendation via Contrastive Learning. Proceedings of the ACM Web Conference 2024, 3195-3206. https://doi.org/10.1145/3589334.3645357
  • (2024) On the Effect of Data-Augmentation on Local Embedding Properties in the Contrastive Learning of Music Audio Representations. ICASSP 2024, 671-675. https://doi.org/10.1109/ICASSP48485.2024.10446274
  • (2023) Training Audio Transformers for Cover Song Identification. EURASIP Journal on Audio, Speech, and Music Processing, 2023:1. https://doi.org/10.1186/s13636-023-00297-4
  • (2023) Supervised Contrastive Learning for Musical Onset Detection. Proceedings of the 18th International Audio Mostly Conference, 130-135. https://doi.org/10.1145/3616195.3616215
  • (2023) Modal-aware Bias Constrained Contrastive Learning for Multimodal Recommendation. Proceedings of the 31st ACM International Conference on Multimedia, 6369-6378. https://doi.org/10.1145/3581783.3612568
  • (2023) DisCover: Disentangled Music Representation Learning for Cover Song Identification. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 453-463. https://doi.org/10.1145/3539618.3591664
  • (2023) Self-Supervised Hierarchical Metrical Structure Modeling. ICASSP 2023, 1-5. https://doi.org/10.1109/ICASSP49357.2023.10096498
  • (2023) Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity. ICASSP 2023, 1-5. https://doi.org/10.1109/ICASSP49357.2023.10095058
