Research Article
DOI: 10.1145/3485447.3512011

Contrastive Learning with Positive-Negative Frame Mask for Music Representation

Published: 25 April 2022

Abstract

Self-supervised learning, especially contrastive learning, has contributed substantially to the development of many deep learning research fields. Recently, researchers in acoustic signal processing have taken note of this success and leveraged contrastive learning to obtain better music representations. Typically, existing approaches maximize the similarity between two distorted audio segments sampled from the same piece of music; in other words, they enforce semantic agreement at the music level. However, these coarse-grained methods overlook inessential or noisy elements at the frame level, which may be detrimental to learning an effective representation of music. To address this, this paper proposes a novel Positive-nEgative frame mask for Music Representation built on the contrastive learning framework, abbreviated PEMR. Concretely, PEMR incorporates a Positive-Negative Mask Generation module that leverages transformer blocks to generate frame masks over the Log-Mel spectrogram. By masking the important or the inessential components of the input, we generate self-augmented negative and positive samples, respectively. We devise a novel contrastive learning objective that accommodates both the self-augmented positives and negatives sampled from the same music. We conduct experiments on four public datasets. The results on two music-related downstream tasks, music classification and cover song identification, demonstrate the generalization ability and transferability of the music representations learned by PEMR.
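As a rough sketch of the idea described above (not the authors' implementation), the following code masks frames of a log-mel spectrogram according to per-frame importance scores to form a positive view (least-important frames removed) and a negative view (most-important frames removed), then scores them with a simplified InfoNCE-style objective. The `frame_masks` and `contrastive_loss` functions and the `ratio` parameter are hypothetical; in the paper the importance scores come from transformer blocks and the objective handles multiple samples.

```python
import numpy as np

def frame_masks(spec, scores, ratio=0.3):
    """Create self-augmented positive/negative views of a log-mel spectrogram.

    spec   : (n_frames, n_mels) log-mel spectrogram
    scores : (n_frames,) per-frame importance (a placeholder here; PEMR
             derives these from transformer blocks)
    ratio  : fraction of frames to zero out in each view
    """
    n_mask = max(1, int(len(scores) * ratio))
    order = np.argsort(scores)           # frame indices, ascending importance
    positive = spec.copy()
    positive[order[:n_mask]] = 0.0       # positive: drop least-important frames
    negative = spec.copy()
    negative[order[-n_mask:]] = 0.0      # negative: drop most-important frames
    return positive, negative

def _cos(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def contrastive_loss(anchor, pos, neg, temperature=0.1):
    """InfoNCE-style loss with one self-augmented positive and one
    self-augmented negative embedding per anchor (a simplification of
    the paper's objective)."""
    s_pos = np.exp(_cos(anchor, pos) / temperature)
    s_neg = np.exp(_cos(anchor, neg) / temperature)
    return -np.log(s_pos / (s_pos + s_neg))
```

Minimizing this loss pulls the anchor embedding toward the view that kept the important frames while pushing it away from the view that lost them, which is the intuition behind using both mask polarities from the same clip.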




Published In

WWW '22: Proceedings of the ACM Web Conference 2022
April 2022
3764 pages
ISBN:9781450390965
DOI:10.1145/3485447
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Attention
  2. Contrastive Learning
  3. Music Representation
  4. Representation Learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Zhejiang Natural Science Foundation
  • National Natural Science Foundation of China

Conference

WWW '22: The ACM Web Conference 2022
April 25-29, 2022
Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


Article Metrics

  • Downloads (Last 12 months)40
  • Downloads (Last 6 weeks)2
Reflects downloads up to 25 Feb 2025

Cited By

  • (2024) AMG-Embedding: A Self-Supervised Embedding Approach for Audio Identification. Proceedings of the 32nd ACM International Conference on Multimedia, 9544-9553. https://doi.org/10.1145/3664647.3681647
  • (2024) Not All Embeddings are Created Equal: Towards Robust Cross-domain Recommendation via Contrastive Learning. Proceedings of the ACM Web Conference 2024, 3195-3206. https://doi.org/10.1145/3589334.3645357
  • (2024) On the Effect of Data-Augmentation on Local Embedding Properties in the Contrastive Learning of Music Audio Representations. ICASSP 2024, 671-675. https://doi.org/10.1109/ICASSP48485.2024.10446274
  • (2023) Training Audio Transformers for Cover Song Identification. EURASIP Journal on Audio, Speech, and Music Processing, 2023:1. https://doi.org/10.1186/s13636-023-00297-4
  • (2023) Supervised Contrastive Learning for Musical Onset Detection. Proceedings of the 18th International Audio Mostly Conference, 130-135. https://doi.org/10.1145/3616195.3616215
  • (2023) Modal-aware Bias Constrained Contrastive Learning for Multimodal Recommendation. Proceedings of the 31st ACM International Conference on Multimedia, 6369-6378. https://doi.org/10.1145/3581783.3612568
  • (2023) DisCover: Disentangled Music Representation Learning for Cover Song Identification. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 453-463. https://doi.org/10.1145/3539618.3591664
  • (2023) Self-Supervised Hierarchical Metrical Structure Modeling. ICASSP 2023, 1-5. https://doi.org/10.1109/ICASSP49357.2023.10096498
  • (2023) Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity. ICASSP 2023, 1-5. https://doi.org/10.1109/ICASSP49357.2023.10095058
