Event Localization in Music Auto-tagging

Published: 01 October 2016

Abstract

In music auto-tagging, models are developed to automatically label a music clip with attributes such as instruments, styles, or acoustic properties. Many of these tags are actually descriptors of local events in a music clip rather than holistic descriptions of the whole clip. Localizing such tags in time could transform the way people retrieve and interact with music, but little work has been done to date because of the scarcity of labeled data with frame-level granularity. Most labeled data for training a learning-based music auto-tagging model are annotated at the clip level, providing no cues as to when and for how long these attributes appear in a clip. To bridge this gap, we propose in this paper a convolutional neural network (CNN) architecture that makes accurate frame-level predictions of tags in unseen music clips while using only clip-level annotations in the training phase. Our approach is motivated by recent advances in computer vision for localizing visual objects, but we propose new designs of the CNN architecture to account for the temporal structure of music and the variable duration of such local tags in time. We report extensive experiments to gain insights into the problem of event localization in music, and validate through experiments the effectiveness of the proposed approach. In addition to quantitative evaluations, we also present qualitative analyses showing that the model can indeed learn certain characteristics of music tags.
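The paper's exact architecture is not reproduced here, but the core weakly-supervised idea the abstract describes (a network that emits frame-level tag scores yet is trained with clip-level labels, by pooling the frame scores over time into a clip-level prediction) can be sketched as follows. This is a minimal PyTorch illustration under assumed settings: mel-spectrogram input, 1-D convolutions over time, and max-pooling over the temporal axis. All layer sizes and names are hypothetical, not the authors'.

```python
# Minimal sketch of weakly-supervised event localization for auto-tagging.
# A fully convolutional net produces a score per tag per frame; temporal
# max-pooling reduces these to clip-level scores so that clip-level labels
# can supervise training. At test time, frame_scores localize each tag.
import torch
import torch.nn as nn

class FrameTagger(nn.Module):
    def __init__(self, n_mels=128, n_tags=50):
        super().__init__()
        # 1-D convolutions over time; input shape (batch, n_mels, frames)
        self.features = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
        )
        # 1x1 convolution: one score per tag at every frame
        self.frame_head = nn.Conv1d(256, n_tags, kernel_size=1)

    def forward(self, mel):
        frame_scores = self.frame_head(self.features(mel))  # (B, n_tags, T)
        clip_scores = frame_scores.max(dim=2).values        # (B, n_tags)
        return frame_scores, clip_scores

# Training step with clip-level (multi-label) annotations only.
model = FrameTagger()
mel = torch.randn(4, 128, 1000)                # 4 clips, 1000 frames each
labels = torch.randint(0, 2, (4, 50)).float()  # clip-level tag labels
_, clip_scores = model(mel)
loss = nn.BCEWithLogitsLoss()(clip_scores, labels)
loss.backward()  # gradient reaches the frames that triggered the max
```

Max-pooling over time is one plausible aggregation choice: it credits the single strongest frame for each tag, which suits tags describing short local events, whereas average pooling would suit tags that hold over the whole clip. This trade-off is one aspect of the variable tag duration the paper's architecture is designed to handle.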




Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia
October 2016, 1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. convolutional neural network
  2. music auto-tagging
  3. music event localization

Qualifiers

  • Research-article

Conference

MM '16: ACM Multimedia Conference
October 15 - 19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions (22%).
Overall acceptance rate: 2,145 of 8,556 submissions (25%).


Cited By

  • (2024) Playlist Continuation of Cold-Start Songs. 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 141-147, Aug 2024. DOI: 10.1109/MIPR62202.2024.00029
  • (2024) Few-Shot Bioacoustics Event Detection Using Transductive Inference With Data Augmentation. IEEE Sensors Letters, 8(3):1-4, Mar 2024. DOI: 10.1109/LSENS.2024.3363021
  • (2022) Adaptive Few-Shot Learning Algorithm for Rare Sound Event Detection. 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1-7, Jul 2022. DOI: 10.1109/IJCNN55064.2022.9892604
  • (2021) Augmentation Methods on Monophonic Audio for Instrument Classification in Polyphonic Music. 2020 28th European Signal Processing Conference (EUSIPCO), pp. 156-160, Jan 2021. DOI: 10.23919/Eusipco47968.2020.9287745
  • (2021) Learning to Visualize Music Through Shot Sequence for Automatic Concert Video Mashup. IEEE Transactions on Multimedia, 23:1731-1743, 2021. DOI: 10.1109/TMM.2020.3003631
  • (2020) Learning From Music to Visual Storytelling of Shots: A Deep Interactive Learning Mechanism. Proceedings of the 28th ACM International Conference on Multimedia, pp. 102-110, Oct 2020. DOI: 10.1145/3394171.3413985
  • (2020) Metric Learning with Background Noise Class for Few-Shot Detection of Rare Sound Events. ICASSP 2020, pp. 616-620, May 2020. DOI: 10.1109/ICASSP40776.2020.9054712
  • (2020) Speech-To-Singing Conversion in an Encoder-Decoder Framework. ICASSP 2020, pp. 261-265, May 2020. DOI: 10.1109/ICASSP40776.2020.9054473
  • (2019) A Hierarchical Attentive Deep Neural Network Model for Semantic Music Annotation Integrating Multiple Music Representations. Proceedings of the 2019 International Conference on Multimedia Retrieval, pp. 150-158, Jun 2019. DOI: 10.1145/3323873.3325031
  • (2019) Weakly-Supervised Visual Instrument-Playing Action Detection in Videos. IEEE Transactions on Multimedia, 21(4):887-901, Apr 2019. DOI: 10.1109/TMM.2018.2871418
