Event Localization in Music Auto-tagging

Published: 01 October 2016

Abstract

In music auto-tagging, models are developed to automatically label a music clip with attributes such as instruments, styles, or acoustic properties. Many of these tags are actually descriptors of local events in a music clip rather than holistic descriptions of the whole clip. Localizing such tags in time could transform the way people retrieve and interact with music, but little work has been done to date because of the scarcity of labeled data with frame-level granularity. Most labeled data for training a learning-based music auto-tagging model are annotated at the clip level, providing no cues as to when and for how long these attributes appear in a clip. To bridge this gap, we propose in this paper a convolutional neural network (CNN) architecture that makes accurate frame-level predictions of tags in unseen music clips while using only clip-level annotations in the training phase. Our approach is motivated by recent advances in computer vision for localizing visual objects, but we propose new designs of the CNN architecture to account for the temporal structure of music and the variable duration of such local tags in time. We report extensive experiments to gain insights into the problem of event localization in music, and validate through experiments the effectiveness of the proposed approach. In addition to quantitative evaluations, we also present qualitative analyses showing that the model can indeed learn certain characteristics of music tags.
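The paper's exact architecture is not reproduced here, but the core weakly-supervised idea the abstract describes (a network that emits frame-level tag scores yet is trained with clip-level labels, by pooling the frame scores over time into a clip-level prediction) can be sketched as follows. This is a minimal PyTorch illustration under assumed settings: mel-spectrogram input, 1-D convolutions over time, and max-pooling over the temporal axis. All layer sizes and names are hypothetical, not the authors'.

```python
# Minimal sketch of weakly-supervised event localization for auto-tagging.
# A fully convolutional net produces a score per tag per frame; temporal
# max-pooling reduces these to clip-level scores so that clip-level labels
# can supervise training. At test time, frame_scores localize each tag.
import torch
import torch.nn as nn

class FrameTagger(nn.Module):
    def __init__(self, n_mels=128, n_tags=50):
        super().__init__()
        # 1-D convolutions over time; input shape (batch, n_mels, frames)
        self.features = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
        )
        # 1x1 convolution: one score per tag at every frame
        self.frame_head = nn.Conv1d(256, n_tags, kernel_size=1)

    def forward(self, mel):
        frame_scores = self.frame_head(self.features(mel))  # (B, n_tags, T)
        clip_scores = frame_scores.max(dim=2).values        # (B, n_tags)
        return frame_scores, clip_scores

# Training step with clip-level (multi-label) annotations only.
model = FrameTagger()
mel = torch.randn(4, 128, 1000)                # 4 clips, 1000 frames each
labels = torch.randint(0, 2, (4, 50)).float()  # clip-level tag labels
_, clip_scores = model(mel)
loss = nn.BCEWithLogitsLoss()(clip_scores, labels)
loss.backward()  # gradient reaches the frames that triggered the max
```

Max-pooling over time is one plausible aggregation choice: it credits the single strongest frame for each tag, which suits tags describing short local events, whereas average pooling would suit tags that hold over the whole clip. This trade-off is one aspect of the variable tag duration the paper's architecture is designed to handle.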




Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia
October 2016, 1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. convolutional neural network
  2. music auto-tagging
  3. music event localization

Qualifiers

  • Research-article

Conference

MM '16: ACM Multimedia Conference
October 15 - 19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions (22%).
Overall acceptance rate: 2,145 of 8,556 submissions (25%).


Cited By

  • (2024) Playlist Continuation of Cold-Start Songs. 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 141-147, Aug 2024. DOI: 10.1109/MIPR62202.2024.00029
  • (2024) Few-Shot Bioacoustics Event Detection Using Transductive Inference With Data Augmentation. IEEE Sensors Letters, 8(3):1-4, Mar 2024. DOI: 10.1109/LSENS.2024.3363021
  • (2022) Adaptive Few-Shot Learning Algorithm for Rare Sound Event Detection. 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1-7, Jul 2022. DOI: 10.1109/IJCNN55064.2022.9892604
  • (2021) Augmentation Methods on Monophonic Audio for Instrument Classification in Polyphonic Music. 2020 28th European Signal Processing Conference (EUSIPCO), pp. 156-160, Jan 2021. DOI: 10.23919/Eusipco47968.2020.9287745
  • (2021) Learning to Visualize Music Through Shot Sequence for Automatic Concert Video Mashup. IEEE Transactions on Multimedia, 23:1731-1743, 2021. DOI: 10.1109/TMM.2020.3003631
  • (2020) Learning From Music to Visual Storytelling of Shots: A Deep Interactive Learning Mechanism. Proceedings of the 28th ACM International Conference on Multimedia, pp. 102-110, Oct 2020. DOI: 10.1145/3394171.3413985
  • (2020) Metric Learning with Background Noise Class for Few-Shot Detection of Rare Sound Events. ICASSP 2020, pp. 616-620, May 2020. DOI: 10.1109/ICASSP40776.2020.9054712
  • (2020) Speech-To-Singing Conversion in an Encoder-Decoder Framework. ICASSP 2020, pp. 261-265, May 2020. DOI: 10.1109/ICASSP40776.2020.9054473
  • (2019) A Hierarchical Attentive Deep Neural Network Model for Semantic Music Annotation Integrating Multiple Music Representations. Proceedings of the 2019 International Conference on Multimedia Retrieval, pp. 150-158, Jun 2019. DOI: 10.1145/3323873.3325031
  • (2019) Weakly-Supervised Visual Instrument-Playing Action Detection in Videos. IEEE Transactions on Multimedia, 21(4):887-901, Apr 2019. DOI: 10.1109/TMM.2018.2871418
