Abstract
The melody of a musical piece—informally, the part you would hum along with—is a useful and compact summary of a full audio recording. The extraction of melodic content has practical applications ranging from content-based audio retrieval to the analysis of musical structure. Whereas previous systems generate transcriptions based on a model of the harmonic (or periodic) structure of musical pitches, we present a classification-based system for performing automatic melody transcription that makes no assumptions beyond what is learned from its training data. We evaluate the success of our algorithm by predicting the melody of the ADC 2004 Melody Competition evaluation set, and we show that a simple frame-level note classifier, temporally smoothed by post-processing with a hidden Markov model, produces results comparable to state-of-the-art model-based transcription systems.
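The temporal smoothing described above can be sketched as Viterbi decoding over the classifier's per-frame note posteriors, using an HMM transition matrix that favors staying on the same note between frames. This is only an illustrative reconstruction: the function name, the uniform off-diagonal transitions, and the `self_prob` value are assumptions for the sketch (the paper's system learns its parameters from training data).

```python
import numpy as np

def viterbi_smooth(posteriors, self_prob=0.9):
    """Smooth frame-level note posteriors with a simple HMM.

    posteriors: (n_frames, n_notes) array of per-frame class probabilities.
    self_prob: hypothetical probability of holding the same note between
    frames; remaining mass is spread uniformly over the other notes.
    Returns the Viterbi path of note indices, one per frame.
    """
    n_frames, n_notes = posteriors.shape
    # Transition matrix: strong self-transitions, uniform otherwise.
    trans = np.full((n_notes, n_notes), (1.0 - self_prob) / (n_notes - 1))
    np.fill_diagonal(trans, self_prob)
    log_trans = np.log(trans)
    log_obs = np.log(posteriors + 1e-12)  # avoid log(0)

    delta = log_obs[0].copy()             # best log-prob ending in each note
    back = np.zeros((n_frames, n_notes), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans   # indexed (from_note, to_note)
        back[t] = scores.argmax(axis=0)       # best predecessor per note
        delta = scores.max(axis=0) + log_obs[t]

    # Trace back the highest-probability note sequence.
    path = np.zeros(n_frames, dtype=int)
    path[-1] = delta.argmax()
    for t in range(n_frames - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

In use, a single-frame misclassification is overridden because switching notes and switching back costs two low-probability transitions, which outweighs the momentary evidence for the wrong note.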
Editor: Gerhard Widmer
Cite this article
Ellis, D.P.W., Poliner, G.E. Classification-based melody transcription. Mach Learn 65, 439–456 (2006). https://doi.org/10.1007/s10994-006-8373-9