Music Retrieval: A Tutorial and Review

Nicola Orio
University of Padova
Abstract
The increasing availability of music in digital format needs to be
matched by the development of tools for accessing, filtering, classifying,
and retrieving music. The research area of Music Information
Retrieval (MIR) covers many of these aspects. The aim of this paper is
to present an overview of this vast and new field. A number of issues,
which are peculiar to the music language, are described—including
forms, formats, and dimensions of music—together with the typologies
of users and their information needs. To fulfil these needs a number
of approaches are discussed, from direct search to information filtering
and clustering of music documents. An overview of the techniques for
music processing, which are commonly exploited in many approaches,
is also presented. Evaluation and comparisons of the approaches on a
common benchmark are other important issues. To this end, a descrip-
tion of the initial efforts and evaluation campaigns for MIR is provided.
1
Introduction
with the real world works. Moreover, music is an art form that can be
both cultivated and popular, and sometimes it is impossible to draw a
line between the two, as for jazz or for most traditional music.
Or maybe the increasing interest toward digital music is moti-
vated by its portability and the possibility to access music while doing
another, possibly primary, activity. Perhaps users are not looking for
a cultural experience, or the enjoyment of artworks, but for a suitable
soundtrack for the many hours spent commuting, traveling, waiting
or even working, and studying. Last but not least, the music business
is pushing toward the continuous production of new musical works,
especially in genres like pop and rock. The continuous decrease of the
average age of persons that regularly purchase and consume music has
been paralleled by an increasing simplification of the structure of main-
stream music genres, requiring the continuous production of new music.
The number of items sold daily by Web-based music dealers, or down-
loaded from services like iTunes—not to mention the illegal sharing of
music files through peer-to-peer networks—shows how much music is
commercially and culturally important.
Music Information Retrieval (MIR) is an emerging research area
devoted to fulfilling users' music information needs. As will be seen,
despite the emphasis on retrieval in its name, MIR encompasses a num-
ber of different approaches aimed at music management, easy access,
and enjoyment. Most of the research work on MIR, of the proposed
techniques, and of the developed systems is content based.
The main idea underlying content-based approaches is that a doc-
ument can be described by a set of features that are directly com-
puted from its content. Usually, content-based access to multimedia
data requires specific methodologies that have to be tailored to each
particular medium. Yet, the core information retrieval (IR) techniques,
which are based on statistics and probability theory, may be more gen-
erally employed outside the textual case, because the underlying mod-
els are likely to describe fundamental characteristics shared by
different media, languages, and application domains [60]. For this rea-
son, the research results achieved in the area of IR, in particular in the
case of text documents, are a continuous reference for MIR approaches.
Already in 1996 McLane stated that a challenging research topic would
digital libraries and MIR would have required a paper by itself, because
it regards important issues on digital acquisition of musical works, the
creation of an infrastructure for the management of musical documents
and for the access to musical content, which are challenging problems
by themselves as discussed in [1, 29].
The focus of this paper is on tools, techniques, and approaches for
content-based MIR, rather than on systems that implement them. The
interested reader can find the descriptions of more than 35 systems
for music retrieval in [127], with links to their Web sites. An inter-
esting survey of a selection of 17 different systems has been presented
in [135]. Systems can be compared according to the retrieval tasks that
can be carried out, the size of the collections, and the techniques that
are employed. To this end, a mapping of the analyzed systems onto a
bidimensional space was proposed in [135], where the two dimensions of
interest were the target audience of the systems—e.g., industry, con-
sumer, professional—and the level at which retrieval is needed—e.g.,
from a particular instance of a work to a music genre.
The paper is structured as follows: This introductory section ends
with a short overview of some music concepts. Chapter 2 discusses the
peculiarities of the music language, introducing the dimensions that
characterize musical documents and that can be used to describe their
content. Chapter 3 highlights the main typologies of MIR users, intro-
ducing and specifying a number of information needs that have been
taken into account by different MIR approaches. The processing of
musical documents, aimed at extracting features related to their dimen-
sions of interest, is discussed in Chapter 4, followed by a classification
of the different facets of MIR research areas, which are reported in
Chapter 5. The efforts carried out for creating a shared evaluation
framework and their initial results are presented in Chapter 6. Finally,
some concluding considerations are drawn in Chapter 7.
without musical training. For this reason, this section presents a short
introduction to some basic concepts and terms.
2 Being very simple, these operative definitions may not be completely agreed upon by readers
with a musical background, because some relevant aspects are not taken into account. Yet,
their purpose is just to introduce some terminology that will be used in the next chapters.
Fig. 1.1 Example of a musical score (excerpt from Première Arabesque by Claude Debussy).
2
Characteristics of the Music Language
The effectiveness of tools for music retrieval depends on the way the
characteristics of music language, and somehow its peculiarities, are
modeled. For example, it is well known that textual IR methodolo-
gies are based on a number of shared assumptions about the writ-
ten text, in particular on the role of words as content descriptors,
depending on their frequency and distribution inside the documents
and across the collections. Similarly, image retrieval is based on the
assumption that images with similar distributions of colors and shapes may
be perceived as similar. Video retrieval extends this assumption
with the notion of similarity between frames, which are assumed to
be related to the concept of movement. All these assumptions are
based on scientific results in the fields of linguistics, information sci-
ence, psychology of visual perception, information engineering, and
so on.
In the case of music, concepts developed by music theorists,
musicologists, psychologists of perception, and information scientists
and engineers can be exploited to design and to refine the approaches to
a difficult task for a listener to guess the text that inspired a given
musical work.
These considerations lead to the conclusion that music characteris-
tics can be described almost only in musical terms. Probably for this
reason, the terminology used in the music domain refers to the struc-
ture of the musical work or to the overall features, rather than to its
content. Hence terms like B flat minor, concerto, ballad, and remix
refer to some of the possible global characteristics of a musical work
but are not useful to discriminate among the hundreds or thousands of
different concerti, ballads, and remixes. Clearly, relevant metadata on
author’s name, work title, and so on are particularly useful to retrieve
relevant documents if the user already knows this information. To this
end, the effectiveness of metadata for querying a music collection has
been studied in [70], where name–title entries, title entries, and contents
notes have been compared. From the analysis of the results, keyword
retrieval based on a textual description of the content was shown to be
less effective than title searching and name–title searching.
All these considerations suggest focusing on music content, in terms
of music features, as a way to access and retrieve music.
dimensions of music that could be effective for music retrieval are the
following:
1 The dictionary-based definition of timbre given in Section 1.1.1 is too generic for MIR pur-
poses; it has therefore been redefined according to its common use in MIR approaches.
Table 2.1 Time scale and characteristics of the different dimensions of music.
2 The number of tonalities that are perceived as different is slightly smaller, because of
enharmonic equivalents—tones that are written differently but actually sound exactly
the same with equal-tempered intonation, such as C♯ and D♭.
For instance, the original version of Yesterday by The Beatles and the
orchestral version played by The London Symphony Orchestra—and
most of the hundreds of cover versions that have been released—share
melody and harmony, but may not all be relevant for a user. On the
other hand, a cover version of Yesterday by a garage band, available
from a weblog, is probably not relevant for many users either, even if
the goal was to be as close as possible to the original version. This
particular aspect of the relevance of musical documents has not been
addressed in content-based MIR, apart from the case of audio finger-
printing, where particular recordings are tracked for copyright issues.
Nevertheless, such aspects will probably become more interesting when
MIR systems become available to a large majority of users.
In a sense, symbolic formats inherit the bias toward the direct rep-
resentation of a part of the musical content that has been discussed
in Section 2.3.1 for symbolic scores. As for textual documents, there
are two approaches to music editing: graphical interfaces and markup
languages.
The most popular commercial products for music editing are Finale,
Sibelius, Encore—whose names reveal a bias toward Western classical
music—together with software that takes into account both symbolic
and acoustic formats such as CuBase. These commercial products pro-
vide a graphical user interface and are currently used by music editors—
many printed music books have been authored with them. Unfortu-
nately, these valuable digital documents are not available to the public
for clear reasons of copyright protection. Although outside the scope
of this overview, it is worth mentioning the Wedelmusic project [141],
aimed at promoting the distribution of symbolic formats, granting the
protection of copyright ownership thanks to a watermarking mechanism
on the score printouts.
A visual format for the representation of symbolic scores is called
pianoroll, which represents notes as horizontal lines, where the vertical
position of the line depends on note pitch. The term pianoroll, used in
many music editing software packages, derived from the punched rolls
that were used in automatic pianos: The beginning of a line, which was
actually a hole in the paper, corresponded to a note onset, and the
length corresponded to note duration. Figure 2.1 displays the pianoroll
view representing the same music excerpt shown in Fig. 1.1; this rep-
resentation clarifies the horizontal and vertical dimensions of a musical
work introduced in Section 2.2.
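To make the representation concrete, the following is a minimal sketch, not taken from any of the cited systems, that draws a pianoroll-style view from a hypothetical list of (onset, duration, MIDI pitch) triples using matplotlib; the note values are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical note list: (onset in seconds, duration in seconds, MIDI pitch).
notes = [(0.0, 0.5, 66), (0.5, 0.5, 61), (1.0, 1.0, 68), (2.0, 0.5, 63)]

fig, ax = plt.subplots()
for onset, duration, pitch in notes:
    # Each note becomes a horizontal line: the x-extent is its time span,
    # the vertical position is its pitch, as in a punched piano roll.
    ax.hlines(y=pitch, xmin=onset, xmax=onset + duration, linewidth=4)

ax.set_xlabel("Time (seconds)")
ax.set_ylabel("Pitch (MIDI note number)")
ax.set_title("Pianoroll view")
plt.show()
```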
Portability of digital musical documents authored with commercial
software is an issue. If users of text processing software complain about
the portability of their documents to different platforms and operating
systems, it is probably because they have never tried with edited music.
Moreover, there have been few attempts to develop software to convert
from and to these formats. It has to be mentioned that Finale developed
a format, called Enigma, whose specifications are public [15] and for
which some converters have already been developed.
Fig. 2.1 Pianoroll view of the musical score shown in Fig. 1.1 (pitch vs. time in seconds).
3 The scores depicted in this paper have been edited with Lilypond.
[Figure: waveform plot, amplitude vs. time in seconds.]
timbre dimension that can be used for MIR tasks. An approach to audio
classification based on MPEG-7 descriptors has been proposed in [2].
4 It could be argued that the author of the SMF, that is the person who sequenced it, holds
the same rights as a music editor. In general, it is assumed that the labor needed to obtain
a useful printout from Midi files discourages this kind of activity, while authors freely grant
the possibility to download and listen to the files. Of course the musical work itself
can be copyrighted, and it is probably possible to find these files only because they are
not considered interesting for most users.
3
The Role of the User
Another crucial aspect is related to how well users can describe their
information needs. As usually happens in IR, the degree of expertise
of the user on the particular application domain has a strong impact on
retrieval effectiveness, which in the case of MIR may vary dramatically.
For instance, the knowledge of the musical theory and practice, of the
dimensions conveyed by the two forms and their different formats, may
In this case there is only one musical work that is relevant to the
user’s information need—e.g., the song was Knockin’ on Heaven’s Door
by Bob Dylan—and sometimes the user is interested in a particular
recording of that musical work—e.g., the version was the one performed
by Guns n’ Roses and included in the Use Your Illusion II album. The
search task is a particular case of information retrieval, where the rel-
evance of retrieved documents depends on their specific characteristics
rather than on the information they convey. An analogous task in Web IR
is the homepage finding task presented at the TREC [130] Web
track, where the only relevant document is the Web page maintained by
a given person or organization—e.g., the local Bob Dylan Fun Club—
while Web pages describing the person or the organization are not
considered relevant.
The use of the QBE paradigm is mandatory, because the only infor-
mation available to the user is a part of the document content. An
example can be provided by the user by singing a melodic/rhythmic
excerpt or, when available, by sending a recording of the unknown
musical work. In most cases, it is assumed that the users want
to access a digital recording of the searched musical piece in order to
eventually listen to it. Complete metadata information, such as author,
title, availability of the recording, may also be interesting for the user.
For this particular information need, the evaluation of the relevance
of the retrieved musical documents is straightforward. A person who
already knows the musical work corresponding to the user's query—and
often the user himself/herself—may easily judge, using a binary scale,
whether or not the retrieved musical documents are relevant.
suggested musical items can be carried out, for music and other media,
measuring how well a recommender system predicts the behavior of
real users, normally using n-fold cross-validation.
The literature on automatic recommendation is generally oriented
toward the exploitation of customers’ behavior—the choice of buying
or rating particular items—as in the case of collaborative filtering. This
approach has some drawbacks: the new item problem, that is, an item
that has not been bought or rated cannot be recommended, and the
cold start problem, that is, new customers do not have a reliable profile.
For this reason, recommender systems can also be content-based, as
described in Section 5.2.
Content-based approaches to music recommendation are motivated
by theories on music perception, which are worth mentioning.
The enjoyment of music listening can be due to the principles
of expectation-confirmation and expectation-denial. When listening to
a performance, the listeners create expectations on how the different
music dimensions will evolve. When expectations are confirmed too
often, the performance sounds uninteresting or even boring. On the
other hand, when expectations are denied too often, the performance
sounds incomprehensible and eventually boring. The ability to
anticipate the evolution of a musical work depends on the user's knowledge
of the music genre, besides a general knowledge of music theory and
practice.
These considerations suggest that similarity in the musical content
can be a viable approach to fulfill this information need, in order to
provide the user with previously unknown music that can be enjoyed
because the expectation-confirmation and expectation-denial mecha-
nisms may still hold thanks to the similarity with known and appreci-
ated musical works [68].
1 As of July 2006, the ACM Digital Library reported 89 papers using the term query-by-
humming, 13 query-by-singing, 1 query-by-whistling, and 2 query-by-playing.
4
Music Processing

As for any other medium, the first step toward music retrieval involves
the automatic processing of the musical documents to extract relevant
content descriptors. There are different approaches to music processing,
depending on the form and format in which musical documents are
instantiated and on the dimensions of interest. As can be expected, a
great part of the research on feature extraction has been devoted to the
audio form, from which most of the music dimensions are particularly
challenging to extract.
The analysis of the publications on feature extraction aimed at
the development of MIR systems shows a clear drift from symbolic
toward audio forms. Early works on MIR focused on the symbolic
form, because relevant melodic features were easier to extract and also
because MP3 was not as popular and pervasive as it is nowadays. SMF
was the common format for musical documents, and Midi has always
been considered more relevant for its symbolic representation of a score
than for the information it contains on temporal and intensity
performance parameters. Many approaches were based on the
Yet, for many genres, and in particular for pop and rock music,
it is assumed that there is only one relevant melodic line—usually
assigned to the voice—and all the other sounds are part of the
accompaniment—assigned to guitars, keyboards, bass, and drums. The
task that has to be carried out is the automatic recognition of the main
voice. To this end, statistical approaches have been proposed, starting
from a training corpus of manually annotated musical scores. The idea
is to describe each voice with a number of features that may help
discriminate sung melodies from other voices, including mean and vari-
ance of the pitch, range, mean and variance of the difference between
subsequent notes, and relative length of the voice with respect to
the total length. An approach to the computation of the main voice, or
theme, of a polyphonic score is presented in [47].
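As a rough illustration of the kind of per-voice statistics mentioned above (a sketch under the assumption that each voice is already available as a list of (MIDI pitch, duration) pairs; the feature set is indicative, not the one used in the cited works):

```python
import numpy as np

def voice_features(voice, total_duration):
    """Simple statistics for one voice of a polyphonic score.

    voice: list of (midi_pitch, duration) pairs for the notes of the voice.
    total_duration: duration of the whole piece, used to normalise voice length.
    """
    pitches = np.array([p for p, _ in voice], dtype=float)
    durations = np.array([d for _, d in voice], dtype=float)
    intervals = np.diff(pitches)  # differences between subsequent notes
    return {
        "pitch_mean": pitches.mean(),
        "pitch_var": pitches.var(),
        "pitch_range": pitches.max() - pitches.min(),
        "interval_mean": intervals.mean() if intervals.size else 0.0,
        "interval_var": intervals.var() if intervals.size else 0.0,
        "relative_length": durations.sum() / total_duration,
    }
```

Feature vectors of this kind, computed for each voice of a training corpus of annotated scores, can then be fed to any standard classifier that discriminates the sung melody from the accompaniment.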
A more difficult task is to extract the main melody from a poly-
phonic score when the voices are mixed together and when chords are
played together with single notes. The extraction of the main melody
can be carried out by exploiting previous knowledge of the perception
of melodic lines and of how composers organize sounds. An example is
the system ThemeFinder, presented in [87]. Figure 4.1 shows the main
melody, as it is normally perceived by listeners, of the music excerpt
shown in Fig. 1.1.
The effectiveness of the automatic computation of the main melody
also depends on how users perform the same task, which is dif-
ficult at least for listeners without a musical education. In fact,
if the query-by-humming approach is used, it is also assumed that the
users will recognize the main melody and use it as their query. To this
end, an interesting user study is reported in [137]. A number of sub-
jects were asked to assess the effectiveness by which different algorithms
extracted the main melody from a polyphonic score. Quite surprisingly,
subjects gave a higher score to the algorithm that simply extracted the
highest note in the polyphony, even when the final extracted melody
was clearly a mixture of different voices.1

Fig. 4.2 A melodic line extracted from the score shown in Fig. 1.1.
The extraction of the main melody is an error-prone process. For
instance, the extracted melody can have notes from other voices or
the wrong voice can be picked up as the representative melody. As an
example, Fig. 4.2 represents a descending melodic line that, although
extracted from the score shown in Fig. 1.1, is unlikely to be
remembered by users as its most representative melody. On the
other hand, the extraction of the main melody is not a necessary step
for approaches that are able to process and compare polyphonic scores
directly, as proposed in [76] and in [14].
1 For example, in the case of the opening bars of Yesterday by The Beatles, the algorithm
output would have been: the melody on the words yesterday, two descending notes of the
bass line, the melody on the words all my troubles seemed so far away, two more notes of
the bass line, and so on.
2 Charlie Parker's and John Coltrane's modifications of well-known chord sequences are two
typical examples.
[Figure: spectrogram plot, frequency (kilohertz) vs. time (seconds).]
very complex for other music languages, for example the ones developed
in Africa or in Eastern Europe, where different time signatures are
applied to the same musical work and rhythm is a multidimensional
feature.3
In the case of pop and rock music, rhythm information is based
on variants of the same structure of four equally spaced beats, the
first and the third stronger than the second and the fourth. What
becomes relevant then is the speed by which the beats are played.
To this end, a number of approaches to tempo tracking or foot-
tapping 4 have been proposed. Before its application to MIR, tempo
tracking has been applied to interactive music systems, using both
Midi [3] and audio [37]. The general approach is to highlight a peri-
odic structure in the amplitude envelope of the signal, for instance
using a self-similarity matrix [115] or the autocorrelation function [17].
Tempo tracking systems give a very general description of the
musical work content, which nevertheless may be considered rele-
vant for users interested in creating mixes of songs, where tempo
coherence is important, organizing playlists for parties, or retrieving
music they intend to dance to. In many radio companies, the songs
are classified according to their tempo, which is metadata manu-
ally added by tapping on the computer keyboard while the song is
played.
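As a rough sketch of the autocorrelation idea (not the specific algorithms of [17] or [115]), one can compute a short-time energy envelope of the signal and pick the lag with the strongest periodicity within a plausible range of beats per minute; all parameters below are illustrative.

```python
import numpy as np

def estimate_tempo(signal, sr, frame=1024, hop=512, bpm_range=(60, 180)):
    """Very rough tempo estimate from the periodicity of the energy envelope."""
    # Short-time energy envelope of the audio signal.
    n_frames = 1 + (len(signal) - frame) // hop
    env = np.array([np.sum(signal[i * hop:i * hop + frame] ** 2)
                    for i in range(n_frames)])
    env = env - env.mean()

    # Autocorrelation of the envelope: peaks correspond to periodicities.
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]

    fps = sr / hop                              # envelope frames per second
    min_lag = int(fps * 60.0 / bpm_range[1])    # fastest tempo -> shortest lag
    max_lag = int(fps * 60.0 / bpm_range[0])    # slowest tempo -> longest lag
    best_lag = min_lag + np.argmax(ac[min_lag:max_lag])
    return 60.0 * fps / best_lag                # beats per minute
```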
There is a particular Western music genre where rhythm infor-
mation, in particular the number and the organization of strong
and soft beats, becomes particularly relevant also for casual users:
dance music. Each rhythm is defined by a style label and is
associated with particular steps to dance to it. For example, the
approach presented in [38] uses general descriptors, such as tempo
and periodicity histograms, to classify dance music. The work
reported in [110] proposes to identify patterns in the spectrum
of the signal that are considered as the signature of a particu-
lar dance. These approaches, which are based on the assumptions
that particular music styles are described by temporal patterns,
3 This kind of music is usually called polyrhythmic.
4 The term refers to the ability to simulate a user who taps his foot while listening to a
driving rhythm.
5 This means that if the real fundamental frequency is an A at 440 Hz, the system may
recognize an A at either 220 Hz or 880 Hz.
that are within a given distance of the query, and provides links to
retrieve other similar documents.
Music browsing and navigation are based on the concept of similar-
ity between musical documents, which can be applied both to symbolic
and to audio forms. All the dimensions that have been presented, and
any combination, can be used to create new links between the docu-
ments. In principle, similarity is user-dependent, at least because the
individual contribution of each single dimension to the similarity score
depends on the importance that the user gives to it, and it may vary
with time and user expertise. Yet, most of the approaches to music
browsing are based on the static computation of similarity, based on a
predefined number of dimensions. Browsing can partially overcome the
problem of describing the content of a musical document, in particular
for casual users. To this end, defining a musical document through a list
of links to similar ones may be a useful tool for users in selecting—and
eventually purchasing—new musical works.
The first paper on content-based navigation inside a collection of
musical documents was presented in [8]. In that case, similarity
was computed using melody as the only relevant dimension, in par-
ticular using the pitch contour. An interesting aspect is that an open
hypermedia model is adopted, which enables the user to find avail-
able links from an arbitrary fragment of music. Another approach to
content-based navigation of a music collection is presented in [90]. A
collection of musical documents and their lexical units are enriched by
a hypertextual structure, called hypermusic, that allows the user to
navigate inside the document set. An important characteristic is that
links are automatically built between documents, between documents
and lexical units, and between lexical units. The navigation can be car-
ried out across documents but also across relevant content descriptors,
where similarity is computed depending on the co-occurrence of lexical
units inside the documents.
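A minimal sketch of the general idea of building links from co-occurring lexical units (purely illustrative: the hypermusic model in [90] is more elaborate, and the Jaccard overlap used here is an assumed similarity measure):

```python
def cooccurrence_links(doc_units, threshold=0.3):
    """Create links between documents sharing lexical units.

    doc_units: dict mapping a document id to the set of lexical units it contains.
    Returns a list of (doc_a, doc_b, similarity) links, where similarity is the
    Jaccard overlap of the two documents' lexical-unit sets.
    """
    links = []
    ids = sorted(doc_units)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            intersection = len(doc_units[a] & doc_units[b])
            union = len(doc_units[a] | doc_units[b])
            similarity = intersection / union if union else 0.0
            if similarity >= threshold:
                links.append((a, b, similarity))
    return links
```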
main goal is to identify and label the audio in three different classes:
speech, music, and environmental sound. This first coarse classifica-
tion can be used to aid video segmentation or decide where to apply
automatic speech recognition. The refinement of the classification with
a second step, where music signals are labeled with a number of pre-
defined classes, has been presented in [144], which is also worth men-
tioning because it is one of the first papers that present hidden Markov
models as a tool for MIR.
An early work on audio classification, presented in [143], was aimed
at retrieving simple music signals using a set of semantic labels, in
particular focusing on the musical instruments that are part of the
orchestration. The approach is based on the combination of segmenta-
tion techniques with automatic separation of different sources and
parameter extraction. The classification based on the particular orches-
tration is still an open problem with complex polyphonic performances,
as described in Section 4.4.
An important issue in audio classification, introduced in [30], is the
amount of audio data needed to achieve good classification rates. This
problem has many aspects. First, the amount of data needed is strictly
related to the computational complexity of the algorithms, which usu-
ally are at least linear with the number of audio samples. Second, per-
ceptual studies showed that even untrained listeners are quite good at
classifying audio data with very short excerpts (less than 1 sec). Finally,
in a query-by-example paradigm, where the examples have to be digi-
tally recorded by users, it is likely that users will not be able to record
a significantly large part of audio.
A particular aspect of audio classification is genre classification.
The problem is to correctly label an unknown recording of a song with
a music genre. Labels can be hierarchically organized in genres and
subgenres, as shown in Table 5.1. Labeling can be used to enrich the
musical document with high-level metadata or to organize a music col-
lection. In this latter case, it is assumed that the organization in genres
and subgenres is particularly suitable for a user, because it is followed
by almost all the CD sellers, and is one of the preferred access methods
for on-line stores. Genre classification is, as other aspects of MIR, still
biased by Western music, and thus genres are the ones typically found
Table 5.1 Genres and subgenres.

Genre        Subgenres
Classical    Choir, Orchestra, Piano, String quartet
Country
Disco
HipHop
Jazz         BigBand, Cool, Fusion, Piano, Quartet, Swing
Rock
Blues
Reggae
Pop
Metal
in Western music stores. Some attempts have been made to extend the
approach to other cultures: for instance, in [104] genre classifica-
tion has been carried out for traditional Indian musical forms together
with Western genres.
It can be argued that a simple classification based on genres may
not be particularly useful for a user, because a coarse categorization
will result in hundreds of thousands of musical documents in the same
category, and users may not agree on how the classification is carried
out with a fine-grained categorization. Yet, this part of MIR research is
pretty active, because users still base their choices on music genres, and
information about genre preferences can be exploited to refine users’
profiles, as proposed in [48].
One of the first papers introducing the problem of music classifica-
tion is [136]. The proposed classification hierarchy is shown in Table 5.1,
from which it can be seen that there is a bias toward classical music
and jazz, while some genres—ambient, electronic, and ethnic—are not
reported. This is a typical problem of music classification, because the
relevance of the different categories is extremely subjective, as well as
the categories themselves. These problems are also faced by human
classifiers that try to accomplish the same task, and in fact in [136]
it is reported that college students achieved no more than about 70%
classification accuracy when listening to three seconds of audio (lis-
tening to longer excerpts did not improve the performance). The auto-
matic classification is based on three different feature sets, related to
rhythmic, pitch, and timbre features. As also highlighted in subsequent
works, rhythm seems to play an important role for the classification.
The features used as content descriptors are normally the ones
related to timbre, and described in Section 4.3. This choice depends
on the fact that approaches try to classify short excerpts of an audio
recording, where middle-term features like melody and harmony are not
captured. Common music processing approaches compute the MFCCs,
while the use of the wavelet transform is exploited in [40] and in [78].
Systems on genre classification are normally trained with a set of
labeled audio excerpts, and classification is carried out using different
techniques and models from the classification literature. In particular,
k-Nearest Neighbor and Gaussian Mixture Models are classically used
for classifying genres, but Support Vector Machines and Linear Dis-
criminant Analysis have also been successfully applied to this task.
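The generic pipeline just described can be sketched as follows, assuming librosa for MFCC extraction and scikit-learn for the classifier; the file names, the summary statistics, and the choice of a 1-nearest-neighbor classifier are illustrative, not those of any cited system.

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def excerpt_features(path, n_mfcc=13):
    """Summarise an audio excerpt by the mean and variance of its MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])

# Hypothetical labelled training excerpts.
train_paths = ["classical_01.wav", "jazz_01.wav", "rock_01.wav"]
train_labels = ["Classical", "Jazz", "Rock"]

X = np.array([excerpt_features(p) for p in train_paths])
classifier = KNeighborsClassifier(n_neighbors=1).fit(X, train_labels)

# Label an unknown recording with the genre of its nearest training excerpt.
print(classifier.predict([excerpt_features("unknown.wav")]))
```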
listen to the songs themselves, will make it easier for the user to browse
his own collection. The latter is motivated by the fact that a spatial
organization of the music collection will help users find particular songs
they are interested in, because they can remember their position in the
visual representation and can be aided by the presence of similar
songs near the searched one.
There is a variety of approaches to music visualization, including the
symbolic score, the pianoroll view, the plot of the waveform, and the
spectrogram. Any representation has positive aspects and drawbacks,
depending on the dimensions carried by the music form it is related
to, and on the ability to capture relevant features. Representations can
be oriented toward a global representation or local characteristics. The
interested reader may refer to [58] for a complete overview on techniques
for music visualization.
Visualization of a collection of musical documents is usually based
on the concept of similarity. The problem of a graphical representation,
normally based on bidimensional plots, is typical of many areas of data
analysis. Techniques such as Multidimensional Scaling and Principal
Component Analysis are well known for representing a complex and
multidimensional set of data when a distance measure—such as the
musical similarity—can be computed between the elements or when
the elements are mapped to points in a high-dimensional space. The
application of bidimensional visualization techniques to music collec-
tions has to be carried out considering that the visualization will be
given to non-expert users, rather than to data analysts, who need a
simple and appealing representation of the data.
One example of system for graphical representation of audio col-
lection is Marsyas3D, which includes a variety of alternative 2D and
3D representations of elements in the collection. In particular, Princi-
pal Component Analysis is used to reduce the parameter space that
describe the timbre in order to obtain either a bidimensional or tridi-
mensional representation. Another example is the Sonic Browser, which
is an application for browsing audio collections [9] that provides the
user with multiple views, including a bidimensional scatterplot of audio
objects, where the coordinates of each point depend on attributes of the
dataset, and a graphical tree representation, where the tree is depicted
with the root at the center and the leaves over a circle. The sonic radar,
presented in [81], is based on the idea that only a few objects, called
prototype songs, can be presented to the user. Each prototype song is
obtained by clustering the collection with the k-means algorithm and
extracting the song that is closest to the cluster center. Prototype songs
are plotted on a circle around a standpoint.
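Both ideas, dimensionality reduction for plotting and prototype songs obtained by clustering, can be sketched as follows (a minimal illustration assuming a precomputed matrix of timbre descriptors, here random data; it does not reproduce Marsyas3D or the sonic radar).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical feature matrix: one row of timbre descriptors per song.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 20))

# Reduce the timbre space to two dimensions for a scatterplot of the collection.
coords_2d = PCA(n_components=2).fit_transform(features)

# Cluster the collection and pick one prototype song per cluster:
# the song whose descriptors are closest to the cluster centre.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(features)
prototypes = []
for k, centre in enumerate(kmeans.cluster_centers_):
    members = np.where(kmeans.labels_ == k)[0]
    distances = np.linalg.norm(features[members] - centre, axis=1)
    prototypes.append(int(members[np.argmin(distances)]))

print("Prototype song indices:", prototypes)
```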
A number of approaches to music visualization are based on Self-
Organizing Maps (SOMs) [129]. In particular, the visual metaphor of
Islands of Music is presented in [108], where musical documents are
mapped on a plane and enriched by a third dimension in the form of a
geographical map. SOMs give a different visualization of the collection
depending on the choice of the audio parameters used for their train-
ing. A tool to align the SOMs is proposed to reduce the complexity
of alternative representations for non-expert users. Another approach
using Emergent SOMs for the bidimensional representation of a music
collection is presented in [93], where genre classification was combined
with visualization because genre labels are added to the elements in
the collection. In this case, instead of allowing the user to choose the
combination of dimensions that he prefers, the system is trained with
more than 400 different low-level features, which were also aggregated
to obtain high-level features, and the selection was made a posteriori
depending on the ability of each feature to separate a group of similar
songs from the others.
6
Evaluation
There is still some research work on MIR that has been evaluated
only qualitatively; for instance, all the approaches to the visualization
of music collections presented in Section 5.3.3 are difficult to evaluate
with classic techniques.
For the first task, the participants could use the raw audio data
to run their experiments, because it was provided by a Web service
that allowed the use of audio content for research purposes. Due to
copyright issues, the organizers did not distribute the original recordings
for the second and third tasks, but distributed a set of low-level features
that they computed from the recordings themselves, as proposed in [6].
the MIREX 2005 campaign, six on the audio form and three on the
symbolic form.
(1) Audio Genre Classification: assign a label, from a set of
10 pre-defined genres, to an audio recording of polyphonic
music.
(2) Audio Artist Identification: recognize the singer or the group
that performed a polyphonic audio recording.
(3) Audio Drum Detection: detect the onsets of drum sounds in
polyphonic pop songs, some synthesized and some recorded,
which have been manually annotated for comparison.
(4) Audio Onset Detection: detect the onsets of any musical
instrument in different kinds of recordings, from polyphonic
music to solo drums.
(5) Audio Tempo Extraction: compute the perceptual tempo of
polyphonic audio recordings.
(6) Audio Melody Extraction: extract the main melody, for
instance the singing voice in a pop song or the lead instru-
ment in a jazz ballad, from polyphonic audio.
(7) Symbolic and Audio Key Finding: extract the main key sig-
nature of a musical work, given either a representation of the
symbolic score or the recording of a synthetic performance
(the dataset was the same for the two tasks).
(8) Symbolic Genre Classification: assign a label, from a set of
38 pre-defined genres, to a symbolic representation—e.g.,
Midi—of a musical work.
(9) Symbolic Melodic Similarity: given an incipit, retrieve the
most similar documents from a collection of monophonic
incipits.
Audio genre classification and artist identification were carried out
on similar datasets that were based on two independent collections.
The participants had to make independent runs on the two collections,
and final results were based on the average performances. In the case of
artist identification, the best result obtained was a recognition rate of
72%. Alternative collections were also used for other tasks. In
particular, onset detection was divided into nine subtasks, depending
classification accuracy reached 84% for audio and 77% for symbolic
documents.
The task most similar to an IR task was the one on melodic
similarity. The evaluation was carried out using the Cranfield model
for information retrieval, with an experimental collection of 582
documents—monophonic incipits—and two sets of queries, used to train
and test the systems respectively. Queries were in the same form as the
documents, while the relevance judgments were collected by the pro-
posers of the task [132]. The retrieval effectiveness has been compared
using classic IR measures—for instance a non-interpolated average pre-
cision of 0.51 has been obtained—together with an ad hoc measure,
called Average Dynamic Recall and suggested by the proposers [134];
the novel measure took into account the fact that relevance assessments
were not binary. The relative scoring of the different systems was not
affected by the kind of measure.
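To make the reported measure concrete, the following is a minimal computation of non-interpolated average precision for one query with binary judgments (Average Dynamic Recall, which handles non-binary judgments, is not shown); the example ranking is invented.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Non-interpolated average precision for one query.

    ranked_relevance: 0/1 relevance judgments in ranked order, e.g. [1, 0, 1, 1].
    total_relevant: number of relevant documents in the collection; defaults to
        the number of relevant documents appearing in the ranking.
    """
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    denominator = total_relevant if total_relevant else hits
    return precision_sum / denominator if denominator else 0.0

# Example: relevant documents retrieved at ranks 1, 3, and 4 (3 relevant in total).
print(average_precision([1, 0, 1, 1]))  # (1/1 + 2/3 + 3/4) / 3 ~= 0.806
```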
Being promoted and organized by different persons, and requiring
different kinds of manually annotated labeled data, the nine tasks of
MIREX 2005 were carried out with different collections, with very
different sizes, sometimes divided into training and test collections.
Some information about the type of task, the size of the collections,
and the number of participants for each task is reported in Table 6.1.
Audio-related tasks were definitely more popular than symbolic ones,
Table 6.1 Main characteristics of the tasks carried out at MIREX 2005 campaign.
A trend can be seen from the comparison between MIREX 2005 and
2006. First of all, the percentage of tasks involving the audio format has
increased, since only one task is based solely on the symbolic
form. Moreover, all the tasks on the extraction of high-level metadata—
genre classification, artist identification, and key finding—are not part
of the final MIREX 2006 set of tasks, even though they were part of
the initial proposal.
Another relevant difference with MIREX 2005 is that some of
the tasks can be grouped together, because they have similar goals,
and may be based on the same test collections. For instance, Audio
Beat Tracking and Audio Tempo Extraction both focus on rhythm as
the most relevant music dimension, while Audio Music Similarity and
Retrieval and Audio Cover Song, though different tasks, are based on
the same collection. Also Symbolic Melodic Similarity and Query by
Singing/Humming, which are two typical IR tasks, are very similar,
with the former focusing on the effects of the collection and the latter
on the effects of queries.
The interested reader can find further information on MIREX 2006,
and in particular the evaluation results when they become available, at
the official Web site of the contest [92].
7
Conclusions
There are some other aspects related to MIR that have not been
taken into account in this overview. Among these, perhaps the most
strictly related to issues of music retrieval is the research area named
audio fingerprinting. The task is to recognize all the copies of a
given audio document even in the presence of major distortions, such as
additional noise, compression, resampling, or time stretching. Audio
fingerprinting is more related to automatic music recognition than
to music retrieval, and has significant applications in copyright
management. An interesting overview of audio fingerprinting motiva-
tion and techniques, presented in a unified framework, can be found
in [11]. Audio fingerprinting has also found commercial applications, like
the one provided by [122]. Another aspect related to MIR, which
is outside the scope of this overview but is worth at least men-
tioning, is audio watermarking aimed at inserting inaudible yet track-
able information about the owner and the user of a musical doc-
ument [42]. Audio watermarks are usually created to protect copy-
right because the owner of a digital recording can be recognized and
also because it is possible to track the final users who, after buying
digital recordings, are responsible for their diffusion on file-sharing
networks.
Despite its existence as a research area for at least five or six years,
MIR is still not well known in the scientific community. It may still hap-
pen that, after the presentation of a MIR technique, someone from the
audience approaches the speaker asking if it is really possible to retrieve
music by singing a melody, while nobody is amazed by a system that
can retrieve images. This situation may seem strange because, apart
from the successful ISMIR [59], which gathers each year an increasing
number of participants, MIR results are more and more published and
presented in journals, conferences, and workshops on IR, multimedia,
speech and audio processing. The weak perception of MIR as a research
discipline can be explained by the peculiarities of the musical language,
which make it radically different from other media in particular because
of the kind of content that is conveyed and how this content may be
related to users’ information needs. A side effect of this diversity is that,
even though there are a number of national and international projects
References
[1] M. Agosti, F. Bombi, M. Melucci, and G.A. Mian. Towards a digital library
for the venetian music of the eighteenth century. In J. Anderson, M. Deegan,
S. Ross, and S. Harold, editors, DRH 98: Selected Papers from Digital
Resources for the Humanities, pages 1–16. Office for Humanities Communica-
tion, 2000.
[2] E. Allamanche, J. Herre, O. Hellmuth, B. Fröba, T. Kastner, and M.
Cremer. Content-based identification of audio material using MPEG-7 low
level description. In Proceedings of the International Symposium on Music
Information Retrieval, pages 73–82, 2001.
[3] P.E. Allen and R.B. Dannenberg. Tracking musical beats in real time. In
Proceedings of the International Computer Music Conference, pages 140–143,
1990.
[4] D. Bainbridge, C.G. Nevill-Manning, I.H. Witten, L.A. Smith, and R.J.
McNab. Towards a digital library of popular music. In Proceedings of the
ACM Conference on Digital Libraries, pages 161–169, 1999.
[5] M.A. Bartsch and G.H. Wakefield. Audio thumbnailing of popular music
using chroma-based representations. IEEE Transactions on Multimedia,
7(1):96–104, 2005.
[6] A. Berenzweig, B. Logan, D.P.W. Ellis, and B. Whitman. A large-scale eval-
uation of acoustic and subjective music-similarity measures. Computer Music
Journal, 28(2):63–76, 2004.
[7] W.P. Birmingham, R.B. Dannenberg, G.H. Wakefield, M. Bartsch,
D. Bykowski, D. Mazzoni, C. Meek, M. Mellody, and W. Rand. MUSART:
music retrieval via aural queries. In Proceedings of the International Confer-
ence on Music Information Retrieval, pages 73–82, 2001.
[8] S. Blackburn and D. DeRoure. A tool for content based navigation of music.
In Proceedings of the ACM International Conference on Multimedia, pages
361–368, 1998.
[9] E. Brazil and M. Fernström. Audio information browsing with the sonic
browser. In Coordinated and Multiple Views in Exploratory Visualization,
pages 26–31, 2003.
[10] E. Cambouropoulos. Musical rhythm: a formal model for determining local
boundaries. In E. Leman, editor, Music, Gestalt and Computing, pages
277–293. Springer-Verlag, Berlin, DE, 1997.
[11] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of audio fingerprint-
ing. Journal of VLSI Signal Processing, 41:271–284, 2005.
[12] P. Cano, A. Loscos, and J. Bonada. Score-performance matching using hmms.
In Proceedings of the International Computer Music Conference, pages 441–
444, 1999.
[13] Cantate. Computer access to notation and text in music libraries, July 2006.
http://projects.fnb.nl/cantate/.
[14] M. Clausen, R. Engelbrecht, D. Meyer, and J. Schmitz. PROMS: a web-based
tool for searching in polyphonic music. In Proceedings of the International
Symposium of Music Information Retrieval, 2000.
[15] Coda Music. Enigma transportable file specification. Technical Report, ver-
sion 98c.0, July 2006. http://www.xs4all.nl/ hanwen/lily-devel/etfspec.pdf.
[16] R.B. Dannenberg and H. Mukaino. New techniques for enhanced quality
of computer accompaniment. In Proceedings of the International Computer
Music Conference, pages 243–249, 1988.
[17] M.E.P. Davies and M.D. Plumbley. Casual tempo tracking of audio. In Pro-
ceedings of the International Conference on Music Information Retrieval,
pages 164–169, 2004.
[18] S.B. Davis and P. Mermelstein. Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences. IEEE
Transactions on Acoustic, Speech, and Signal Processing, 28(4):357–366, 1980.
[19] A. de Cheveigné and A. Baskind. F0 estimation. In Proceedings of Eurospeech,
pages 833–836, 2003.
[20] S. Dixon and G. Widmer. Match: a music alignment tool chest. In Proceedings
of the International Conference of Music Information Retrieval, pages 492–
497, 2005.
[21] S. Doraisamy and S. Rüger. A polyphonic music retrieval system using N-
grams. In Proceedings of the International Conference on Music Information
Retrieval, pages 204–209, 2004.
[22] W.J. Dowling. Scale and contour: two components of a theory of memory for
melodies. Psychological Review, 85(4):341–354, 1978.
[23] J.S. Downie, J. Futrelle, and D. Tcheng. The International Music Information
Retrieval Systems Evaluation Laboratory: governance, access and security. In
Proceedings of the International Conference on Music Information Retrieval,
pages 9–14, 2004.
[24] J.S. Downie, K. West, A. Ehmann, and E. Vincent. The 2005 music infor-
mation retrieval evaluation exchange (mirex 2005): preliminary overview. In
[39] J.M. Grey and J.A. Moorer. Perceptual evaluations of synthesized musical
instruments tones. Journal of Acoustic Society of America, 62(2):454–462,
1977.
[40] M. Grimaldi, P. Cunningham, and A. Kokaram. A wavelet packet represen-
tation of audio signals for music genre classification using different ensemble
and feature selection techniques. In Proceedings of the ACM International
Workshop on Multimedia Information Retrieval, pages 102–108, 2003.
[41] L. Grubb and R.B. Dannenberg. A stochastic method of tracking a vocal
performer. In Proceedings of the International Computer Music Conference,
pages 301–308, 1997.
[42] J. Haitsma, M. van der Veen, T. Kalker, and F. Bruekers. Audio watermarking
for monitoring and copy protection. In Proceedings of the ACM Workshops on
Multimedia, pages 119–122, 2000.
[43] Harmonica. Accompanying action on music information in libraries, July 2006.
http://projects.fnb.nl/harmonica/.
[44] C. Harte, M. Sandler, S. Abdallah, and E. Gómez. Symbolic representation of
musical chords: a proposed syntax for text annotations. In Proceedings of the
International Conference on Music Information Retrieval, pages 66–71, 2005.
[45] J. Harvell and C. Clark. Analysis of the quantitative data of system perfor-
mance. Deliverable 7c, LIB-JUKEBOX/4-1049: Music Across Borders, July
2006. http://www.statsbiblioteket.dk/Jukebox/edit-report-1.html.
[46] G. Haus and E. Pollastri. A multimodal framework for music inputs. In Pro-
ceedings of the ACM Multimedia Conference, pages 282–284, 2000.
[47] Y. Hijikata, K. Iwahama, K. Takegawa, and S. Nishida. Content-based music
filtering system with editable user profile. In Proceedings of the ACM Sympo-
sium on Applied Computing, pages 1050–1057, 2006.
[48] K. Hoashi, K. Matsumoto, and N. Inoue. Personalization of user profiles for
content-based music retrieval based on relevance feedback. In Proceedings of
the ACM International Conference on Multimedia, pages 110–119, 2003.
[49] H.H. Hoos, K. Renz, and M. Görg. GUIDO/MIR—an experimental musical
information retrieval system based on GUIDO music notation. In Proceedings
of the International Symposium on Music Information Retrieval, pages 41–50,
2001.
[50] J.-L. Hsu, C.C. Liu, and A.L.P. Chen. Efficient repeating pattern finding in
music databases. In Proceeding of the International Conference on Information
and Knowledge Management, pages 281–288, 1998.
[51] Humdrum. The Humdrum Toolkit: software for music research, July 2006.
http://www.music-cog.ohio-state.edu/Humdrum/.
[52] D. Huron. The Humdrum Toolkit: Reference Manual. Center for Computer
Assisted Research in the Humanities, Menlo Park, CA, 1995.
[53] N. Hu, R.B. Dannenberg, and A.L. Lewis. A probabilistic model of melodic
similarity. In Proceedings of the International Computer Music Conference,
pages 509–515, 2002.
[54] N. Hu, R.B. Dannenberg, and G. Tzanetakis. Polyphonic audio matching and
alignment for music retrieval. In Proceedings of the IEEE Workshop on Appli-
cations of Signal Processing to Audio and Acoustics, pages 185–188, 2003.
[85] A. McLane. Music as information. In M.E. Williams, editor, Arist, volume 31,
chapter 6, pages 225–262. American Society for Information Science, 1996.
[86] C. Meek and W. Birmingham. Johnny can’t sing: a comprehensive error model
for sung music queries. In Proceedings of the International Conference on
Music Information Retrieval, pages 65–71, 2002.
[87] C. Meek and W. Birmingham. Automatic thematic extractor. Journal of Intel-
ligent Information Systems, 21(1):9–33, 2003.
[88] M. Melucci, N. Orio, and M. Gambalunga. An evaluation study on music
perception for musical content-based information retrieval. In Proceedings of
the International Computer Music Conference, pages 162–165, 2000.
[89] M. Melucci and N. Orio. Musical information retrieval using melodic surface.
In Proceedings of the ACM Conference on Digital Libraries, pages 152–160,
1999.
[90] M. Melucci and N. Orio. Combining melody processing and information
retrieval techniques: methodology, evaluation, and system implementation.
Journal of the American Society for Information Science and Technology,
55(12):1058–1066, 2004.
[91] R. Middleton. Studying Popular Music. Open University Press, Philadelphia,
PA, 2002.
[92] Mirex 2006 Wiki. Second annual music information retrieval evaluation
exchange, July 2006. http://www.music-ir.org/mirex2006/.
[93] F. Mörchen, A. Ultsch, M. Nöcker, and C. Stamm. Databionic visualization of
music collections according to perceptual distance. In Proceedings of the Inter-
national Conference on Music Information Retrieval, pages 396–403, 2005.
[94] MPEG. The MPEG home page, July 2006. http://www.chiariglione.org/
mpeg/.
[95] M. Müller, F. Kurth, and M. Clausen. Audio matching via chroma-based
statistical features. In Proceedings of the International Conference of Music
Information Retrieval, pages 288–295, 2005.
[96] MuseData. An electronic library of classical music scores, July 2006. http://
www.musedata.org/.
[97] Musica. The international database of choral repertoire, July 2006. http://
www.musicanet.org/.
[98] MusicXML. Recordare: Internet music publishing and software, July 2006.
http://www.musicxml.org/.
[99] GUIDO Music Notation. The GUIDO NoteServer, July 2006. http://www.
noteserver.org/.
[100] Music Notation. Formats, July 2006. http://www.music-notation.info/.
[101] MusiXTeX. MusiXtex and related software, July 2006. http://icking-music-
archive.org/software/indexmt6.html.
[102] E. Narmour. The Analysis and Cognition of Basic Melodic Structures. Uni-
versity of Chicago Press, Chicago, IL, 1990.
[103] G. Neve and N. Orio. Indexing and retrieval of music documents through pat-
tern analysis and data fusion techniques. In Proceedings of the International
Conference on Music Information Retrieval, pages 216–223, 2004.
[104] N.M. Norowi, S. Doraisamy, and R. Wirza. Factors affecting automatic genre
classification: an investigation incorporating non-western musical forms. In
Proceedings of the International Conference on Music Information Retrieval,
pages 13–20, 2005.
[105] N. Orio and G. Neve. Experiments on segmentation techniques for music doc-
uments indexing. In Proceedings of the International Conference on Music
Information Retrieval, pages 104–107, 2005.
[106] N. Orio. Alignment of performances with scores aimed at content-based
music access and retrieval. In Proceedings of the European Conference on Digital
Libraries, pages 479–492, 2002.
[107] R.P. Paiva, T. Mendes, and A. Cardoso. On the detection of melody notes
in polyphonic audio. In Proceedings of the International Conference on Music
Information Retrieval, pages 175–182, 2005.
[108] E. Pampalk, S. Dixon, and G. Widmer. Exploring music collections by brows-
ing different views. In Proceedings of the International Conference on Music
Information Retrieval, pages 201–208, 2003.
[109] C.L. Parker. A tree-based method for fast melodic retrieval. In Proceedings of
the ACM/IEEE Joint Conference on Digital Libraries, pages 254–255, 2004.
[110] G. Peeters. Rhythm classification using spectral rhythm patterns. In Proceed-
ings of the International Conference on Music Information Retrieval, pages
644–647, 2005.
[111] J. Pickens and T. Crawford. Harmonic models for polyphonic music retrieval.
In Proceedings of the International Conference on Information and Knowledge
Management, pages 430–437, 2002.
[112] A. Pienimäki and K. Lemström. Clustering symbolic music using paradigmatic
and surface level analyses. In Proceedings of the International Conference of
Music Information Retrieval, pages 262–265, 2004.
[113] A. Pienimäki. Indexing music database using automatic extraction of frequent
phrases. In Proceedings of the International Conference on Music Information
Retrieval, pages 25–30, 2002.
[114] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of ran-
dom fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
19:380–393, 1997.
[115] A. Pikrakis, I. Antonopoulos, and S. Theodoris. Music meter and tempo track-
ing from raw polyphonic audio. In Proceedings of the International Conference
on Music Information Retrieval, pages 192–197, 2004.
[116] L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice-
Hall, Englewood Cliffs, NJ, 1993.
[117] C. Raphael and J. Stoddard. Harmonic analysis with probabilistic graphical
models. In Proceedings of the International Conference on Music Information
Retrieval, pages 177–181, 2003.
[118] C. Raphael. Automatic segmentation of acoustic musical signals using hid-
den markov models. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 21(4):360–370, 1999.
[119] T. Rohdenburg, V. Hohmann, and B. Kollmeier. Objective perceptual quality
measures for the evaluation of noise reduction schemes. In Proceedings of the