TablaNet: A Real-Time Online Musical Collaboration System for Indian Percussion
Mihir Sarkar
Diplôme d'Ingénieur ESIEA
École Supérieure d'Informatique, Électronique, Automatique (1996)
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of
Master of Science in Media Technology
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2007
© Massachusetts Institute of Technology 2007. All rights reserved.
Author:
Program in Media Arts and Sciences
August 10, 2007
Certified by:
Barry L. Vercoe
Professor of Media Arts and Sciences
Thesis Supervisor
Accepted by:
Prof. Deb Roy
Chairperson
Departmental Committee on Graduate Students
TablaNet:
A Real-Time Online Musical Collaboration System for
Indian Percussion
by
Mihir Sarkar
Submitted to the Program in Media Arts and Sciences
on August 10, 2007, in partial fulfillment of the
requirements for the degree of
Master of Science in Media Technology
Abstract
Thanks to the Internet, musicians located in different countries can now aspire to
play with each other almost as if they were in the same room. However, the time
delays due to the inherent latency in computer networks (up to several hundreds
of milliseconds over long distances) are unsuitable for musical applications. Some
musical collaboration systems address this issue by transmitting compressed audio
streams (such as MP3) over low-latency and high-bandwidth networks (e.g. LANs
or Internet2) to constrain time delays and optimize musician synchronization. Other
systems, on the contrary, increase time delays to a musically-relevant value like one
phrase, or one chord progression cycle, and then play it in a loop, thereby constraining the music being performed. In this thesis I propose TablaNet, a real-time online
musical collaboration system for the tabla, a pair of North Indian hand drums. This
system is based on a novel approach that combines machine listening and machine
learning. Trained for a particular instrument, here the tabla, the system recognizes
individual drum strokes played by the musician and sends them as symbols over the
network. A computer at the receiving end identifies the musical structure from the
incoming sequence of symbols by mapping them dynamically to known musical constructs. To deal with transmission delays, the receiver predicts the next events by
analyzing previous patterns before receiving the original events, and synthesizes an
audio output estimate with the appropriate timing. Although prediction approximations may result in a slightly different musical experience at both ends, we find that
this system demonstrates a fair level of playability by tabla players of various levels,
and functions well as an educational tool.
Thesis Supervisor: Barry L. Vercoe
Title: Professor of Media Arts and Sciences
TablaNet:
A Real-Time Online Musical Collaboration System for
Indian Percussion
by
Mihir Sarkar
Thesis Committee
Advisor:
Barry L. Vercoe
Professor of Media Arts and Sciences
Massachusetts Institute of Technology
Reader:
Tod Machover
Professor of Music and Media
Massachusetts Institute of Technology
Reader:
Miller S. Puckette
Professor, Music
Associate Director, Center for Research in Computing and the Arts
University of California, San Diego
Acknowledgments
For their help, directly or indirectly, with this thesis, I would like to thank:
First and foremost, Sharmila, my wife, with whom I have shared the joys and
pains of going back to study, and who supported me patiently through the several
nights I spent solving problem sets, debugging code, and writing this thesis. She
helps me reach for my dreams.
Barry Vercoe, my advisor, who gave me the extraordinary opportunity to be here,
the freedom to explore and learn and grow, and who points me in the right direction
when I ask him to. For this, and more, I am deeply indebted to him.
Tod Machover and Miller Puckette, my thesis readers, for their encouragements,
patience, and insightful comments.
Owen Meyers (my office-mate), Anna Huang, Wu-Hsi Li, Judy Brown, Dale
Joachim, and Yang Yang (my UROPer) from the Music, Mind and Machine group
for their feedback, enthusiasm, and ideas... and the opportunity to share mango lassi.
Brian Whitman, a Music, Mind and Machine alumnus, for easing my transition
into the Media Lab and being the link between the group's current and former members, for always being available for pertinent discussions, and for his kindness.
Mutsumi Sullivan and Sandy Sener for their administrative support and help with
miscellaneous items (from repairing the tabla set to arranging for the compensation
of the user study subjects). In particular, I would like to extend very special thanks
to Mutsumi, our Music, Mind and Machine group assistant, for taking care of the
practical aspects of this thesis (signature gathering, printing, and submission) up
until the last minute, while I was busy wrapping up my thesis in India.
The MIT and Media Lab community for truly being out of this world and continuously striving to make the world a better place.
And finally my parents, whose ideals and education have brought me thus far,
who have guided me without ever erecting a barrier, for their love and support.
To those above, and those whom I did not mention here but who played a role in this
work, I express all my gratitude for their contribution to this thesis.
Contents

1 Introduction
  1.1 Overview
  1.2 Scope
  1.3 Methodology
  1.4 Thesis Outline

2 Background
  2.1 Network Music Collaboration
  2.2 The Tabla in Indian Music
  2.3 Tabla Analysis and Synthesis

3 Design
  3.1 System Architecture
  3.2 Hardware Setup
  3.3 Software Implementation
  3.4 Tabla Stroke Training and Recognition
  3.5 Tabla Phrase Prediction and Synthesis

4 Evaluation
  4.1 Quantitative Analysis
  4.2 Qualitative Experiments
  4.3 Discussion

5 Conclusion
  5.1 Contributions and Technical Relevance
  5.2 Applications and Social Relevance
  5.3 Future Work
A Experimental Study
  A.1 Study Approval
  A.2 Study Protocol
  A.3 Questionnaire Responses
List of Figures

1-1 Technology employed in the TablaNet system
3-1 TablaNet system architecture
3-2 Piezoelectric film sensor element (from Measurement Specialties, Inc.)
3-3 Transmitter functional diagram
3-4 Receiver functional diagram
[captions of the remaining figures, in Chapters 2, 3, and 4, were not recoverable]
List of Tables

2.1 Tabla strokes
[captions of Tables 3.1, 4.1, and 4.2 were not recoverable]
Chapter 1
Introduction
1.1 Overview
Motivation
In 1992, I organized a concert in India. I recruited two of my musician friends, both
keyboard players like me, and we formed a band. We practiced over the summer and
performed at the end of August in a beautiful open-air courtyard in front of a jam-packed audience. We played instrumental covers of Western pop songs (The Beatles,
Stevie Wonder, Phil Collins) some of which were known to the audience in their Indianized version as Bollywood film songs! As I was living in France at that time, I had
brought with me sound and light equipment (smoke machine, black light, strobes and
scanners). That year, liberalization and cable operators had just introduced MTV
to Indian households, but few had yet been exposed to the kinds of synthetic sounds
and light effects that we had in our concert.
As I returned to France, my friends and I were eager to carry on with our collaboration. We even included another friend (a guitarist living in the US) in the process.
We exchanged multitrack audio cassettes (like many famous bands are said to have
done), conversed over the telephone, sent letters and parcels back and forth. But our
subsequent interactions never came close to our experience that summer. A few years
later, still eager, with our brand new e-mail accounts, we tried using the Internet to
exchange MIDI files. Digital audio streaming was still out of our reach then. However
with time, and probably because we never quite managed to collaborate with each
other like we had during our face-to-face encounters, our enthusiasm faded out (but
our friendship remained).
For many years, the problem was in the back of my mind. I had heard of companies
trying to address the issue (under the name Networked Music Performance) but was
never impressed with their results or approach. When I came to the Media Lab, I
decided to take up the challenge and find a novel solution to this problem.
Description
The main challenge I am trying to address in this thesis is how to overcome network
latency for online musical collaboration. If musicians are to play together in real-time over a computer network such as the Internet, they need to remain perceptually
synchronized with one another while data travels from one computer to another. I
attempt to tackle this problem by developing a system that is tuned to a particular
musical instrument and musical style, here the tabla, a traditional North Indian
percussion instrument. The system includes hardware and software components, and
relies on standard network infrastructure. Software processes the acoustic signal
coming out of the tabla by:
1. recognizing individual drum strokes,
2. transmitting symbolic events (instead of an audio stream) over the network,
3. extracting higher-level rhythmic features and identifying standard drumming
primitives from the input data,
4. analyzing previous patterns to predict current events, and
5. synthesizing and playing rhythmic phrases with the appropriate tempo at the
audio output.
Contributions
The research presented in this thesis has resulted in the following contributions:
I implemented a novel approach for real-time online musical collaboration,
enabled a real-world musical interaction between two tabla musicians over a
computer network,
designed a networked tabla performance system,
created a tabla phrase prediction engine, and
developed a real-time continuous tabla stroke recognizer.
This work resulted in a unidirectional playable prototype (from a musician to
a far-end listener); although the system is symmetrical, it has not been tested in
full duplex. It also produced a software simulation environment for testing and demonstration.
Preliminary evaluation results show that the system is suitable for distance education
and distributed jamming.
1.2 Scope
Context
Several systems have been developed in the past to allow musicians located in different places to play together in real-time or, more generally, to enable various forms
of online musical collaboration (refer to Section 2.1). However, notwithstanding specific experimental concerts (usually during music technology conferences with like-minded computer music buffs), widespread success has remained elusive, and many
commercial endeavors have died down (although new ones spring up every now and
then).
In spite of technological advances in digital audio and networking (the two enabling technologies in this case), most existing frameworks have hit a hard limit due
to the inherent latency in computer networks, caused by protocol and system overhead (buffering, packet switching, etc.).
Much of the literature on networked music performance mentions the speed of
light as its main obstacle. Let us perform some basic calculations to verify that
claim. The circumference of the earth being around 40,000 km (there are a few kilometers difference between the measurement at the poles and at the equator), the
maximum distance between two points on earth is 20,000 km. Taking the speed of
light as approximately 300,000 km/s, it takes almost 70 milliseconds for light to travel
between the two farthest points on the planet. To apply this measure to sound and for
comparison purposes, 70 ms represents the time it takes for sound to travel between
two points distant by almost 24 meters (340 m/s is the approximate speed of sound
in air). This distance is only slightly beyond the size of a regular concert stage or
a recording room where musicians are expected to play in synchrony. Most of the
latency actually comes from network overheads, which result in transmission times
of at least twice the theoretical speed-of-light delay (Chafe, 2003).
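Written out, the two calculations above are:

    t = d / c = 20,000 km / 300,000 km/s ≈ 67 ms
    d_sound ≈ 340 m/s × 0.07 s ≈ 24 m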
The speed of light provides a convenient hard theoretical limit, but if we add the
practical constraints of computer networks, transmission delays easily reach hundreds
of milliseconds. As an example, here are some average estimates of round trip times
using the ping tool between the US East Coast (MIT Media Lab network) and the
following locations on the Internet:
another computer on the same network (www.mit.edu): 50 ms,
a computer on the US West Coast (www.stanford.edu): 132 ms,
a computer in France (www.polytechnique.fr): 176 ms,
a computer in India (www.sify.com): 315 ms [1]

[1] Interestingly, it is rather difficult to find an academic or public-office website hosted in India that enables echo request (ping) messages.
Studies on musician synchronization (see Section 2.1) have shown that musicians
need to hear each other within a 20 ms window to remain synchronized. Therefore,
creative solutions are required to overcome the transmission delays that we find on
regular computer networks.
Barry Vercoe's original idea, NetDuet (undocumented personal communication),
gave me the idea of solving the problem of network music performance with a predictive system.
Additionally, personal observations and discussions with musicians suggest that
visual contact is not a determining factor in a musical interaction. Therefore, this
system assumes that the fact that musicians cannot look at each other while playing
together over the network does not significantly affect their collaborative effort; at
least, the purpose of their collaboration remains intact (for a caveat, see Section
5.3).
Approach
My approach in this thesis is based on the premise that in order to cancel transmission delays and enable musicians to remain synchronized, we need to predict each
musician's intent even before the sound that he or she produces reaches its far-end
destination. A realization according to this principle is enabled by technology in the
areas of machine listening and machine learning.
This approach implies that for the next musical events to be predicted a few milliseconds in advance, a suitable model of the music should be developed. Therefore
the system should be tuned to a particular musical style, and even a particular instrument. In this case, I chose the tabla, a pair of Indian hand drums, not only because
of its popularity and my familiarity with it, but also because of its intermediate
complexity as a percussion instrument: although tabla patterns are only based on
rhythmic compositions without melodic or harmonic structure, different strokes can
produce a variety of more than 10 pitched and unpitched sounds called bols, which
contribute to the tabla's expressiveness (see Section 2.2). The tabla is proposed as a
case study to prove the feasibility of my approach.
Thus, I propose to develop a computer system that enables real-time online musical collaboration between two tabla players. Although, as I mentioned, the design
presented in this thesis is specific to Indian percussion (in particular, the system was
evaluated with the tabla in the North Indian Hindustani musical style), I expect that
the principles presented here can be extended and generalized to other instruments
and cultures.
A tabla duet may not be the most common ensemble, and it might have been better to
choose two instruments that have a more fully developed duo modality of playing.
One approach could be to combine the TablaNet system on one side with a streaming
audio system (for a vocal or instrumental performance) on the other.
The point of using a real tabla instead of an electronic controller is to enable
the audience to participate in a real concert at each end of the interaction with
musicians playing on acoustic instruments. It could be interesting to add support for
music controllers or even hand claps, but not at this stage of the project.
Terminology
The word stroke used in this thesis does not stand so much for the gestures that generate a particular sound (by hitting the tabla with the hands in a specific fashion), but
designates the resulting sound itself. It should be noted that different musical schools
within the North Indian tradition use different gestures (with slightly different sounds)
for similar stroke names. This is dealt with by training the (player-dependent) stroke
recognition engine for each user.
The term musical collaboration encompasses different types of interactions two
musicians might be involved in. In the case of two tabla players in a single performance, which is a relatively unusual configuration (there is usually one drum player,
or at least one per type of drum, in traditional Indian music ensembles), I consider the
following possible scenarios: a student-teacher interaction in an educational setting,
and a call-and-response interaction in a performance setting. Rhythmic accompaniment, which is the usual role assigned to the tabla player, typically takes place with
a melodic instrumentalist or a vocalist, and will therefore not be taken into account
in the scope of this project.
In this thesis, I use the terms near-end and far-end, common in networking
(especially in the fields of videoconferencing and echo cancellation), to describe each
side of the interaction. The near-end refers to the side where the tabla signal is acquired. The far-end applies to the distant location where the transmitted signal is
played back (to another musician).
1.3 Methodology
Preliminary Studies
Before embarking on this research project, I performed two initial studies.
I conducted preliminary work where I demonstrated the concept elaborated in
this thesis by sensing vibrations on the tabla drumhead, analyzing stroke onsets, and
transmitting quantized onset events and the tempo over a non-guaranteed connectionless UDP (User Datagram Protocol) network layer. On reception of the events, the
receiver would trigger sampled tabla sounds of a particular rhythmic pattern stored
in a predefined order. This application was prototyped in the Max/MSP environment
and demonstrated in October 2006.
In addition, I developed a stroke recognition algorithm as a final project for a
class on pattern classification. This non real-time algorithm, developed in Matlab,
was used as the basis for the stroke recognition engine developed in this thesis, and
then extended for greater accuracy and implemented in C for real-time performance.
The preliminary algorithm was presented in December 2006.
The work described in this thesis was partly published in an article by Sarkar and
Vercoe (2007).
Research
The main hypotheses driving this work are:
1. playing on a predictive system with another musician located across the network
is experientially, if not perceptually, similar to playing with another musician
located in the same room in that it provides as much satisfaction to the
musicians and the audience;
2. a recognition and prediction based model provides an adequate representation
of a musical interaction; and
3. a real-time networked system suggests new means of collaboration in the areas of distance education, real-world and virtual-world interactions, and online
entertainment.
The TablaNet system described in this thesis not only develops a solution to address the problem of distant musical collaboration, but also provides a platform, an
artifact, to evaluate the previous hypotheses. These hypotheses are evaluated based
on subjective tests with users (tabla players and trained listeners).
Technology
This project involves many layers of complexity in terms of the technical fields involved (see Figure 1-1). The red boxes (second and third rows) represent the competencies that were required to develop the system. The blue boxes (fourth, fifth, and
sixth rows) represent the underlying technologies.
The mechanical study consisted in carefully selecting the optimal models of vibration sensors, as well as placing and attaching them to the tabla heads.

[Figure 1-1: Technology employed in the TablaNet system. The hardware (mechanical and electrical) and software (software engineering and algorithms) competencies rest on vibration sensors, the audio path, a real-time audio framework, networking, machine listening, and machine learning, which in turn build on DSP, pattern classification, and artificial intelligence.]

The vibration
sensors were also included in the electrical development, which dealt mainly with the
audio path originating at the sensors, up until the computer input, and then again
from the computer output to the speakers.
The software design and implementation comprised the bulk of this work. I applied standard software engineering techniques to design an audio framework with
services to access the default audio device for real-time audio, and provide audio file
input and output, and networking capabilities. The algorithms (the main part of the
research undertaken in this project) rely on machine listening and machine learning
models, which, in turn, are based upon digital signal processing, pattern classification, and artificial intelligence techniques.
1.4 Thesis Outline
Chapter two, Background, reviews the literature relevant to this work, in particular
networked music performance systems, and offers a description of the tabla in Indian
music.
Chapter three, Design, describes the TablaNet system, hardware, and software
implementation details.
Chapter four, Evaluation, discusses quantitative as well as qualitative tests performed on the TablaNet system.
Chapter five concludes this thesis by summarizing its contributions, and suggesting applications and future work.
Chapter 2
Background
This background presentation details two bodies of work relevant to the TablaNet system. First, I survey the literature related to networked musical collaboration, which
has been the subject of numerous projects. Then I describe the tabla in the context
of Indian music. Finally I review the literature on tabla analysis and synthesis which
leads to my work on tabla stroke recognition and phrase prediction.
2.1 Network Music Collaboration
Much work has been done on musical collaboration over computer networks. I organize here the relevant literature into several categories depending on the technology
employed, or on the intended application area.
I first describe notable commercial endeavors brought on by the Internet. Then I
veer off towards philosophical considerations and proposed roadmaps. I continue with
works that describe other types of musical interactions made possible by computer
networks. Then follow studies that deal with musician synchronization; studies which
have, in some instances, led to the design of architectural frameworks that convey
sensory modalities other than audition only. Then I list projects that use a particular
technology: from event-based and early MIDI systems to audio streaming with the
latest compression algorithms, via phrase looping schemes; I also look for previous
work on predictive systems. I follow it up with projects that highlight a particular
application area rather than hinge on design elements, such as distance education and
virtual studio applications. I end this section with a non-exhaustive survey of major
live network music performances that have taken place in the past ten years.
Commercial Endeavors
Since the advent of the Internet, musicians have been looking at online musical collaboration as the next killer app. In fact, this space has been and continues to be
the source of several commercial endeavors (from the defunct Rocket Network (1998)
to Ninjam (2005), Audio Fabric (2007), and Lightspeed Audio Labs (2007), a new
startup which just released a beta version of its software).
Rocket Network pretty much started it all. As early as 1998 they provided what
their marketing team called "high-end Internet recording studio technology" to audio
software companies (like Digidesign or Steinberg), which would bundle it with their
popular sequencing software (e.g. ProTools or Cubase) for musicians to use. On
closer inspection, it appears that Rocket Network did not provide real-time collaboration, but merely an asynchronous interface to play locally, then post (to a central
server); a way to extend the concept of multitrack overdubs beyond the walls of a
recording studio by reaching out across the Internet to musicians from all over the
world (Townley, 2000). Despite its revolutionary model, the limited bandwidth available on the Internet in the 1990s may have compromised its growth. The company
was bought by Digidesign, and its technology (or rather its ideas) was incorporated
into ProTools. In the same vein, VSTunnel (2005) is a sequencer plug-in that operates on much the same concept as Rocket Network (i.e. non-real-time collaboration
with a sequencer).
digitalmusician.net (2006) and their Digital Musician Link (DML) plug-in deliver
another virtual studio technology based on a peer-to-peer connection between two
users. In addition, the website acts as a virtual community portal and message board
that helps artists meet each other. Similarly, musicolab (2003) proposes a P2P file
sharing network and an upcoming (as of two years ago) community site.
Ninjam, which was developed by the authors of popular programs of the Internet age like Winamp and Gnutella, proposes a high-latency technique for jamming
similar to the one introduced by Goto and Neyama (2002). Ninjam works with a
central server and distributed clients. Each client, one on each participant's computer, records and sends a compressed audio stream (in OGG Vorbis format) to the
server. The server then sends a copy of all the streams delayed by one measure (or
one musical interval) back to every participant. In effect, each user plays along with
everyone else's previous interval. This is a creative solution to the latency problem,
but it sets external constraints on the music (for instance, a drum set and a guitar
work well together whereas several melodic instruments improvising together may
not). It is also hard to imagine a call-and-response interaction on this system (the
first musician plays one measure, then waits one measure for the other musician to
hear his contribution, then has to wait yet another one for the other musician to play
his part before finally hearing it back). As the Ninjam website puts it, it is "part
tool, part toy."
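To make the interval-delay idea concrete, here is a conceptual C sketch of the server-side behavior as described above; it is not Ninjam's actual code, and the interval length, client count, and equal mixing gains are arbitrary assumptions:

    /* Conceptual sketch of one-interval delayed mixing; NOT Ninjam's actual code.
       Interval length and client count are arbitrary assumptions. */
    #include <string.h>

    #define NUM_CLIENTS      4
    #define INTERVAL_SAMPLES (44100 * 2)   /* e.g. one two-second musical interval */

    static float prev[NUM_CLIENTS][INTERVAL_SAMPLES]; /* audio from the previous interval */
    static float curr[NUM_CLIENTS][INTERVAL_SAMPLES]; /* audio being recorded right now */

    /* Called at each interval boundary: every participant receives a mix of what
       everyone played one interval ago, then the buffers rotate. */
    void interval_boundary(void (*send_to_all)(const float *, int))
    {
        static float mix[INTERVAL_SAMPLES];
        memset(mix, 0, sizeof mix);
        for (int c = 0; c < NUM_CLIENTS; c++)
            for (int i = 0; i < INTERVAL_SAMPLES; i++)
                mix[i] += prev[c][i] / NUM_CLIENTS;  /* one-interval-old audio */
        send_to_all(mix, INTERVAL_SAMPLES);
        memcpy(prev, curr, sizeof prev);             /* curr becomes the new prev */
    }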
In February this year (2007), Audio Fabric released its real-time network jamming platform. According to the company website, the system deals with latency by
utilizing low-delay audio codecs. By targeting musicians within constrained geographical boundaries (coast to coast within the US, for instance), Audio Fabric claims
to keep delays low enough for musicians to play together in real-time.
Kon and Iazzetta (1998) establish the convergence of multimedia and networking
technologies that enable distributed musical applications, and describe the challenges
and opportunities of Internet-based interactive music systems.
In his work on interconnected music networks, Weinberg (2001, 2002, 2005a,b),
who has been involved with them since his MIT Media Lab days, discusses their
philosophical and aesthetic foundation with a particular focus on social dynamics
and group interdependency towards music composition or improvisation. After a
historical review of compositional works that highlight the role of interconnected networks, Weinberg describes two novel network music controllers that he developed:
the Squeezables and the Beatbug Network. Furthermore, Weinberg argues that the
network architecture influences the way musicians interact. He then goes on to propose various topologies and describes the kind of interactions that each one supports.
Other studies discuss factors that play a role in latency perception and tolerance (e.g. timbre, familiarity with a piece or an instrument).
Open Sound Control (OSC) originated at CNMAT (University of California, Berkeley) and was subsequently used as a communication protocol for the majority of
event-based network music performance systems (as a successor to the less powerful
and less flexible MIDI format). A paper on the state of the art of OSC (Wright et al.,
2003) proposes a tutorial overview and lists several implementations made available.
Quintet.net is an interactive networked multimedia performance environment that
lets five musicians play together over a network (Hajdu, 2007). By using digital
event-based instruments (MIDI synthesizers and sequencers, sensors) and playing
compositions tailored to the limitations of the Internet as a medium, the system combines composition and improvisation with a conductor-based approach to synchronize
control streams produced by the musicians and enable live musical collaboration.
They also developed a system that allowed fluctuating tempo for more expressive performances (Yoshida et al., 2004).
Interestingly, most of this research is based in Japan, with the exception of Ninjam's implementation. Coincidence, or could there be a cultural reason?
Predictive Systems
CCRMA's Chris Chafe, always interested in network musical performance, did some
early research on the prediction of solo piano performance (1997). There he describes his attempt at modeling human aspects of a score-based musical performance,
such as note timing and velocity. Covariance analysis is used to predict closeness
between a candidate interpretation and stored ones, thus enabling anticipation when
confronted with a delayed transmission.
Distance Education
As early as 1999, Young and Fujinaga describe a system for the network transmission of a piano master class by MIDI over UDP. In a similar work, Vanegas (2005)
proposes a MIDI-based pedagogical system that brings together a music teacher and
a student via the Internet. Although he does not propose a solution against network
latency, he acknowledges that the problem exists even with a MIDI-based system.
Virtual Studios
In her master's thesis at the MIT Media Lab, Lefford (2000) tackles cognitive and
aesthetic issues surrounding network music performance in the particular context of
a producer recording an overdub from a distant musician.
2.2 The Tabla in Indian Music
Indian music encompasses an extremely wide variety of musical traditions and styles.
Considering the focus of the TablaNet system, I shall present in this section aspects
of the tabla and its place within North Indian classical music that are relevant to the
technical discussions that follow.
Notice the emphasis here: I mean "learn about Hindustani music," rather than "learn
Hindustani music."
A central concept is the bol: the verbal syllable, used to name both pitches and elements of rhythm.
The Tabla
"I think there are more excellent tabla players today than ever before, and
they are doing great new things with the tabla!" (Zakir Hussain)
Hand drums are essential to Indian music, and they are truly a living tradition
with many students and practicing musicians worldwide. The tabla is the main drum
in North India. It consists of a pair of drums (see Figure 2-2). The right-hand
drum is called the dayan and the left-hand drum is called the bayan (as seen from
the player's perspective). One principal acoustic feature of the tabla is its ability to
contrast open (khula baj) and closed (bandh baj) sounds on the bayan, which is important in recognizing thekas and has probably contributed to the widespread success
of the tabla.
It must be noted that the positions of the drums (left and right) are described here
from the point of view of the player, not the audience. Interestingly, the literature
varies on the adopted convention, probably depending on whether the author is a
tabla player or not.
Table 2.1 lists some of the most common bols and indicates how they are played.
2.3 Tabla Analysis and Synthesis
In this section, I review the literature related to tabla sound analysis and synthesis.
Interestingly, there is some overlap in the list of researchers mentioned in the survey
on Network Music Collaboration (Section 2.1) and the one in this section. However,
apart from some, like Kapur, most of the overlapping researchers seem to have worked
on the two areas independently.
This section starts by introducing studies on the recognition of spoken tabla bols,
and follows with works on the recognition of tabla strokes (also called bols; I make
an arbitrary distinction in the section titles here).
Temporal and spectral features were extracted and fed to a neural net. The researchers reported around 90% recognition accuracy. In 2005, Tindale published a
survey of related work in drum identification and beat detection; most of the studies
they mention are presented in this section. Later work by the same authors describes
percussion sound recognition using pattern recognition techniques (ZeroR, Gaussian,
k-nearest neighbor, and neural nets).
Van Steelant et al. (2004) used Support Vector Machines (SVM) to recognize percussive sounds.
Tabla acoustics
The first modern account of the tabla's tonal quality can be found in Nobel laureate C.V. Raman's work (1920; 1935). In his two papers, Raman mentions that the
tabla (like the mridangam, a longitudinal South Indian percussion instrument with
two heads, similar to the tabla's, opposite each other) differs from other percussion
instruments, which usually produce inharmonic overtones, in that it gives harmonic
overtones having the same relation of pitch to the fundamental tone as in stringed
instruments. In fact, he highlights the importance of the first three to five harmonics
which are derived from the drumhead's vibration modes, in the instrument's sound.
He describes how the position of the fingers and the type of stroke on the drumhead
excite the membrane along some of its nodes and add different harmonics to the sound.
Although Bhat (1991) specifically studies the mridangam, he develops a mathematical model of the membrane's vibration modes that could well be applied to
the tabla. In particular, his model explains the harmonic overtones found in the
mridangam, which account for its tonal quality, when its membranes are excited by
particular strokes.
Malu and Siddharthan (2000) confirmed C.V. Raman's observations on the harmonic properties of Indian drums, and the tabla in particular. They attribute the
presence of harmonic overtones to the central loading (black patch) in the center
of the dayan (the gab). This black patch is also present in the bayan, but there it
is placed asymmetrically. They solve the wave equation for the tabla membrane and
identify its vibration modes.
Tabla controllers
This section is concerned mostly with systems that convert musical gesture into sound
in the realm of percussion instruments.
In 1995, Hun Roh and Wilcox developed (as part of Hun Roh's master's thesis at
MIT) a system for novices to discover tabla drumming. A rhythmic input tapped on
a non-specific MIDI controller is mapped to a particular tabla phrase using an HMM.
The tabla phrase that corresponds to the drumming pattern is then played back with
the appropriate bol sounds.
Kapur (2002); Kapur et al. (2003a, 2004) proposed the ETabla (along with the
EDholak (the dholak is another Indian percussion instrument) and the ESitar). The
ETabla identifies strokes by capturing gestures with Force Sensing Resistors (FSRs)
placed on a non-acoustic tabla controller head. Strokes are recognized using a tree-based classifier. The ETabla allows traditional as well as new performance techniques
and triggers sounds (synthesized with Essl's banded waveguide model) and graphics.
Beat tracking
The seemingly simple human skill of identifying the beat in a piece of music is actually a complex cognitive process that is not yet completely understood. In fact, less
experienced musicians sometimes find it difficult to identify the correct beat, tapping instead at twice or half the speed of professional musicians.
Similarly, beat tracking proves to be a difficult task for computers.
Allen and Dannenberg (1990) propose a survey of artificial beat trackers. Then
they describe their system which performs more accurately than algorithms based on
perceptual models by adding a heuristic function that simultaneously evaluates various
interpretations of a single piece.
More recently, Dannenberg (2005) took a more intuitive approach to music analysis by combining beat location and tempo estimation with elements of pitch tracking
and even genre identification to simulate the holistic aspect of human auditory
perception and provide additional constraints to the problem of beat tracking. He
reports improved results.
Goto took a path parallel to Dannenberg's. In 1995, Goto also worked on a
multiple-agent architecture for beat tracking: his system evaluated multiple hypotheses simultaneously. In 2001, he used three kinds of musical analysis (onset times,
chord changes, and drum patterns) to identify the hierarchical beat structure in music
with or without drums.
In his MIT Media Lab PhD thesis, Jehan (2005) makes use of a perceptually
grounded approach to onset detection. He computes and sums the first-order difference of each spectral band, and ignores transients within a 50 ms window (because
they fuse into a single event). By smoothing the resulting function, he obtains peaks
that correspond to onsets. He observes that onsets also correspond to local increases
in loudness. Armed with his onset detection algorithm, Jehan proceeds to his beat
detection and tempo estimation phase, for which he uses a causal and bottom-up approach (based on signal processing rather than a-priori knowledge).
Commercial Applications
In the past ten years, Indian music has seen the widespread adoption of electronic
sound generators for instruments like the tampura (background drone), the sruti box,
which sets the reference tone for a performance, and also the tabla. To my knowledge, two companies (Radel and Riyaz) have developed electronic tabla devices. These
boxes come with a variety of presets that produce quite realistic (but static) rhythmic
phrases to accompany amateur instrumental or vocal performances.
Swarshala and Taalmala provide a software environment for personal computers
for learning, playing and composing tabla performances. Their method of sound generation is undocumented, but Swarshala provides several control parameters (pitch
bend, etc.) for individual sounds that rule out simple sample playback.
Chapter 3
Design
3.1 System Architecture
The TablaNet system developed during the course of my research resulted in a prototype that includes basic functionality for training, recognition, and prediction. I
programmed a software simulation environment for testing the system and evaluating
the hypotheses formulated in Section 1.3.
I used the following resources to build this project:
a tabla set (from Prof. Barry Vercoe's collection),
microphone, pre-amplified mixing console, audio components and cables,
piezoelectric vibration sensors with various characteristics,
audio speakers,
my Mac PowerBook laptop for development and evaluation, and
Xcode, the integrated development environment for Mac OS X.
The TablaNet system architecture is described in Figure 3-1. At the near-end,
a pair of sensors (one for each drum) captures the strokes that are played on the
tabla. The signals from both the sensors are mixed and pre-amplified, and sent to
the Analog-to-Digital converter on the near-end computer. After processing the input audio signal, the computer sends symbols over the network to a far-end computer
installed with the same software. The receiving computer interprets the events transmitted over the network and generates an appropriate audio output. The system is
symmetrical and full duplex so that each musician can simultaneously play and listen
to the musician at the other end.
Timekeeping happens independently at each end. The system's goal is to synchronize the tabla output with the beat extracted from the incoming tabla rhythm.

[Figure 3-1: TablaNet system architecture. At each end, sensors on the tabla feed a mixer/amplifier connected to a computer system; the two computers communicate over the network, and each drives a speaker.]
3.2 Hardware Setup
The fact that the signals from the two drums are mixed is not an issue because strokes that are played on the right drum
(dayan) are distinct from those played on the left drum (bayan), and from those
played on both drums simultaneously (see Table 2.1). Moreover, because the bayan
produces lower pitched sounds than the dayan, the sound of each drum can be separated in the spectral domain in spite of some overlap.
Piezoelectric sensors generate an electric charge in response to mechanical stress
(e.g. vibrations). Contact microphones that are made of piezoelectric material pick
up vibrations through solid materials rather than airborne sound waves, and convert
them into an electric signal similar to that of a microphone output.
[Figure 3-2: Piezoelectric film sensor element (from Measurement Specialties, Inc.). The DT series elements are rectangular pieces of piezo film with silver ink screen-printed electrodes, available in a variety of sizes and thicknesses. The sizes reproduced from the datasheet in the figure, with A/C the film width/length and B/D the electrode width/length:]

Part                  A in (mm)   B in (mm)   C in (mm)    D in (mm)    t (µm)   Cap (nF)
DT1-028K/L w/rivets   .64 (16)    .484 (12)   1.63 (41)    1.19 (30)    40       1.38
DT1-052K/L w/rivets   .64 (16)    .484 (12)   1.63 (41)    1.19 (30)    64       0.740
DT2-028K/L w/rivets   .64 (16)    .484 (12)   2.86 (73)    2.42 (62)    40       2.78
DT2-052K/L w/rivets   .64 (16)    .484 (12)   2.86 (73)    2.42 (62)    64       1.44
DT4-028K/L w/rivets   .86 (22)    .740 (19)   6.72 (171)   6.13 (156)   40       11.00
DT4-052K/L w/rivets   .86 (22)    .740 (19)   6.72 (171)   6.13 (156)   64       5.70
I used piezo film sensors with lead attachments from Measurement Specialties, Inc.
(Figure 3-2). Their catalog proposes a variety of lengths and thicknesses. Additionally, the films can either be plain (with a thin urethane coating to prevent oxidation)
or laminated (with a thicker polyester layer, which develops a much higher voltage than
the non-laminated version when flexed but happens to be less sensitive to high frequencies, probably because of higher inertia). The lead attachments provide wiring
from the sensors, which is useful because high temperatures during soldering can
damage the films. I experimented with the following elements: non-laminated (DT
series) of length 1.63, 2.86, and 6.72 inches (in both 40 and 64 µm thicknesses);
laminated (LDT series) of length 1.63, 2.86, and 6.72 inches (205 µm thickness).
I also tried different double-sided tape thicknesses. The thinner films worked better
but were found to be fragile (their silver ink coating would come off with the tape)
and therefore had to be changed after being pasted and then removed a few times
from the tabla heads. The thicker films were more robust in that way.
Each computer has an external or built-in amplified speaker to play the audio
output estimated from the other end.
3.3 Software Implementation
This section describes the software design and implementation, and provides an
overview of the external libraries used.
The computer program at the near-end runs the code to extract features from
the audio input, and to classify tabla strokes based on those features. The application then transmits the data (symbols representing the recognized strokes and their
timing) to the far-end computer over the Internet or an IP network. The receiver reassembles the packets, and generates a tabla phrase in real-time based on the events
received up to that point in time. The software is written in C (GCC) with some
offline processing implemented in Matlab.
The C code is developed in Apple's Xcode integrated development environment. I
make use of the following free, cross-platform, and open source third-party libraries:
PortAudio (MIT license, compatible with the GNU GPL)
libsndfile (released under the terms of the GNU LGPL)
FFTW, the Fastest Fourier Transform in the West, developed at MIT (GNU
GPL license)
PortAudio is a multi-platform wrapper for real-time audio input/output. It provides a convenient way to access platform-specific audio devices through a callback
interface.
libsndfile is an audio file wrapper that offers a read/write interface to WAV, AIFF
and other types of audio files.
FFTW was used for FFT computations.
In addition, mathematical computations within my code use the math.h standard
library.
Audacity is a stand-alone open-source audio recording and processing package
that was used to record audio data in WAV or AIFF format and visualize its spectrum.
The software is implemented as two processes, one called TN Tx (TablaNet Transmitter) that runs on the near-end computer, and one called TN Rx (TablaNet Receiver) on the far-end computer.
The analysis modules, which convert the incoming audio signal into symbols, are
collectively called the Transmitter. On the other side, the Receiver contains the modules that listen to the network and convert incoming symbols back into an audio signal
for playback.

[Figure 3-3: Transmitter functional diagram. The audio input is pre-processed; stroke segmentation provides timing information and frames for feature extraction (PSD); beat tracking and tempo estimation produce high-level events (tempo); bol training, bol classification, and pitch tracking produce mid-level events (bol) and low-level events (pitch); all events are combined into an event structure (symbol).]

[Figure 3-4: Receiver functional diagram. The incoming event structure (symbol) is demultiplexed into high-level (tempo), mid-level (bol), and low-level (pitch) events; together with a-priori knowledge and the bol history, the phrase generator triggers the next bol and drives control parameters ("pitch bend") of the sound synthesis engine, which produces the audio output.]
Figures 3-3 and 3-4 are high-level functional diagrams that represent
the tasks undertaken by the Transmitter and the Receiver. Both Transmitter and Receiver are present on the near-end and the far-end computers as the system operates
symmetrically in full duplex. The boxes with dashed lines are not fully implemented
in the current version of the system.
The software block diagram of the Transmitter is presented in Figure 3-3. In the
first step, the audio input is preprocessed (buffered, and converted from a time-domain
to a frequency-domain representation). The onset detector is based on an envelope
follower, and individual drum sounds are segmented into frames. Then, audio features
are extracted from each frame and combined to form a feature vector of reasonable
size. The features consist of spectral domain components for bol recognition, pitch
tracking for general tabla tuning and occasional pitch slides on the bayan (not implemented in the current version of TablaNet), and tempo data computed from the
onset timings (only timing differences are considered in this version). The bol classifier runs on each frame that contains a stroke. The identified inter-related events
that make up the incoming audio stream are combined into a data structure, and
sent asynchronously over the network through a callback mechanism. The Transmitter subjects the raw audio signal to perceptually-motivated digital signal processing
algorithms for bol classification, tempo estimation, and pitch tracking.
The Receiver, on the other hand, operates only on symbolic data. When the
event data structure reaches the Receiver through the network, the various categories
of events are demultiplexed. Individual bols influence the tabla phrase generator,
which estimates the most probable rhythmic pattern (i.e. the next stroke and its
timing) to be played locally. This module keeps track of the previous sequences of
bols with a logging mechanism. Tempo changes and pitch variations also contribute
to the dynamic adaptation of the computer's generative model to the far-end musician's playing style. A-priori knowledge, in the form of grammatical rules for the
tabla, also constrains the phrase generator and its predictions. A hierarchical structure
with a weighting scheme is used to represent the different levels of a tabla performance
structure (i.e. the stroke level, the phrase level, and the composition level). Then the
phrase generator triggers the appropriate sample in the sound synthesis engine at the
appropriate time. Low-level events such as pitch variations control parameters of the
sound synthesis engine.
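The thesis does not give the generative model at code level. As a minimal C sketch of the stroke-level layer alone, a bigram table over the eight bols could drive the prediction; the table and the crude +1 prior are my assumptions, and the actual system also weighs phrase- and composition-level structure:

    /* Minimal sketch of stroke-level (bigram) next-bol prediction. */
    #define NUM_BOLS 8

    static unsigned bigram[NUM_BOLS][NUM_BOLS]; /* counts: bol b following bol a */

    /* Update the history with an event received from the far end. */
    void observe(int prev_bol, int bol)
    {
        bigram[prev_bol][bol]++;
    }

    /* Most probable next stroke given the last one; the +1 acts as a crude
       uniform prior so unseen transitions are never impossible. */
    int predict_next(int prev_bol)
    {
        int best = 0;
        unsigned best_count = 0;
        for (int b = 0; b < NUM_BOLS; b++) {
            unsigned c = bigram[prev_bol][b] + 1;
            if (c > best_count) {
                best_count = c;
                best = b;
            }
        }
        return best;
    }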
The software architecture relies on a set of audio input buffers to perform various
computations. The output buffers (at the Receiver side) are straightforward so I do
not discuss them here. The first set of input buffers is inaccessible to the application directly. They are managed by the audio device driver. PortAudio offers an
encapsulation of these low-level buffers by copying their content asynchronously to
an application-level buffer through a callback function. This first application buffer
is a 3-second ring buffer (with a write pointer and multiple read pointers) that
collects the audio samples provided by PortAudio whenever new data is available.
After an initial delay to make sure that the ring buffer has started to fill up, a synchronous polling loop checks that new audio data is available in the ring buffer, and
copies a 512-sample frame into a circular array of eight 50% overlapping buffers. A
512-sample length window is applied to each buffer, and the snippet of time-domain
signal contained in it is converted to the frequency domain. This set of buffers checks
for the presence of an onset (see Section 3.4 for more details on the onset detection
algorithm). When an onset is detected, the samples from frame n (where
the onset was detected) and frame n + 2, which together span 1024 non-overlapping samples, are copied to another buffer of length 1024 for further
processing (again see Section 3.4 for additional details from this point onwards).
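A minimal C sketch of this callback-to-ring-buffer hand-off with PortAudio follows; the buffer sizes come from the text, while the function names, the single read pointer, and the lack of error handling are simplifications of mine:

    /* Sketch of the audio callback feeding the application ring buffer. */
    #include <portaudio.h>

    #define SAMPLE_RATE   44100
    #define RING_SAMPLES  (3 * SAMPLE_RATE)   /* 3-second ring buffer */
    #define FRAME_SAMPLES 512
    #define HOP_SAMPLES   (FRAME_SAMPLES / 2) /* 50% overlap */

    static float ring[RING_SAMPLES];
    static volatile long write_pos = 0;  /* advanced only by the audio callback */
    static long read_pos = 0;            /* one of possibly several read pointers */

    static int pa_callback(const void *input, void *output, unsigned long n,
                           const PaStreamCallbackTimeInfo *time_info,
                           PaStreamCallbackFlags flags, void *user)
    {
        (void)output; (void)time_info; (void)flags; (void)user;
        const float *in = (const float *)input;
        for (unsigned long i = 0; i < n; i++)
            ring[(write_pos + i) % RING_SAMPLES] = in[i];
        write_pos += n;
        return paContinue;
    }

    /* Synchronous polling loop: hand out 512-sample frames, hopping by 256. */
    void poll_frames(void (*process)(const float frame[FRAME_SAMPLES]))
    {
        while (write_pos - read_pos >= FRAME_SAMPLES) {
            float frame[FRAME_SAMPLES];
            for (int i = 0; i < FRAME_SAMPLES; i++)
                frame[i] = ring[(read_pos + i) % RING_SAMPLES];
            process(frame);
            read_pos += HOP_SAMPLES;
        }
    }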
Let us discuss the choice of buffer sizes and their relation with musical structures
such as pitch and tempo. With a 44.1 kHz sampling rate, the 512-sample onset
detection buffers overlap by 256 samples, giving a hop of approximately 6 ms,
which is the quantization step for stroke detection. The 2048 samples contained in
the onset detection buffer array correspond to around 48 ms, which is the system's
smallest stroke interval. Considering that, at the fastest speed, tabla players produce
one stroke every 80 to 100 ms, this buffer duration provides ample room even at high
playing speeds. On the pitch detection front, the two 512-sample buffers in the frequency domain allow for the detection of frequencies as low as 44100/512 ≈ 86 Hz,
which is fine for tabla sounds. It must be noted that the buffer lengths, although they provide
enough data for low-latency digital signal processing algorithms, are too short for
human perception (in terms of rhythmic structure or pitch resolution).
The current version of the system does not have a graphical user interface. However, I experimented with the GTK+ toolkit and am considering providing a GUI based on
it for easier access to the training, recognition, and prediction modules, and related
parameters. In particular, the GUI could provide an interface to save and load sessions, and thereby retrieve historical data for a particular user.
3.4 Tabla Stroke Training and Recognition
The problem of tabla stroke training and recognition is basically one of supervised
learning and pattern classification. This section defines the problem and identifies its
constraints, proposes a solution, and then describes each of the components of the
solution that was implemented.
Design Principles
The tabla stroke recognition engine is similar in concept to a continuous speech recognizer, although it does not implement advanced features of state-of-the-art speaker-independent speech recognizers (such as dealing with co-articulation between consecutive phonemes). Furthermore, I impose external constraints that further limit the
similarities between the two.
Phonemes are the basic units of speech. Speech recognizers extract features from
phonemes, and then usually train and use a Hidden Markov Model (HMM) (Rabiner,
1989) which maps a sequence of phonemes to a word. Then, words are assembled into
sentences, and homophones can be distinguished based on the grammatical context.
In the same way, tabla bols can also be combined to form word-like multi-syllabic
bols (e.g. TiRaKiTa or DhaGeDhiNa) that usually fit within one beat. However, in
the TablaNet system, I do not consider multi-syllabic bols as one compound bol, but
rather treat them as a concatenation of multiple bols (with an associated timing, a
fraction of a beat).
Most modern speech recognizers are continuous and speaker-independent. Although the recognizer in the TablaNet system is also continuous (it recognizes sequences of bols within rhythmic phrases, not necessarily in isolation), it is player-dependent. There are two reasons for this: the first is that for a recognizer to be
user-independent, it would have to be trained on a statistically significant amount of
data and I did not collect such a large dataset; the second one is that, depending on
the musical school (gharana) the player comes from, some bol names may correspond
to different strokes.
The stroke training phase corresponds to a supervised learning scheme where labels are provided. At the beginning of the session, the tabla player plays each bol
three times in a continuous sequence. The player is asked to play at his or her own
speed (a speed he or she is comfortable with) so that it resembles a real performance
(e.g., with partial overlaps: the decay of a stroke merging with the attack of the next
stroke). I have limited the system to 8 bols. Since the model is trained for each
player, it does not matter which bols the player chooses. However the system was
tested with the following bols: Ta Tin Tun Te Ke Ge Dha Dhin.
The part of the system design presented in this section received the most
attention and development time. I was able to perform preliminary studies (implementation of an offline stroke recognizer in Matlab, and user listening tests) which
provided some insights into the problem and informed the design principles presented
here.
The tabla stroke recognition literature presented in Section 2.3 mentions both time-domain
and frequency-domain features that have been used successfully. I also tested various
features in my preliminary test, including zero crossings (ZCR), power spectral density
(PSD), and Mel-frequency cepstrum coefficients (MFCC). The latter are widely used
in speech recognition, but didn't perform well in my experiment with tabla strokes.
In my early tests, PSD (512 bins reduced to 8 with principal component analysis)
performed the best. Therefore, I chose to use spectral density-derived values for my
feature vector. Details follow in the coming sections.
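As a preview of those details, here is a C sketch of what the feature computation could look like; the 257-bin real-FFT layout and the externally stored PCA projection matrix (computed offline, e.g. in Matlab) are my assumptions:

    /* Sketch of the feature extraction: spectrum -> PSD -> 8-value projection. */
    #define NUM_BINS     257   /* unique bins of a 512-point real-input FFT */
    #define NUM_FEATURES 8

    extern const float pca[NUM_FEATURES][NUM_BINS]; /* stored principal components */

    void extract_features(const double re[NUM_BINS], const double im[NUM_BINS],
                          float feat[NUM_FEATURES])
    {
        float psd[NUM_BINS];
        for (int k = 0; k < NUM_BINS; k++)   /* power spectral density per bin */
            psd[k] = (float)(re[k] * re[k] + im[k] * im[k]);

        for (int f = 0; f < NUM_FEATURES; f++) {  /* project onto 8 components */
            feat[f] = 0.0f;
            for (int k = 0; k < NUM_BINS; k++)
                feat[f] += pca[f][k] * psd[k];
        }
    }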
The insight behind this choice of features is that the strokes evolve in
time. The attack transients contain almost all frequencies, and then the spectrum
settles down, with some strokes exhibiting their tonal property (i.e. their fundamental frequency F0 and harmonic overtones are visible in their spectrogram). Dividing
the time window into smaller frames helps take this time-varying information into
account.
As far as the recognition method is concerned, the literature describes the following
approaches for tabla stroke recognition:
hidden Markov models (HMM),
neural networks,
decision trees,
multivariate Gaussian (mvGauss) model,
k-nearest neighbor (kNN),
kernel density (KD) estimation,
canonical discriminant analysis (which is really a dimensionality reduction method),
support vector machines (SVM),
expectation-maximization (EM), and others.
This list actually covers most of the machine learning algorithms!
In my preliminary study, I compared the performance of three pattern classification techniques: k-nearest neighbor, naïve Bayes, and neural nets. In the current
system, I chose to implement the k-nearest neighbor (kNN) algorithm not only because it had performed the best in my study, but also because it displayed results
close to human perception. In fact, the confusion matrix (refer to Section 4.1) showed
that strokes that are difficult to distinguish for humans (actually more so for beginner tabla players than for expert musicians) posed similar difficulties for a kNN-based
machine recognizer.
Whereas naïve Bayes builds a model and computes parameters to represent each
class (each stroke), kNN is an example of instance-based learning where the feature
vectors extracted from the training dataset are stored directly. Although it has drawbacks such as its memory footprint and computational complexity, the simplicity of my requirements (eight bols, three eight-dimensional feature vectors per stroke) makes this choice well suited for my application.
Algorithm Implementation
[Figure: stroke recognition pipeline. The audio input buffer feeds windowing & FFT; onset detection fills the stroke buffer; segmentation yields frames whose PSD is computed and reduced with PCA; a training branch stores the feature matrix, while the recognition branch runs kNN to label the stroke; stroke timing is computed from the onsets and assembled with the label into the bol output structure.]
Principal component analysis keeps the feature vector dimension tractable considering the limited amount of training data. In the training phase, the reduced feature vectors are arranged in matrix form and stored in a file for future reference. In the recognition phase, the matrix previously stored during the training phase is retrieved and used in the kNN algorithm. The algorithm outputs the label of the recognized stroke. Then, both the timing information from the stroke onset and the stroke label are placed into a structure for transmission.
The initial symbol structure contains the bol label and timing (number of frames
since last stroke). Additional data could later include the stroke amplitude or power
information (to extract accent information for instance), and its (time-varying) pitch
contour.
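To make this concrete, a minimal C sketch of such an event structure might look as follows; the field names and types are my own assumptions, not the actual TablaNet code:

/* Sketch of the symbol structure sent over the network for each stroke.
   Field names and types are illustrative assumptions. */
#include <stdint.h>

typedef struct {
    char     bol;               /* recognized stroke label, 'a' through 'h' */
    uint16_t frames_since_last; /* relative timing: frames since the previous onset */
    /* possible extensions mentioned in the text (not implemented):
       stroke amplitude or power for accent information,
       and a time-varying pitch contour */
} bol_event_t;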
Stroke Detection
I present here details of the windowing function, the fast Fourier transform (FFT)
implementation, the onset detection, the stroke segmentation, and the stroke timing
computation.
The initial buffer frame size is 512 samples (see Section 3.3 for a discussion on
buffer sizes). A Hamming window (Equation 3.1) is applied to the samples in the
frame.
$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right) \qquad (3.1)$$

where w is the window function output, n the sample number, and N the frame size.
The windowed audio input frame is computed by Equation (3.2).
$$x[n] = y[n] \cdot w[n] \qquad (3.2)$$

where y[n] is the raw input sample. The discrete Fourier transform (DFT) represents a time-domain signal as a sum of sinusoids of varying amplitude and phase (in the frequency domain). In mathematical terms, the DFT equation appears in (3.3).
$$X[k] = \frac{1}{N} \sum_{n=0}^{N-1} x[n]\, e^{-\frac{2\pi j}{N} k n}, \qquad k = 0, \ldots, N-1 \qquad (3.3)$$

Since the input x[n] is real-valued, the DFT is conjugate-symmetric:

$$X[k] = X^{*}[N-k] \qquad (3.4)$$

where X* denotes the complex conjugate of X, and the indices are modulo N. In particular this means that the DC component (at k = 0) and the Nyquist component (at k = N/2) are real-valued, and only half of the other elements are required to be evaluated and stored (the other half can be recomputed from them).
The FFT is computed with the FFTW software package (see Section 3.3). Its output is non-normalized, meaning that it is not scaled by 1/N as in Equation (3.3). This does not matter because we primarily use the FFT result for comparison purposes between audio frames or feature vectors.
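As an illustration, the windowing and transform steps could be wired to FFTW as in the following sketch; this is my reconstruction under simplified buffer handling, not the actual TablaNet code:

/* Sketch: Hamming-windowed frame passed to FFTW (Equations 3.1-3.2). */
#include <math.h>
#include <fftw3.h>

#define N 512  /* frame size in samples */

static double       in[N];
static fftw_complex out[N/2 + 1]; /* only N/2+1 bins stored: conjugate symmetry (Eq. 3.4) */
static fftw_plan    plan;

void fft_init(void)
{
    plan = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);
}

/* Window one frame of raw samples y[0..N-1] and compute its spectrum in out[]. */
void window_and_fft(const double *y)
{
    for (int n = 0; n < N; n++) {
        double w = 0.54 - 0.46 * cos(2.0 * M_PI * n / (N - 1)); /* Eq. 3.1 */
        in[n] = y[n] * w;                                        /* Eq. 3.2 */
    }
    fftw_execute(plan); /* unnormalized output, as noted above */
}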
The onset detection problem found in the TablaNet system is a simplified version
of the generalized onset detection problem, which is still an active research area. I take advantage of the following assumptions: the audio input only contains tabla strokes (especially when using vibration sensors, which do not pick up ambient sounds), and spurious transients (due to electrical noise, or fingers touching the sensors, for example) are of short duration compared to tabla strokes. Therefore I am
able to simply check for an increase in energy between the current frame and the
previous one, and a slight drop at the next frame (when the stroke reaches its steady
state). I do not need to use more advanced perceptually-motivated techniques here.
A preliminary study of onset detection, where I had used a simpler technique of
detecting an increase in time-domain amplitude, performed very poorly, with either a high rate of false positives or a high rate of false negatives, depending on the threshold value.
The total energy of the frame is computed in Equation (3.5).

$$E_{\text{frame}_i} = \sum_{k=0}^{N-1} |X[k]|^2 \qquad (3.5)$$
Then the system compares the energy of the current frame $E_{\text{frame}_i}$ with the energy of the previous frame $E_{\text{frame}_{i-1}}$. If $E_{\text{frame}_i} \geq 3\,E_{\text{frame}_{i-1}}$, we verify that the next frame sees a decrease in energy: $E_{\text{frame}_{i+1}} < E_{\text{frame}_i}$. If both conditions are satisfied, the system reports an onset. The factor 3 in the first condition was chosen based on empirical observations. This algorithm performed with close to 100% detection accuracy (based on ad-hoc testing).
Once an onset has been detected, the current frame and the next non-overlapping
frame are stored in a frequency domain stroke buffer for further processing (i.e. stroke
segmentation).
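A compact sketch of this energy-based test, assuming FFT frames as computed above (the factor of 3 is the empirical threshold from the text; the surrounding buffering is simplified and my own):

/* Sketch of the energy-ratio onset test (Equation 3.5 and the two conditions above). */
#include <fftw3.h>

#define THRESHOLD 3.0 /* empirical factor from the text */

/* Sum of squared magnitudes over the stored bins; proportional to Eq. 3.5,
   which is enough since energies are only compared against each other. */
double frame_energy(const fftw_complex *X, int bins)
{
    double e = 0.0;
    for (int k = 0; k < bins; k++)
        e += X[k][0] * X[k][0] + X[k][1] * X[k][1]; /* |X[k]|^2 */
    return e;
}

/* Returns 1 if frame i is an onset, given the energies of frames i-1, i,
   and i+1 (one frame of lookahead). */
int is_onset(double e_prev, double e_curr, double e_next)
{
    return (e_curr >= THRESHOLD * e_prev) && (e_next < e_curr);
}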
At this point, the timing of the current stroke (or rather its onset) is computed by
counting the number of frames since the last onset. Based on a 256-sample overlap
between frames (at 44.1 kHz sample rate), the stroke quantization has a resolution of
slightly less than 6 ms, which, according to my experiments, appears to be sufficient
for the musicians who have used the system (see Section 4.2 and user comments in
Appendix A). The relative timing (the delta) of each stroke is computed.
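As a quick check, the quoted resolution follows directly from the 256-sample hop at the 44.1 kHz sample rate:

$$\Delta t = \frac{256}{44100\ \text{Hz}} \approx 5.8\ \text{ms}$$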
Feature Extraction
Feature extraction is performed on frequency-domain frames of length 1024. In my
preliminary study, I treated each stroke as a stationary random process (i.e. I did not
distinguish between the attack phase, the steady state, and the decay). I computed
the PSD using the periodogram method. Since I now treat the signal as time-varying
(the first frequency-domain analysis frame corresponds to the attack, and the second
frame corresponds to the steady state), I am not able to use the PSD formula (because it is non-stationary). Therefore I treat the sequence as periodic, compute its
DFT to make a discrete spectrum, and then evaluate its spectral density (Equation
3.6).
In this case, the windows don't overlap: the middle portion (the transition between the noisy attack and the steady state) is not taken into account. Moreover, frames are used for comparison purposes, not time-domain reconstruction.
$$S_{\text{frame}_i}[\omega] = \frac{1}{2\pi} \left| \sum_{n} x[n]\, e^{-j\omega n} \right|^2 = \frac{X[\omega]\, X^{*}[\omega]}{2\pi} \qquad (3.6)$$

where $S_{\text{frame}_i}$ is the spectral density of the current frame i, and ω is the radial frequency. In practice, however, I do not scale the result by 2π, and use frequency bin indices rather than radial frequency indices (as in Equation 3.7):

$$S_{\text{frame}}[n] = X[n]\, X^{*}[n] \qquad (3.7)$$
$S_{\text{frame}_i}$ is computed for each of the two frames extracted from each stroke, and the results are then concatenated, resulting in a 1024-length feature vector.
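One way to assemble this feature vector is sketched below, assuming 512 power values per frame; the thesis does not specify the exact bin layout, so the mirroring of the symmetric upper half is my assumption:

/* Sketch: build the 1024-point feature vector from the two stroke frames (Eq. 3.7). */
#include <fftw3.h>

#define N 512

/* Per-bin power X[k] X*[k]; the r2c transform stores only N/2+1 bins,
   so the upper half is mirrored here (conjugate symmetry of a real input). */
void frame_power(const fftw_complex *X, double *p /* N values */)
{
    for (int k = 0; k <= N / 2; k++)
        p[k] = X[k][0] * X[k][0] + X[k][1] * X[k][1];
    for (int k = N / 2 + 1; k < N; k++)
        p[k] = p[N - k];
}

void make_feature_vector(const fftw_complex *attack,
                         const fftw_complex *steady,
                         double *feat /* 2*N = 1024 values */)
{
    frame_power(attack, feat);      /* first half: attack frame */
    frame_power(steady, feat + N);  /* second half: steady-state frame */
}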
Dimensionality Reduction
The feature vector obtained in the previous section presents a problem for efficient
classification. Its length, 1024 elements, is too large in comparison with the dataset
(8 bols with 3 training examples each). This is a problem with reference to the curse
of dimensionality: the data density is too low within the high-dimensional feature
space (Bellman, 2003).
A solution to this problem is to reduce the number of dimensions (and thus elements) in the feature vector. I do so by selecting an orthogonal linear combination of the most relevant dimensions in the feature space (relevant in a representation sense, i.e., the dimensions with the largest variance, not necessarily in a classification sense). The feature vectors are then projected onto a new coordinate system. Used because of its relatively simple implementation (compared with dimensionality reduction techniques optimal for classification), PCA works well enough for classification purposes in this case according to my preliminary studies. PCA involves an eigenvalue decomposition. It does not have a fixed set of basis vectors: its basis vectors depend on the data set.
Figure 3-6 shows the PCA algorithm. Additional information and mathematical
derivations can be found in Duda et al. (2000).
In the current system, the algorithm's first five steps (one-time computations for each training set) are performed offline in Matlab and saved to a file, which is then used by the C program for real-time dimensionality reduction on the test data.
First, the training data is organized into an M × N matrix where the M training vectors are arranged as column vectors and the N rows correspond to the observed variables. In this case, I have M = 3 × 8 = 24 training data vectors. In the first algorithmic step, the empirical mean vector is computed (each vector element corresponds to the mean in a particular dimension). Then the mean vector is subtracted from each column of the observation matrix so that the training data is centered around the origin of the feature space. In the second step, the covariance matrix is computed from the centered observation matrix. At this point, PCA involves an eigenvalue decomposition. I use Matlab's numerical functions to compute the M eigenvectors and eigenvalues of the covariance matrix. Eigenvectors and eigenvalues are stored in their respective matrices and ordered according to decreasing eigenvalues while keeping the eigenvector-eigenvalue correspondence (same column number). Finally, a set of L basis vectors is chosen, starting from the eigenvectors with the largest associated eigenvalues (dimensions with the largest variance). The optimal value for L is discussed in the evaluation section (4.1). This concludes the offline selection of basis
vectors, which depend on the data set.

[Figure 3-6: the PCA algorithm: compute mean; compute covariance matrix; compute eigenvalues and eigenvectors; arrange them by decreasing eigenvalue; select basis vectors; project data onto the new basis.]

The last step, the projection of the data onto
the new basis vectors, is performed in real-time after the stroke segmentation phase,
whenever an onset is detected.
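The real-time projection step then reduces to a mean subtraction followed by a matrix-vector product; a sketch, where the matrix layout and names are my assumptions:

/* Sketch of the real-time PCA projection: center a feature vector and
   project it onto the L basis vectors computed offline in Matlab. */
#define DIM 1024  /* feature vector length */
#define L   8     /* retained principal components */

void pca_project(const double *feat,         /* DIM values */
                 const double *mean,         /* empirical mean, DIM values */
                 const double basis[L][DIM], /* one eigenvector per row */
                 double *reduced /* L values */)
{
    for (int j = 0; j < L; j++) {
        double acc = 0.0;
        for (int i = 0; i < DIM; i++)
            acc += (feat[i] - mean[i]) * basis[j][i]; /* dot with eigenvector j */
        reduced[j] = acc;
    }
}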
k-Nearest Neighbor
The k-nearest neighbor algorithm (kNN) classifies new strokes based on the distance between their feature vector and the stored, labeled feature vectors of the training dataset. kNN is an example of lazy learning where no model of the data is pre-established (specific instances of the data are compared) and computation happens at classification time. kNN is used here because of its simplicity and intuitive approach to classification, and in spite of its run-time computational complexity.
The k parameter indicates the number of neighbors the test data point is compared
with. A majority vote will select the output label among the k nearest neighbors. I
tried out different values for k (see Section 4.1). The distance measure used here is
the Euclidean distance (Equation 3.8).
$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \qquad (3.8)$$

where a and b are the points whose distance d is evaluated in n dimensions, and i is the dimension index.
In the software implementation, however, since we use the Euclidean distance for comparison purposes only, the square root is omitted. And given the small set of training data, the algorithm's computational complexity does not get in the way of real-time behavior.
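For reference, here is a self-contained sketch of this classifier with the sizes quoted above (k = 3, 24 stored examples of 8 dimensions each); it is my illustration rather than the actual implementation:

/* Sketch of the kNN classifier: squared Euclidean distance (the square root
   in Equation 3.8 is omitted, as noted above) and a majority vote over the
   K nearest stored training vectors. */
#define M        24  /* stored training vectors (8 bols x 3 examples) */
#define L        8   /* reduced feature dimensions */
#define K        3
#define NUM_BOLS 8

static double sq_dist(const double *a, const double *b)
{
    double d = 0.0;
    for (int i = 0; i < L; i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

/* train[m] is the m-th stored feature vector, label[m] its bol (0..7). */
int knn_classify(const double train[M][L], const int label[M], const double *x)
{
    int used[M] = {0}, votes[NUM_BOLS] = {0};

    for (int n = 0; n < K; n++) {        /* pick the K nearest, one by one */
        int best = -1;
        double best_d = 0.0;
        for (int m = 0; m < M; m++) {
            double d = sq_dist(train[m], x);
            if (!used[m] && (best < 0 || d < best_d)) {
                best = m;
                best_d = d;
            }
        }
        used[best] = 1;
        votes[label[best]]++;
    }

    int win = 0;                         /* majority vote */
    for (int c = 1; c < NUM_BOLS; c++)
        if (votes[c] > votes[win])
            win = c;
    return win;
}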
3.5 Tabla Phrase Prediction
As opposed to the previous tabla recognition problem, the tabla phrase prediction
problem is one of unsupervised learning. In this section, I present the issue along with
background information, propose a solution, and give an example that illustrates its
implementation.
Unsupervised learning here means that there is no explicit teacher to label each
class or category. The system dynamically forms groups of the input patterns during the system's lifetime (or during a rehearsal session). To solve the prediction problem we look at the literature on estimation. We are primarily concerned with an appropriate representation for the historical (input) data, and a suitable model for the a-priori knowledge and the algorithm.
Design Principles
In this section I mainly discuss the tabla phrase prediction engine. I first look at
the literature on rhythm perception and musical expectation. Then I briefly look at
technical solutions to the prediction problem. Finally I use the insights offered by
some of the findings to develop a model that will respond to the specificities of the
tabla.
Jackendoff and Lerdahl, in their 1996 landmark Generative Theory of Tonal Music (GTTM) for Western classical music, discuss the importance of grouping (i.e. music segmentation) and meter (i.e. the alternation of strong and weak beats) in the perception of musical structure. They acknowledge the fact that meter is culture-specific (for instance, syncopation is often avoided in Western music), explaining that small integer ratios are easier to process than complex ones. However, as we saw in
Section 2.2, Indian music makes liberal use of syncopation. They also talk about the
importance of expressive timing (i.e., timing that is slightly "off") in the value of interpretation.
Narmour (1999) talks about hierarchical expectation. He uses the idea of repetition to explain the notion of style in a way that shapes cognition, and enables us
to recognize elements of style, by knowing what to expect within a style that we are
familiar with.
Clarke (1999) provides a survey of research in rhythm and timing in music. In
particular he mentions the substantial work of Paul Fraisse in this area. Fraisse
makes a distinction between temporal phenomena under 5 seconds, which contribute
to the perception of time, and longer events, which lead us to reconstruct temporal
estimates from the information stored in memory, and which also contribute to our
sense of expectation. In fact, he goes further into our perception of time, and
explains that events that occur approximately within 400 to 600 ms of each other
(relative intervals between events) lead to our sense of grouping (although, according
to Povel and Okkerman, other characteristics like amplitude and pitch may contribute
to the phenomenon of auditory grouping). On the other hand, long durations (above
600 ms) make us aware of the passage of time. Finally, Fraisse also ties rhythm
perception with motor functioning (the fact that we literally tap the beat).
Snyder (2000) also categorizes events of different durations: under 1/32 s, we have event fusion; between 1/16 s and 8 s, we experience melodic and rhythmic grouping;
and above 8 s, we perceive the form. Further, he defines some common terms like
beat (a single event in time, an onset, or equally spaced temporal units), pulse (a
series of beats that occur at the tempo), and accent (the weight or quality of each
beat: strong or weak). Snyder emphasizes the importance of metrical hierarchy and
the smaller subdivisions of beats within the tempo. He also indicates that the meter
can be seen as a temporal schema that requires closure: we need to know where
the downbeat is (from context or from means other than temporal). In fact, my user
experiments (see Section 4.2) showed that the tactus is particularly difficult to catch
in Indian music, especially when the context (e.g. the tala, the first downbeat) is
not clear.
These studies address music perception and cognition, some based on studies of
the brain, but the majority are based on Western classical music. It is therefore noteworthy that a few relatively recent studies deal specifically with Indian music
(see Clayton (2000) and Berg (2004)).
Most of the studies described here emphasize the importance of hierarchical structure in the perception of rhythm, and the fact that intervals between events are of utmost importance in distinguishing individual events (fusion) from grouping (e.g. rhythmic phrases) and form (or composition-level structure).
On the technical side, Russell and Norvig (1995) present various approaches to
learning in the context of artificial intelligence. In particular they mention the following three key elements:
the components to be learned,
the feedback available to learn those components, and
the representation used for components.
Russell and Norvig describe unsupervised learning as a type of learning which has
no a-priori output. To be effective, an unsupervised learning scheme is required to
find meaningful regularity in the data. They also advocate an approach that combines prior knowledge about the world with newly acquired (or current) knowledge.
In practical systems, the problem of prediction has often been solved with Kalman
filters, which are like hidden Markov models (HMM) except that the hidden state variables are continuous instead of being discrete. HMMs are statistical models that are
considered to be the simplest dynamic Bayesian network (DBN). They are widely used
in speech recognition and other time-varying pattern recognition problems. During
the training phase, hidden (model) parameters are estimated from observable parameters (e.g. the sequence of phonemes that make up a certain word). The model
parameters are then used for analysis during the pattern recognition phase (see Rabiner (1989) for a speech recognition application example).
The Kalman filter is a recursive filter that estimates the state of a dynamic system from partial information. The next state (in discrete time) is predicted from the
previous state and control parameters. The filter generates the visible output from
the hidden state.
Although Kalman filtering and hidden Markov models are powerful tools, I chose
another mechanism for the TablaNet prediction engine. For one, Kalman filters operate on continuous hidden state variables, whereas the sequence of tabla strokes that
produce rhythmic phrases are discrete events. As for HMMs, the model is trained on
a defined set of sequences (or words, or phrases). Moreover, integrating new instances within the model is too time-consuming for real-time behavior.
Although the prediction problem seemed particularly hard to address at first, I
was inspired by the relative simplicity and the promising results of Chafe's statistical model for the prediction of solo piano performance (1997). This gave me hope.
I would like to remind the reader here of the model developed by Bel in his Bol
Processor (see Section 2.3). Based on a textual representation of bol sequences along with grammar rules for improvisation, Bel's approach inspired me to develop an alternative model for phrase prediction.
My solution resides in dynamically building a generative grammar. The sequence
of recognized bols (labels) is sent from the Transmitter to the Receiver as ASCII
characters (the eight bols are represented by characters a to h). These characters are
contained within a structure which also contains the relative timing of the current
stroke in relation to the previous stroke. The Receiver stores the string sequence as
an historical account of the performance and, if applicable, generates a new grammar
rule (or updates an existing one) to predict the most likely next stroke the next time
around. A string matching algorithm runs for various string lengths to account for
the hierarchical structure of rhythmic patterns. Additionally, timing and bol prediction are constrained by the output of a decision tree that contains a-priori knowledge
about tabla performance. (The decision tree is not currently implemented.) Although simple, this approach offers a convenient way to evaluate my hypotheses about the TablaNet system.
I tried in this preliminary design to keep the constraints to a minimum. For instance, there is no higher-level beat tracking or tempo estimation. One reason for
this is that these problems are even more complicated for Indian music than they are
for Western music because of the pervasive use of syncopation, among other things.
Therefore the system trains itself with the previous events in an ad-hoc fashion.
If the previous description seems somewhat abstract, the following section describes the algorithm; it is followed by an example that illustrates it.
Algorithm Implementation
The tabla phrase prediction algorithm is a non-metric method that makes use of a
simplified formal grammar with production rules and grammatical inference. For
more information on non-metric methods, please refer to Duda et al. (2000).
[Figure: phrase prediction pipeline. The input event structure feeds the grammar rule generator, which, together with a priori knowledge, generates the output bol and the output timing; the next stroke is then synthesized into the audio output buffer.]
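One simple way to realize the string-matching scheme described above is to predict, for the longest suffix of the history that has occurred before, the symbol that followed its earlier occurrence. The following sketch is my interpretation; the thesis does not spell out the exact rule store, and the decision-tree constraint is omitted here (as it is in the current system):

/* Sketch of the grammar-based predictor: find the longest earlier occurrence
   of the current context and predict the symbol that followed it. */
#include <string.h>

#define MAX_CTX 16  /* longest context considered (one Tintal cycle) */

/* history: all bols received so far ('a'..'h'); len: its length.
   Returns the predicted next bol, or 'X' if no rule applies yet. */
char predict_next(const char *history, int len)
{
    int max_ctx = (len < MAX_CTX ? len : MAX_CTX);
    for (int ctx = max_ctx; ctx > 0; ctx--) {
        const char *suffix = history + len - ctx;
        /* scan earlier positions for the same context, longest first */
        for (int pos = len - ctx - 1; pos >= 0; pos--) {
            if (memcmp(history + pos, suffix, ctx) == 0)
                return history[pos + ctx]; /* rule: context -> next symbol */
        }
    }
    return 'X'; /* unknown: no matching production rule yet */
}

For instance, after the history b a a b b a a, the longest repeated suffix is b a a (seen at the start), so the sketch predicts b; if the actual input turns out to be c, the history simply grows, and later matches prefer the longer context b b a a, which now distinguishes the two continuations.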
Example
This section presents an example (based on Tintal) for the generative grammar. The
X represents a random or unknown stroke.
[Table: steps 1 through 19 of the generative grammar example. For each step, the table lists the input sequence of strokes received so far (symbols a to d), the output sequence produced by the engine (X marks strokes that could not yet be predicted), and the current set of production rules (for example b -> a, b a -> a, b b a a -> b, b b a a -> c). Rules are created when a new context is observed and removed or replaced when the input contradicts them; as the rule set grows, the output converges on the input (for instance, input b a a b b a a yields output X X X X b a a).]
This example shows the output of the tabla phrase prediction engine for a Tintal
(16-beat) input. Slightly after the beginning of the second cycle (beat 18), the model
has a sufficient set of production rules to predict accurately the next set of strokes.
If the input changes, the algorithm will diverge for a while until it builds a new set
of rules or updates the existing ones.
Chapter 4
Evaluation
The system has been evaluated on the following criteria:
tabla stroke recognition rate, and comparison with existing systems,
tabla phrase prediction rate, and
output realism and audio quality, as rated by performers and listeners in a statistical perceptual assessment.
4.1 Quantitative Analysis
This section deals with objective error rates by comparing the actual algorithm outcome with the expected outcome for the recognition task, and then for the prediction
task.
[Figures: evidence curves (discrete strokes) for varying parameter values (the sweeps cover, among others, values from 200 to 2000); the vertical axes show recognition rates between 50% and 100%.]
Figure 4-3: Evidence curve (discrete strokes) for varying number of dimensions (PCA). [Plot: recognition rate, 50-100%, for 6 to 16 PCA dimensions.]
The set of parameters that were selected for the algorithm for further user tests
were the following: k = 3, N = 512, and number of dimensions (PCA) = 8. The
stroke recognition algorithm with these parameters was applied to a subset of the
data recorded from four users (among the eight who participated in the study). The
four sets of user data were selected manually for their characteristics (in particular
their clean and consistent strokes). The plots presented above represent the averaged values over these four data sets (each data set contains one training set and two testing sets of the same sequence of strokes). The best result using the parameters mentioned above was 92% (a peak at 95.5% was observed with N = 1024).
The previous tests were performed using audio data captured with a microphone.
Since the later user tests were performed with piezoelectric sensors, I wanted to evaluate the impact of using contact microphones on the recognition rate. The same set
of four users mentioned previously were asked to play the same sequence of strokes
on a tabla set fitted with the sensors. The data was recorded as a WAV file in the
same way as had been done with the microphone input. The recognition rate with
the vibration sensors (non-laminated) was 90.6%, slightly lower than the microphone
input, but nevertheless comparable. An evaluation with laminated sensors was not
performed with the same set of users; instead, I ran the test on data that I collected by playing myself. The recognition rate (89.3%), although not obtained in the same
way as in the previous controlled study, gives a rough performance comparison between the two types of sensors.
Table 4.1: Discrete stroke recognition rate (%) by player level.

            Beginner   Intermediate   Advanced
Rate (%)        87.5           90.6       91.7

Table 4.2: Confusion matrix (%) for discrete strokes; rows give the ground truth bol, columns the recognized bol.

        Ta   Tin   Tun   Te   Ke   Ge   Dha   Dhin
Ta      83     0    17    0    0    0     0      0
Tin      0    50    33    0    0    0     0     17
Tun      0    17    83    0    0    0     0      0
Te       0     0     0   67   33    0     0      0
Ke       0    17     0   17   66    0     0      0
Ge       0     0     0    0    0   87    13      0
Dha      0     0     0    0    0   17    50     33
Dhin     0     0     0    0    0   17    17     66
Finally, I tested the discrete recognition rate across three sets of users: beginners, intermediate, and advanced players. I classified each user in one category after evaluating their tabla playing skills (see Section 4.2). The results are reported in Table
4.1.
The discrete stroke recognition rate for the advanced player is the highest, which is predictable because of the consistency of the strokes between the training session
and the test sessions. Then come the recognition rates of the intermediate players
and the beginner players.
Once the studies with discrete tabla strokes had been performed, I tested the system on continuous tabla phrases (on two cycles of Tintal as mentioned previously).
This study was performed only with the advanced tabla player. Recognition results
(with the same training sequence of discrete strokes as previously) reached 87.5%.
This raw result may not be significant because of the small sample size, but it gave
me the confidence that the system performed almost as well with continuous tabla
phrases.
To evaluate the performance of my system, I compare it here with results achieved by other researchers whose work has already been introduced in Section 2.3. Gillet and Richard (2003) report recognition results of up to 85.6% using 5-NN on a database of 1821 strokes. The better results demonstrated by my method can be explained by three factors: the more sophisticated feature vectors extracted from the input data, the limited set of tabla strokes considered, and the much smaller set of testing data as compared with Gillet and Richard's database. Their best results are obtained using
an HMM model (93.4%). Chordia (2005) reports a recognition accuracy of up to 93%
with neural nets on an even larger data set.
Table 4.2 describes the machine recognition accuracy by indicating the correspondence between the ground truth (the strokes that the players were asked to play) on each row, and the recognized strokes, after player-dependent training and recognition, in each column. The point here is to compare the recognition algorithm with the performance of a human listener (see Section 4.2 for human perceptual results). It is interesting to note that much of the confusion happens within classes of strokes (e.g. Dha and Dhin are both resonant bols with the dayan and bayan playing at the same time; similarly, Ta, Tin, and Tun sound alike when played out of context; and Te and Ke are both closed bols that sound very similar, although Te is played on the dayan and Ke on the bayan).
The bol recognition algorithm can be improved by taking the context into account
(e.g. language modeling as described in Gillet and Richard (2003)). There could also
be a feedback loop between the recognition and the prediction engine to make sure
that the recognized bol falls within a category of legal bols based on the preceding
bols (I don't propose to take following bols into account, to avoid causality issues). However, the system models a 10% recognition error in the prediction engine, which makes sure that the tabla phrase fits within the constraints of tabla grammar.
[Figure: phrase prediction rate (40-100%) versus the number of strokes heard (10 to 30).]
In the current scenario, the algorithm is expected to predict the next single stroke.
However, consider extreme cases: supposing the tabla player plays four strokes per beat at 80 beats per minute, each stroke lasts approximately 187 ms. If the tabla players play combination beats (a rapid succession of four bols), each stroke lasts less than 50 ms. High-latency networks over long distances (where algorithmic delay might be negligible compared to the latency of packet switching) imply that the algorithm should be able to predict several strokes in advance. In this situation, I expect the current algorithm to break down much more rapidly, as the error rate struggles to decrease.
One way to improve the prediction output realism would be to increase the number of tabla playing rules that the output should abide by, while limiting the historical
input data to constrain the algorithmic delay. Even though this method would not
decrease the error rate, a set of well-designed (i.e. musically informed) constraints
would ensure that the system performs more like a human player and less like a sequencer. In any case, even though the historical data helps decrease the error rate,
its main purpose is to convey a certain style of playing that emerges through the
balance between variety and repetition.
4.2 Qualitative Experiments
This section describes subjective user tests involving tabla players and trained listeners, and the results of these tests.
Method
This section gives an overview of the test procedure (shown in detail in Section A.2)
and the data set.
Eight subjects participated in the study. Most of them were from the Boston area,
including MIT and Harvard. There is a relatively large number of tabla players in this area; however, most of them were away on vacation during the evaluation period. Many expressed interest in this project and asked whether I would be conducting studies again during the academic year.
Subjects were recruited by e-mail (see copy of the advertisement in Section A.1)
and were given gift coupons for their participation. Needless to say, the participants
in the study had a favorable frame of mind towards the project.
The detailed study protocol is presented in Section A.2. I discuss here the rationale behind the study.
After asking the users to rate themselves as tabla players (beginner, intermediate,
or advanced), I subjected them to a series of tabla playing exercises so that I could
rate all of them on a common basis (i.e. technique (cleanliness and consistency of
their strokes, etc.), knowledge of various talas). Following this, in the first part of the
study, I asked each one of them to play some sequences of bols and thekas to train
the recognition algorithm and evaluate its performance (both with discrete bols, and
with continuous phrases). In the second part of the study, after a short break, I asked
them to play the role of an audience member and answer some questions based on
what they heard so that I could evaluate their response to the tabla prediction engine.
The tests included what I like to call a "Musical Turing test" where participants
were asked to distinguish a rhythmic sequence produced by a human player from a
sequence generated by the computer.
In this Turing test, each rhythmic phrase presented to the user is chosen randomly
among the following possibilities:
a digital audio recording of a real tabla performance,
phrases generated using a sequencing software and tabla sound synthesizer,
phrases resulting from a recorded input to the recognizer which triggers tabla
samples, and
phrases generated from an input to the recognizer followed by the prediction
engine output.
For the purpose of this evaluation, the TablaNet system had limited functionality. The evaluation was performed under constrained conditions:
medium tabla playing speed,
unidirectional system with no network transmission/reception,
no pitch consideration, no background tanpura (drone instrument), although this could make the session much more interesting and reminiscent of a real-world situation,
microphone instead of sensors.
Since the participants were informed listeners (trained in playing the tabla), I asked them not only to evaluate the flow and naturalness of the sequences of strokes (e.g. variety,
quantization), but also the quality of the audio output. The user study combined
tests in a laboratory setting (playing or listening to bols out of context), as well as
tests that were conducted in a setting propitious to musical exchange.
Results
Based on my analysis of the subjects playing skills, I had the following distribution
of tabla players: four beginners, three intermediate players, one advanced player.
Questionnaire responses are included as an appendix (Section A.3) to my master's
thesis. This section compiles some of the study results that are available in their raw
form in the appendix.
To evaluate the confusion matrix for machine recognition presented in the previous section against human perception, I asked the users to name discrete strokes that
I played to them (blind test). I do not present the results in the form of a confusion
matrix here, but I highlight my findings. Most listeners, including advanced players,
had a difficult time distinguishing between Te, Re, and Ke. Although they correspond to different musical gestures, these three bols do sound very similar. Intermediate players (and beginners) also had difficulties between Dha and Dhin, and Tun and
Tin. In addition, beginners were sometimes confused between Tin and Te, Ge and
Dhin, Tun and Ta, or Dha and Dhin. These observations show the importance of
experience in ear training, but also the importance of context to help in recognizing
strokes. Halfway through the session, subjects underwent a short training where they were told which audio stream (bol sound) corresponded to which label. Then, when presented with a sequence of bols (as opposed to individual bols), most players (including some beginners) could identify the bols with very few mistakes. In particular, the
compound bol TeReKeTe could be easily identified (because of the musical expectation associated with it, and the minute difference in spectrum between each stroke),
whereas when taken individually, the bols Te, Re and Ke were difficult to distinguish
one from another.
To evaluate the TablaNet audio quality, I asked users to participate in a blind test. I played tabla phrases using either synthesized tabla sounds from a tabla software sound synthesizer, or tabla samples that I collected and assembled into meaningful rhythmic phrases using the TablaNet software, and asked them to rate the ones they preferred. With the exception of one user, they all chose the TablaNet output.
Another experiment consisted of asking users to identify the number of beats, and
if possible, the name of the tala and the sequence of bols from a rhythmic phrase that
was played to them. This task was particularly difficult for beginners. Intermediate
players could distinguish between open and closed bols, but were confused as to their
specific names. The advanced player was able to perform the task perfectly, probably
operating with some kind of template matching since there was no additional context
that was presented to her.
When asked to predict the next bol halfway through a rhythmic sequence, again
out of context, beginners had a very hard time keeping the beat while simultaneously listening to the bols and extracting the phrase's structure. Intermediate players tended to tap the beat with every stroke instead of grouping them, but some did predict the next bol correctly. In fact, some would guess the phrase's structure correctly and would predict the repetition of a pattern correctly, but would not be able to guess
if there was a variation in the structure. The advanced player, however, was able to
perform this task brilliantly, not only predicting the next stroke, but also guessing the
correct number of beats and the tala from limited information (half a rhythmic structure). Some beginners were able to do the same after three cycles or more were played.
Analysis
This section discusses the qualitative results presented previously.
The confusion matrix presented in Section 4.1 matches the recognition capability of most human tabla players, suggesting that the spectrum is only part of the
information that is used by the human auditory system to recognize tabla strokes.
Humans as well as machines seem to be able to distinguish mostly between categories
of bols (closed, open or resonant) rather than specific bols, while the cognitive system
and upper-level processes that get honed with experience and continued exposure to
tabla sounds and rhythms extract specific bol values from contextual information and
from the relative differences of consecutive bols (in spectrum and timing).
It was interesting to note that in many instances, when playing a sequence of bols,
beginner players had some difficulty in producing a clean, steady sound. It took them
one or two more strokes to feel relaxed and confident enough with their technique.
This affected the recognition model because of the lack of consistency in the sound
of certain strokes depending on where they were played in the sequence. A specific
stroke recognition algorithm could take this information into account and modify its
stroke recognition model based on the position of the stroke.
The findings in this study highlight the importance of active listening in tabla
and rhythmic (or musical) education in general. Players with more experience benefited from a body of knowledge that shaped their expectation. In this context of
education, an automatic tabla transcription system which displays bol names along
with their audio counterpart in a performance can prove to be a useful tool for active
listening and learning.
It is noteworthy that beginners had a difficult time with synthesized sounds, not
being able to name them as accurately as sampled or recorded strokes. Probably for this reason, beginners preferred sampled sounds to synthesized sounds more markedly than experienced tabla players did.
Finally, as a comparative rating of the stroke recognition system, it is useful to
note that it performed best with intermediate players. Beginner players played some
strokes inconsistently so the sound would vary between various instances, including
playing combined strokes (left hand with the right hand, like Dha) with a slight asynchrony, resulting in two onsets instead of one. Advanced players, on the other hand,
had the tendency to embellish their playing with ornamentations that threw the recognition and prediction engines off guard. This is definitely one aspect that I had
not taken into account while designing the system.
Overall, the user studies were one of the most enjoyable and educational parts of this project.
In the future, I would recommend doing user tests much earlier in the process to be
able to incorporate their feedback in the system design.
4.3 Discussion
This section addresses how the results of the quantitative and qualitative evaluations
support the hypotheses of this research.
As a reminder, the main hypotheses of this research are:
1. playing on a predictive system with another musician located across the network
is experientially, if not perceptually, similar to playing with another musician
located in the same room in that it provides as much satisfaction to the
musicians and the audience;
Chapter 5
Conclusion
5.1 Summary
This thesis presents a novel way to overcome latency on computer networks for the
purpose of musical collaboration by predicting musical events before they actually occur on the network. I achieved this by developing a system, called TablaNet, that is
tuned to a particular instrument and musical style, in this case the Indian tabla drum.
I wrote software that recognizes drum strokes at one end, sends symbols representing
the recognized stroke and its timing over the network, and predicts and synthesizes
the next stroke based on previous events at the other end. How do the main hypotheses that drove this work fare with regard to the evaluation results?
1. Playing on a predictive system with another musician located across the network
is experientially, if not perceptually, similar to playing with another musician
located in the same room in that it provides as much satisfaction to the
musicians and the audience;
2. a recognition- and prediction-based model provides an adequate representation
of a musical interaction; and
3. a real-time networked system suggests new means of collaboration in the areas of distance education, real-world and virtual-world interactions, and online
entertainment.
The work described in this thesis led to the following contributions:
I implemented a novel approach for real-time online musical collaboration,
enabled a real-world musical interaction between two tabla musicians over a
computer network,
designed a networked tabla performance system,
5.2 Applications
With a system such as TablaNet, we can imagine that it would be possible for an
instructor living in a city to teach music to children in villages who may not have
regular access to a teacher otherwise.
It is a well-known problem in India and other countries with a large rural-urban
divide that it is often difficult to find instruction for music and the arts, or even some
more practical skills, beyond those of the local tradition. This applies as much to
villages as it does to cities: classical musical instruments may be difficult to come by
in rural areas, while cities may have limited access to folk culture from villages even
nearby. In fact, this applies even more to areas with different musical traditions (e.g.
between the North and the South of India; or what if I live in the middle of Iowa and
want to learn the mridangam, a South Indian percussion instrument?). Although
it is absolutely necessary to preserve local traditions, one way to keep them alive is
in fact to make them live and evolve. With people being increasingly mobile and
connected, even in rural areas (especially in countries like India), communication
services over data networks are becoming ever more relevant, both socially and culturally. Therefore they can be counted upon as a possible means to sustain indigenous
artistic traditions.
With Western economic hegemony (villages in India may not have running potable
water but most of them have a Coca Cola stand) permeating into the cultural realm
(the influence of MTV mentioned in Section 1.1), it is often feared that local traditions are in danger of becoming extinct if not preserved and perpetuated through
teaching and practice while keeping up with modernization. As shown in Section 2.2,
Indian music is primarily based on an oral tradition, so a system such as TablaNet,
even if it cannot completely replace a live teacher-student interaction, can at least
substitute some in-between sessions with online rehearsals. In addition, local traditions may benefit from increased exposure to different, more complex, maybe more
sophisticated, musical styles or practice elements (e.g. the use of more elaborate talas
in contemporary classical Indian music, rather than the ubiquitous Tintal) by evolving
and therefore staying alive.
The $100 laptop developed by One Laptop Per Child (OLPC) seems like an ideal
platform to implement and distribute the TablaNet system: the built-in wireless mesh
network capability can enable children to play music with each other at a distance
through an ad-hoc network as well as through the Internet.
Apart from distance education, preliminary testing also shows that the system
is suitable for distributed online jamming, especially in the context of musical
call-and-response interactions, which are very well established in Indian music, and
therefore carry relevance in entertainment (two amateur musicians wanting to jam
together from their respective homes, or a live network music performance between
musicians over long distances). Collaborations never heard before are thereby made
possible.
5.3 Future Work
A disclaimer should be made here regarding my statement in Section 1.2 about visual interaction between musicians. It must be noted that a traditional Indian music performance does not actually rely only on auditory cues. For instance, audience members display their appreciation using non-verbal speech sounds and body gestures (specifically
hand and head movements). And musicians sharing the stage do glance at each other on occasion. However, I suggest that these modalities do not contribute so much
to musician synchronization as to an excitement factor resulting from a live stage
performance. Therefore, I maintain that the study conducted in this thesis, which
deals mainly with the synchronization aspect of networked music collaboration, does
not suffer from leaving other modalities out. Nevertheless it would be interesting to
research the role of other communication channels in a musical collaboration, including the role of audience interaction, and study how they can be transmitted to distant
protagonists.
In fact, it had been suggested that, as a preliminary study, I investigate the role of visual contact between musicians, in particular between tabla players, either in the
context of education, or during a rehearsal or performance. Time constraints did not
allow me to perform the study, but it would be rather interesting to conduct such
an experiment, without any latency involved, by placing a screen between two tabla
players and asking them to play together, and then compare their experience with one where they would normally be facing each other.
Further work on the TablaNet system itself includes the implementation of the
network message passing protocol and a graphical user interface that would allow
users to run the system on their own. Additionally, instead of using sample playback
for the sound synthesis engine, a physical model (waveguide synthesis) of a tabla
drumhead would be a significant improvement. This would enable a fine level of
control mimicking playing techniques involving, for instance, pitch slides. Also, the
current model does not account for differences in tuning. The stroke recognizer is
trained for particular tabla sounds tuned to a particular pitch. And by playing back
pre-recorded samples, the system at the output does not account for different pitches
at the input. If not using a physical model for sound synthesis, an intermediate solution would be to add a pitch tracker at the input, and provide a phase-vocoder-based
harmonizer at the output to tune the sample output to the required pitch. In fact, I experimented with this in Csound (it was easy to program and it worked great), but
I did not implement it in the current version of the software.
Various improvements in the recognition and prediction algorithms could possibly
be achieved by developing a machine listening model based more closely on a human
auditory perception model rather than statistical machine learning. For instance, by
using a Constant-Q Transform (CQT), as proposed by Brown (1990), the recognizer
may access acoustic information relevant to the human ear more faithfully
than by using a Fast Fourier Transform (FFT). Similarly, a beat tracking algorithm
based on the findings of Povel and Okkerman (1981) could provide more accurate tempo estimation.
Finally, I propose to extend the system by enabling more than two tabla players
to play together. An Internet-based percussion ensemble! It would also be interesting
to support other types of musical instruments, in particular melodic ones. This could
lead to a true Indian musical performance where the tabla accompanies a distant
vocalist or a solo instrumentalist.
I hope that this thesis will provide a foundation for researchers who wish to extend the principles presented here to other instruments and musical styles. It
was my wish to document the user studies by producing video segments to illustrate
various usage scenarios of the system in action (e.g. rhythmic accompaniment, call
and response). Unfortunately, lack of time prevented me from doing so, but I am
confident that I will be able to attend to this in the near future.
Appendix A
Experimental Study
A.1
Study Approval
The following pages contain the forms submitted to the MIT Committee On the Use
of Humans as Experimental Subjects (COUHES), as well as their approval notices.
A.2
Study Protocol
This section contains the protocol documentation used during the subjective evaluation of the TablaNet system.
A.3
Questionnaire Responses
This section contains the anonymous responses to the questionnaire of the users who
participated in the study.
Bibliography
audiofabric, 2007. URL http://www.audiofabric.com.
digitalmusician.net, 2006. URL http://www.digitalmusician.net/.
ejamming, 2007. URL http://www.ejamming.com/.
indabamusic, 2007. URL http://www.indabamusic.com/.
Jamglue, 2007. URL http://www.jamglue.com/.
Lightspeed Audio Labs, 2007. URL http://www.lightspeedaudiolabs.com/.
Lemma. URL http://visualmusic.org.
Max/msp. URL http://www.cycling74.com/products/maxmsp.
Musicolab, 2003. URL http://www.reggieband.com/musicolab/.
Rocketears, May 2004. URL http://www.jamwith.us/.
splice, 2007. URL http://www.splicemusic.com/.
Swarshala, 2007. URL http://www.swarsystems.com/SwarShala/.
Taalmala, 2005. URL http://taalmala.com/.
Vstunnel, 2005. URL http://www.vstunnel.com/en/.
P. Allen and R. Dannenberg. Tracking musical beats in real time. Proceedings of the 1990 International Computer Music Conference, pages 140-143, 1990.

A. Barbosa. Displaced Soundscapes: A Survey of Network Systems for Music and Sonic Art Creation. Leonardo Music Journal, 13(1):53-59, 2003.

A. Barbosa and M. Kaltenbrunner. Public sound objects: a shared musical space on the web. Web Delivering of Music, 2002. WEDELMUSIC 2002. Proceedings. Second International Conference on, pages 9-16, 2002.
R. Bargar, S. Church, A. Fukuda, J. Grunke, D. Keislar, B. Moses, B. Novak, B. Pennycook, Z. Settel, J. Strawn, et al. AES white paper: Networking audio and music
using Internet2 and next-generation Internet capabilities. Technical report, AES:
Audio Engineering Society, 1998.
A. Khan and G. Ruckert. The Classical Music of North India: The Music of the
Baba Allauddin Gharana as taught by Ali Akbar Khan, Volume 1. East Bay Books,
distributed by MMB music, Saint Louis, Missouri, 1991.
J. Kippen and B. Bel. Modelling Music with Grammars: Formal Language Representation in the Bol Processor. Computer Representations and Models in Music, Ac. Press ltd, pages 207-232, 1992.

J. Kippen and B. Bel. Computers, Composition and the Challenge of New Music in Modern India. Leonardo Music Journal, 4:79-84, 1994.
F. Kon and F. Iazzetta. Internet music: Dream or (virtual) reality. Proceedings of
the 5th Brazilian Symposium on Computer Music, 1998.
I. Kondo, K. Kojima, and S. Ueshima. A study of distributed jam session via content aggregation. Web Delivering of Music, 2004. WEDELMUSIC 2004. Proceedings of the Fourth International Conference on, pages 15-22, 2004.
D. Konstantas. A Telepresence Environment for the Organization of Distributed
Musical Rehearsals. Objects at Large edited by D. Tsichritzis, Technical report of
the University of Geneva, 1997.
C. Latta. Notes from the NetJam Project. Leonardo Music Journal, 1(1):103-105, 1991.
J. Lazzaro and J. Wawrzynek. A case for network musical performance. In Proceedings of the 11th international workshop on Network and operating systems support for digital audio and video, pages 157-166. ACM Press, New York, NY, USA, 2001. URL http://www.cs.berkeley.edu/~lazzaro/nmp/.
M. Lefford. Recording Studios Without Walls. Master's thesis, Massachusetts Institute of Technology, 2000.

T. Maki-Patola. Musical Effects of Latency. Suomen Musiikintutkijoiden, 9:82-85, 2005.
T. Maki-Patola and P. Hamalainen. Effect of Latency on Playing Accuracy of Two Gesture Controlled Continuous Sound Instruments Without Tactile Feedback. Proc. Conf. on Digital Audio Effects, Naples, Italy, Oct 2004a.

T. Maki-Patola and P. Hamalainen. Latency Tolerance for Gesture Controlled Continuous Sound Instrument Without Tactile Feedback. Proc. International Computer Music Conference (ICMC), pages 1-5, 2004b.
S. Malu and A. Siddharthan. Acoustics of the Indian Drum. arXiv preprint math-ph/0001030, 2000.

Y. Nagashima, T. Hara, T. Kimura, and Y. Nishibori. GDS (Global Delayed Session) Music. Proceedings of the ICMC 2003 (boundaryless music), pages 291-294, 2003.
Townley. Rocket network, March 2000. URL http://smw.internet.com/audio/reviews/rocket/.
B. Vercoe. Erasing the Digital Divide: Putting your Best Idea on the $100 Laptop. Keynote lecture, WORLDCOMP'06, Las Vegas, June 2006.
G. Weinberg. Interconnected Musical Networks: Bringing Expression and Thoughtfulness to Collaborative Music Making. PhD thesis, Massachusetts Institute of Technology, 2001.

G. Weinberg. The Aesthetics, History, and Future Challenges of Interconnected Music Networks. In Proceedings of the 2002 Computer Music Conference, 2002.

G. Weinberg. Interconnected Musical Networks: Toward a Theoretical Framework. Computer Music Journal, 29(2):23-39, 2005a.

G. Weinberg. Local Performance Networks: musical interdependency through gestures and controllers. Organised Sound, 10(3):255-265, 2005b.
M. Wright and D. Wessel. An Improvisation Environment for Generating Rhythmic
Structures Based on North Indian Tal Patterns. International Computer Music
Conference, Ann Arbor, Michigan, 1998.
M. Wright, A. Freed, and A. Momeni. Open sound control: State of the art 2003. In Proceedings of the New Interfaces for Musical Expression Conference, pages 153-159, Montreal, 2003.
A. Xu, W. Woszczyk, Z. Settel, B. Pennycook, R. Rowe, P. Galanter, J. Bary, G. Martin, J. Corey, and J. Cooperstock. Real-Time Streaming of Multichannel Audio
Data over Internet. Journal of the Audio Engineering Society, 48(7/8), 2000.
M. Yoshida, Y. Obu, and T. Yonekura. A Protocol For Remote Musical Session with Fluctuated Tempo. Proceedings of the 2004 International Conference on Cyberworlds (CW'04), pages 87-93, 2004.

M. Yoshida, Y. Obu, and T. Yonekura. A Protocol for Real-Time Remote Musical Session. IEICE Transactions on Information and Systems, 88(5):919-925, 2005.
J. Young and I. Fujinaga. Piano master classes via the Internet. Proceedings of the 1999 International Computer Music Conference, pages 135-137, 1999.