Neural Network Based Systems for Computer-Aided Musical Composition: Supervised vs. Unsupervised Learning

Débora C. Corrêa
Universidade Federal de São Carlos
São Carlos, SP, Brasil
+55-16-3351-8579
debora correa@dc.ufscar.br

José H. Saito
Universidade Federal de São Carlos
São Carlos, SP, Brasil
+55-16-3351-8579
saito@dc.ufscar.br

João F. Mari
Universidade Federal de São Carlos
São Carlos, SP, Brasil
+55-16-3351-8579
joão mari@dc.ufscar.br

Alexandre L. M. Levada
Universidade de São Paulo, Instituto de Física de São Carlos
São Carlos, SP, Brasil
+55-16-3351-8579
alexandreluis@ursa.ifsc.usp.br
ABSTRACT
This ongoing project describes neural network applications for aiding musical composition, using natural landscape contours as inspiration. We propose supervised and unsupervised learning approaches, using Back-Propagation Through Time (BPTT) and Self-Organizing Map (SOM) neural networks. In the supervised approach, the network learns aspects of musical structure from measure examples taken from the melodies of the training set, and uses the learned measures to compose new melodies with data extracted from landscape contours as input. In the unsupervised approach, the network likewise uses measure examples as input during training and the data extracted from landscape contours in the composition stage. The results obtained show the viability of both approaches.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning – Connectionism and Neural Nets.
J.5.6 [Arts and Humanities]: Performing Arts.

General Terms
Algorithms, Experimentation, Human Factors.

Keywords
Artificial neural networks, self-organizing maps, back-propagation, musical composition, learning.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'08, March 16-20, 2008, Fortaleza, Ceará, Brazil.
Copyright 2008 ACM 978-1-59593-753-7/08/0003…$5.00.

1. INTRODUCTION
Musical computation, including reproduction and generation, has long attracted researchers. Interest in musical composition by computers dates back to the 1950s, when Markov chains were used to generate melodies [5]. Since music students often learn composition by example, early approaches were motivated by and based on pattern analysis of existing music. More recently, artificial neural networks (ANNs) have been deployed as models for learning musical processes [2] [3] [4] [5] [9] [10] [11].

ANNs, also known as connectionist systems, represent a non-algorithmic form of computation inspired by the structure and processing of the human brain. In this computing approach, computation is performed by a set of many simple processing units, the neurons, connected in a network and acting in parallel. The neurons are connected by weights, which store the network's knowledge. To represent a desired solution to a problem, ANNs undergo a training (learning) stage, which consists of presenting a set of examples (the training dataset) to the network so that it can extract the features necessary to represent the given information [1] [7] [8] [12]. This process can be divided into two groups: supervised and unsupervised learning [1] [7] [8] [12].

Supervised learning is performed by input-output mapping, i.e., the input patterns and the corresponding desired outputs are given to the network by a supervisor. The goal of this learning approach is to adjust the network parameters until the network answers correctly for every input pattern in the training set. After the training stage, the network can generalize what was learned; this occurs when the network produces suitable outputs for inputs that were not included in training [7] [8] [12].

In unsupervised learning, the exact numerical output the network is supposed to produce is unknown; only the input patterns are available. The network learns to establish statistical regularities from the input patterns and develops the ability to form internal representations that encode input features and create new groups. Since no teacher or supervisor is present, the network must organize itself to associate clusters with units [7] [8] [12].

For musical computation, connectionist systems, like other machine-learning systems, are able to learn patterns and features present in the melodies of the training set and generalize them to compose new melodies. The use of neural networks in music learning and composition has therefore attracted researchers, and many approaches have been developed. Some of them include:
- Todd's recurrent neural network [11]: used to learn to reproduce and compose musical sequences. He used a recurrent link on each input layer, so that the actual pitch is a decaying average of the most recent output values, providing a decaying melody memory.
- CONCERT [10]: a recurrent network that can predict note-by-note and can also learn some musical phrases. Mozer [10] uses a representation of pitch, duration and chords that has a psychological and musical basis.
- Eck and Schmidhuber's LSTM recurrent network [4]: used to learn and compose blues music.
- Laden and Keefe's feedforward network [9]: used to classify chords.
- Chen and Miikkulainen's recurrent network [3]: used to learn musical sequences from a specific style. They also used genetic algorithms to evolve the network.
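Before turning to the proposed methods, the supervised, error-correction principle outlined in the introduction can be illustrated with a minimal sketch. This toy example is ours, not taken from any of the systems above: a single linear neuron adjusts its weights from {input, target} pairs supplied by a "supervisor" until it answers correctly for the training set, and then generalizes to an unseen input.

```python
# A toy illustration of supervised, error-correction learning: a single
# linear neuron trained by repeated weight corrections (the delta rule).

def train_delta_rule(samples, lr=0.1, epochs=200):
    """samples: list of (inputs, target) pairs; returns the learned weights."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                      # +1 for a bias weight
    for _ in range(epochs):
        for x, target in samples:
            xb = list(x) + [1.0]             # append the bias input
            y = sum(wi * xi for wi, xi in zip(w, xb))
            err = target - y                 # error signal from the supervisor
            w = [wi + lr * err * xi for wi, xi in zip(w, xb)]
    return w

def predict(w, x):
    """Answer of the trained neuron for an input pattern."""
    xb = list(x) + [1.0]
    return sum(wi * xi for wi, xi in zip(w, xb))

# Learn y = 2*x1 - x2 from four examples; after training the neuron also
# generalizes to inputs that were not in the training set.
data = [([0, 0], 0), ([1, 0], 2), ([0, 1], -1), ([1, 1], 1)]
w = train_delta_rule(data)
```

After training, `predict(w, [2, 1])` is close to 3 even though that pattern was never shown, which is the generalization property discussed above.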
In this paper, we propose novel methodologies for music composition based on both supervised and unsupervised learning. For the supervised approach, we used a recurrent network trained with the standard back-propagation algorithm. For the unsupervised approach, we selected the SOM (Self-Organizing Map) algorithm. We also propose to add a Nature-based inspiration to the composition stage. Among the several issues discussed in the proposed methodology, we remark: network architecture and parameters, music representation, training algorithms, and composition strategies.

The objective of this work is to investigate and study the viability of different network architectures and training algorithms in computer-aided music composition. This paper is organized as follows: Section 2 discusses the data representation; Section 3 presents the BPTT results; Section 4 presents the SOM results; and Section 5 presents the conclusions.

2. PROPOSED MUSIC REPRESENTATION AND NATURE INSPIRATION
Besides musical knowledge, emotions and intentions, composers generally draw on some inspiration when composing a new melody. In this ongoing work, we complement the composition process with a kind of inspiration from Nature. The new melodies are Nature-based: landscape contours are used as input data during the training and composition stages. The input data (the network's inspiration) are encoded from previously selected images containing landscape contours. Figure 1 presents an example of this representation. The landscape contours are extracted from an image (Figure 1(a)) and converted into integer numbers (Figure 1(b)).

Figure 1. Representing the input data.

Musical measures can be viewed as melodic primitives that are connected with other musical measures to form a melody. Therefore, in the training stage, the musical measures should belong to the same musical style. For now, our musical measures consist of notes and their duration attributes, and each measure has a fixed number of time units. For musical measures with four time units, for example, the quarter note is the time unit, and the half and whole notes have two and four time units, respectively.

At first, our focus is on digital music at the pitch and duration level; i.e., we assume that the pitches and durations of notes are available during training. The rhythmic notations are limited to the eighth note, quarter note, half note and whole note.

There are many possibilities for representing the pitch of a given note. It is possible to specify the actual value of each pitch, or to use the relative transitions (intervals) between successive pitches. In the former case, output units corresponding to each of the possible pitches the network could produce would be necessary: for example, one unit for middle C, one unit for C#, etc. In the interval case, there would be output units corresponding to pitch changes: one output unit would represent a pitch change of +2 half steps, another a change of -1 half step, and so on. The pitch-interval representation is suitable, first, because given a fixed number of output units it is not restricted in the pitch range it can cover; second, because the melodies are less dependent on a specific key [5] [9] [11].

It is also possible to represent the pitch of notes using the harmonic complex representation. This approach is motivated by the spectral structure of a musical sound and by the theory of pitch perception. Listeners are able to distinguish the first five to seven harmonics, and it is assumed that the human ear can individually resolve partials that are separated in frequency by more than a critical bandwidth [9] [11].

We chose to represent the pitches by combining integer numbers and musical intervals. Each note has its own integer number (Figure 2). Each musical interval determines a frequency distance from one note to another, in semitones or logarithmic frequencies. Even with a fixed number of neurons, many note intervals can be covered using this representation. The interval between two notes s and t is given by Equation 1:

int(s, t) = t - s    (1)

where s and t are the integer numbers that represent the musical notes.

Figure 2. Representing the musical notes.

The duration of notes could be represented with separate output units, alongside the pitch output units. These units could encode the duration of notes in a localist fashion, with one unit designating a whole note, another unit a quarter note, and so on. Another possibility is to use a distributed representation, i.e., the number of units "on" would represent the duration of the current note in sixteenth notes [11].
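The pitch-interval computation of Equation 1 and a measure encoding built on it can be sketched as follows. The integer note numbers below are illustrative (a simple chromatic numbering); the paper's Figure 2 defines the actual mapping, which is not reproduced here.

```python
# Sketch of the pitch-interval representation (Equation 1) combined with a
# per-time-unit duration flag, as described in Section 2. The note-number
# mapping is an illustrative assumption, not the paper's Figure 2.

NOTE_NUMBERS = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def interval(s: int, t: int) -> int:
    """Equation 1: int(s, t) = t - s, the interval from note s to note t."""
    return t - s

def encode_measure(notes, durations):
    """Encode a measure as (pitch intervals, on/off duration flags).

    `notes` holds one integer note number per time unit; `durations` marks
    with 1 the time units where a new note begins and with 0 the time units
    that continue a previously started note.
    """
    intervals = [interval(notes[i], notes[i + 1]) for i in range(len(notes) - 1)]
    return intervals, list(durations)

# One pitch A held for two time units vs. two repeated one-unit notes of A:
held_note = encode_measure([9, 9], [1, 0])
two_notes = encode_measure([9, 9], [1, 1])
```

Note that the two encodings above differ only in the duration flags, which is exactly the distinction the on/off representation is meant to preserve for repeated pitches.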
We represent the duration of notes with 1 and 0, indicating "on" and "off", respectively. The duration is "on" when a new note begins and "off" when the time unit covers the continuation of a previously started note. Using this representation, it is possible to know whether the network output {A, A} means two notes of pitch A, each lasting one time unit, or one pitch A held for two time units. This distinction is essential; otherwise, the network would be unable to deal with melodies containing repeated notes of the same pitch. As an example, consider Figure 3(a), which shows a musical measure extracted from the song "Marcha Soldado" [13], and Figure 3(b), which illustrates the representation of this musical measure in the network, with the neurons (yi) as output neurons.

Figure 3. Representing the musical measure.

The following two sections describe the proposed connectionist approaches to music composition, with supervised and unsupervised learning. As training samples, we used musical measures extracted from Brazilian folk songs (such as "Mulher Rendeira", "O Cravo e a Rosa", "Peixe Vivo" and others [13]), since they are simple and monophonic.

3. MUSIC COMPOSITION WITH BPTT
The network model used in the experiments for the supervised learning approach (Figure 4) is a BPTT (Back-Propagation Through Time) network [7] with recurrent inputs (xi) and non-recurrent inputs (ii), a hidden layer (zi neurons) and an output layer (yi neurons) that represents the training measures. This network was trained with the standard error back-propagation algorithm [7] [12]. In the training stage, the network should learn the musical measures chosen by the user; in the composition stage, the network should be able to compose new melodies based on the musical measures previously trained.

Figure 4. Network architecture.

Error back-propagation is a supervised learning algorithm based on error correction. Basically, this learning can be divided into two steps [7] [12]:

- Propagation: the input array is applied to the sensorial units and its effect propagates through the entire network, layer by layer. An output set is then produced as the actual network output. In the propagation step, the weights are fixed.
- Back-propagation: the actual network output is compared to the desired target, and the difference between the two produces the error signal. This error signal is back-propagated through the network and used to adjust the weights.

3.1 BPTT EXPERIMENTS AND RESULTS
The learning rate changes dynamically, according to the network performance. The weight updates are online, i.e., the weights are updated for each {input, target output} pair applied to the network. The neurons use sigmoid activation functions in the hidden layer and linear activation functions in the output layer.

The musical measures from the training set and the landscape contour patterns form the input vectors in the training process. When composing, the network initially receives one measure, which may be the first one from the training set, and the same or very similar landscape patterns.

Generally, in the untrained situation, the outputs produced will be very different from the desired target values, and this can interfere with the weight update equations. As the network trains, the outputs get closer and closer to the targets, until they are sufficiently close and the training phase can stop [5] [11].

We tried to optimize this training in two ways. One of them is the teacher forcing technique [5] [11]. Consider the network fully trained; in this case, the outputs will match the targets. The teacher forcing technique disregards the output values obtained by the network and instead feeds the target values back to the input units, since the target values are known during training. This makes the training phase shorter. Figure 5 shows the melody created by the network for 10 musical measures using teacher forcing.

Figure 5. Musical measures generated by the network using teacher forcing.
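One online training step of such a network can be sketched as follows. This is an illustrative toy, not the authors' implementation: the layer sizes, learning rate and single-step recurrence are our assumptions. It combines sigmoid hidden units, linear output units, online error back-propagation, and teacher forcing in which the fed-back input is the previous target jittered by Gaussian noise with variance 0.001, as described in the text.

```python
import numpy as np

# Minimal sketch (assumed sizes, not the paper's network): sigmoid hidden
# layer, linear output layer, online back-propagation, and noisy teacher
# forcing on the recurrent inputs.

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(W_hid, W_out, x_plan, y_prev_target, y_target, lr=0.1):
    """One online back-propagation step; updates the weights in place."""
    # Teacher forcing with Gaussian jitter: feed back the previous *target*
    # plus N(0, 0.001) noise instead of the network's own previous output.
    y_feedback = y_prev_target + rng.normal(0.0, np.sqrt(0.001),
                                            y_prev_target.shape)
    x = np.concatenate([x_plan, y_feedback])   # non-recurrent + recurrent inputs
    z = sigmoid(W_hid @ x)                     # hidden layer (sigmoid)
    y = W_out @ z                              # output layer (linear)
    err = y_target - y                         # error signal
    # Back-propagation: hidden deltas use the sigmoid derivative z * (1 - z).
    delta_h = (W_out.T @ err) * z * (1.0 - z)
    W_out += lr * np.outer(err, z)
    W_hid += lr * np.outer(delta_h, x)
    return y

# Demo: fit one {input, target} pair; the error shrinks despite the jitter.
init = np.random.default_rng(0)
W_hid = init.normal(0.0, 0.1, (4, 5))
W_out = init.normal(0.0, 0.1, (2, 4))
x_plan = np.array([1.0, 0.0, 0.0])         # non-recurrent (contour-style) input
y_prev = np.array([0.2, 0.1])              # previous target measure
y_goal = np.array([0.5, -0.3])             # current target measure
for _ in range(500):
    y = train_step(W_hid, W_out, x_plan, y_prev, y_goal)
```

Because the fed-back targets carry small random variations, the trained weights already account for the imperfect feedback the network will see when composing, which is the point of the Gaussian jitter discussed in the text.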
Figure 6. Musical measures composed by the network using the Gaussian probability function.

However, teacher forcing has some drawbacks [5]. First, it cannot be applied to the hidden units, since their target values are not available. Second, even when the network is fully trained, its output values may never be exactly equal to the targets; so, when the network is used on new examples after learning, the actual output is fed back and contains variations not present during the training stage. To minimize this last drawback, we incorporate a Gaussian probability function in the input units, with mean equal to the unit's desired output and a very small variance (for example, 0.001). This means that the input units do not receive the exact desired output values; instead, they receive random values belonging to a small interval centered on the desired output value of the previous training step. The network thus learns how to deal with variations during training and performs better when composing. Figure 6 shows the melody created by the network for 10 musical measures using the probability function in the input units.

4. MUSIC COMPOSITION WITH SOM
A Self-Organizing Map (SOM) is a category of ANN used in unsupervised learning, i.e., the correct output cannot be defined a priori, and therefore a numerical measure of the magnitude of the error cannot be used in the training stage [7] [12]. A self-organizing map consists of a single-layer feed-forward network whose outputs are arranged in a low-dimensional (usually 2D) grid, as shown in Figure 7. Each input is connected to all output neurons, and attached to every neuron is a weight vector of the same dimension as the input vectors. All neurons are linear. This network is capable of learning and extracting important features contained in the given input space. The fundamental principle behind this network model is competitive learning, in which neurons compete against each other.

Figure 7. SOM network architecture.

Our motivation for SOM unsupervised learning is the possibility of creating an N-dimensional feature space of musical notes and using it to generate new musical measures. Our SOM is trained with measure samples and, as in the BPTT approach, the data extracted from landscape contours are presented as input during the composition stage. The feature map obtained as the output of a SOM network has a series of important properties. One of them states that the obtained feature map, represented by the neuron weights, provides a good approximation to the input space; that is, the final configuration of weights defines a suitable representation for the specified notes. The basic algorithm for the unsupervised training of a SOM network is given below [7] [12].

SOM Unsupervised Learning:
1. Assign uniform random initial weights belonging to the interval [-1, 1] to each neuron in the grid.
2. Choose an input pattern xk as the input vector to all neurons.
3. Identify the winner neuron wi.
4. Update the winner neuron's weights through:
   wi(t+1) = wi(t) + α [xk − wi]
   where 0 < α < 1 is the learning rate.
5. Update the neighborhood of wi using:
   wj(t+1) = wj(t) + α f(r) [xk − wj]
   where f(r) is a monotonically decreasing function of r (the Euclidean distance between wj and the winner neuron wi).
6. Reduce the learning rate (e.g., α(t+1) = α(t) − 0.01).
7. Repeat steps 2 to 6 for all input patterns (k = 1, ..., N).
8. Repeat steps 2 to 7 MAX_ITERATION times.

4.1 SOM EXPERIMENTS AND RESULTS
The SOM neural network implemented is composed of 64 neurons arranged in a 2D mesh topology. The network input is a musical measure vector of four elements, each corresponding to a note pitch. During the training stage, 18 musical measure samples extracted from Brazilian folk songs were used. We use α = 0.5 as the learning rate and 1 / (1 + r) as the decaying function for the neighborhood update, where r is the Euclidean distance between the winner neuron and its neighbor. In the composition stage, a vector of four elements representing the data extracted from landscape contours is presented at the network input; the winner neuron's weights then determine the resulting measure. Figure 8 shows examples of the generated melodies.
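The eight-step training loop above can be sketched as follows, using the Section 4.1 settings (64 neurons in an 8×8 grid, four-element input vectors, α = 0.5, and neighborhood function f(r) = 1/(1 + r)); this is an illustrative sketch of the listed algorithm, not the authors' code. Since f(0) = 1, the winner update (step 4) is the r = 0 case of the neighborhood update (step 5), so the two are applied as one vectorized update.

```python
import numpy as np

# Sketch of the SOM training algorithm listed above, with the Section 4.1
# settings (8x8 grid, 4-element inputs, alpha = 0.5, f(r) = 1/(1 + r)).

rng = np.random.default_rng(0)

def train_som(patterns, grid=8, dim=4, alpha=0.5, max_iteration=20):
    # Step 1: uniform random initial weights in [-1, 1] for each neuron.
    w = rng.uniform(-1.0, 1.0, size=(grid, grid, dim))
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid),
                                  indexing="ij"), axis=-1)
    for _ in range(max_iteration):                    # step 8
        for x in patterns:                            # steps 2 and 7
            # Step 3: the winner is the neuron whose weights are closest to x.
            d = np.linalg.norm(w - x, axis=-1)
            wi = np.unravel_index(np.argmin(d), d.shape)
            # Steps 4-5: pull winner and neighbors toward x; f(0) = 1 covers
            # the winner itself, f decays with grid distance r to the winner.
            r = np.linalg.norm(coords - np.array(wi), axis=-1)
            f = 1.0 / (1.0 + r)
            w += alpha * f[..., None] * (x - w)
        alpha = max(alpha - 0.01, 0.01)               # step 6: reduce alpha
    return w

def compose_measure(w, contour):
    """Composition stage: the winner's weights give the output measure."""
    d = np.linalg.norm(w - contour, axis=-1)
    wi = np.unravel_index(np.argmin(d), d.shape)
    return w[wi]

# Demo: a map trained on a single measure converges to that measure.
p = np.array([0.5, -0.5, 0.25, 0.0])
w = train_som([p])
```

Presenting a landscape-contour vector to `compose_measure` then returns the stored measure that best matches it, which is how the composition stage described above reads a measure off the trained map.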
Figure 8. Melodies created by the SOM.

5. CONCLUSIONS
Independently of the learning approach, the proposed systems are useful for the automatic generation of primary melodies, on which the user can perform posterior analysis. It is also clear that, although landscape contours were considered here, the origin of this inspiration can be generalized.

Comparing the results of the supervised and unsupervised learning approaches, it can be noted that the two methodologies generate different melodies, since the training samples and learning processes are different. The similarity between the approaches should be the tonality of the resulting melodies when the same measures are used as input. Further, it is also interesting to compare the generated melodies in terms of harmony, surprise and global structure.

Future work includes, besides the necessary system improvements, the development of a quality improvement system to optimize the network application phase. Future work also includes a study of the possibility of extending these architectures to compose polyphonic melodies.

6. ACKNOWLEDGMENTS
We would like to thank CAPES for the student scholarships of Débora C. Corrêa and João F. Mari.

7. REFERENCES
[1] BRAGA, A. P.; LUDERMIR, T. B.; CARVALHO, A. P. "Redes Neurais Artificiais – Teoria e Aplicações." Rio de Janeiro - RJ, LTC, 2000.
[2] CARPINTEIRO, O. A. S. (2001) "A neural model to segment musical pieces." In Proceedings of the Second Brazilian Symposium on Computer Music, Fifteenth Congress of the Brazilian Computer Society, p. 114-120.
[3] CHEN, C. C. J.; MIIKKULAINEN, R. (2001) "Creating Melodies with Evolving Recurrent Neural Network." In Proceedings of the International Joint Conference on Neural Networks, IJCNN'01, 2001, p. 2241-2246, Washington - DC.
[4] ECK, D.; SCHMIDHUBER, J. (2002) "A First Look at Music Composition using LSTM Recurrent Neural Networks." Technical Report IDSIA-07-02.
[5] FRANKLIN, J. A. (2006) "Recurrent Neural Networks for Music Computation." INFORMS Journal on Computing, Vol. 18, No. 3, pp. 321-338.
[6] FREEMAN, J. A.; SKAPURA, D. M. Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley, 1991.
[7] HAYKIN, S. "Neural Networks – A Comprehensive Foundation." Prentice Hall, USA, 1999.
[8] KOVÁCS, Z. L. "Redes Neurais Artificiais: Fundamentos e Aplicações." Editora Livraria da Física, São Paulo, 1986.
[9] LADEN, B.; KEEFE, D. H. (1989) "The Representation of Pitch in a Neural Net Model for Chord Classification." Computer Music Journal, Vol. 13, No. 4.
[10] MOZER, M. C. (1994) "Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multiscale processing." Connection Science, 6(2-3), p. 247-280.
[11] TODD, P. M. (1989) "A Connectionist Approach to Algorithmic Composition." Computer Music Journal, Vol. 13, No. 4.
[12] ROJAS, R. Neural Networks: A Systematic Introduction. Springer, 1996.
[13] YOGI, C. "Aprendendo e Brincando com Música. Vol. 1." Editora Fapi, Minas Gerais - MG, 2003.