Neural Network Based Systems for Computer-Aided Musical Composition: Supervised vs. Unsupervised Learning

Débora C. Corrêa
Universidade Federal de São Carlos
São Carlos, SP, Brasil
+55-16-3351-8579
debora correa@dc.ufscar.br

José H. Saito
Universidade Federal de São Carlos
São Carlos, SP, Brasil
+55-16-3351-8579
saito@dc.ufscar.br

João F. Mari
Universidade Federal de São Carlos
São Carlos, SP, Brasil
+55-16-3351-8579
joão mari@dc.ufscar.br

Alexandre L. M. Levada
Universidade de São Paulo, Instituto de Física de São Carlos
São Carlos, SP, Brasil
+55-16-3351-8579
alexandreluis@ursa.ifsc.usp.br
ABSTRACT
This ongoing project describes neural network applications for aiding musical composition, using natural landscape contours as inspiration. We propose supervised and unsupervised learning approaches, using Back-Propagation Through Time (BPTT) and Self-Organizing Map (SOM) neural networks. In the supervised approach, the network learns aspects of musical structure from measure examples taken from the melodies of the training set, and uses the learned measures to compose new melodies with data extracted from landscape contours as input. In the unsupervised approach, the network likewise uses measure examples as input during training and the data extracted from landscape contours in the composition stage. The results obtained show the viability of both approaches.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning – Connectionism and Neural Nets.
J.5.6 [Arts and Humanities]: Performing Arts.

General Terms
Algorithms, Experimentation, Human Factors.

Keywords
Artificial neural networks, self-organizing maps, back-propagation, musical composition, learning.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'08, March 16-20, 2008, Fortaleza, Ceará, Brazil.
Copyright 2008 ACM 978-1-59593-753-7/08/0003…$5.00.

1. INTRODUCTION
Musical computation, including reproduction and generation, has long attracted researchers. Interest in musical composition by computers dates back to the 1950s, when Markov chains were used to generate melodies [5]. Since music students often learn composition by example, early approaches were motivated by and based on pattern analysis of existing music. More recently, artificial neural networks (ANNs) have been deployed as models for learning musical processes [2] [3] [4] [5] [9] [10] [11].

ANNs, also known as connectionist systems, represent a non-algorithmic form of computation inspired by the structure and processing of the human brain. In this computing approach, computation is performed by a set of many simple processing units, the neurons, connected in a network and acting in parallel. The neurons are connected by weights, which store the network's knowledge. To represent a desired solution to a problem, ANNs undergo a training (learning) stage, which consists of presenting a set of examples (the training dataset) to the network so that it can extract the features necessary to represent the given information [1] [7] [8] [12]. This process can be divided into two groups: supervised and unsupervised learning [1] [7] [8] [12].

Supervised learning is performed by input-output mapping, i.e., the input patterns and the corresponding desired outputs are given to the network by a supervisor. The goal of this learning approach is to adjust the network parameters until the network answers correctly for every input pattern in the training set. After the training stage, the network can generalize what was learned; this occurs when the network produces suitable outputs for inputs that were not included in training [7] [8] [12].

In unsupervised learning, the exact numerical output the network is supposed to produce is unknown; only the input patterns are available. The network learns to establish statistical regularities from the input patterns and develops the ability to form internal representations that encode input features and create new groups. Since no teacher or supervisor is present, the network must organize itself to associate clusters with units [7] [8] [12].

For musical computation, connectionist systems, like other machine-learning systems, are able to learn patterns and features present in the melodies of the training set and generalize them to compose new melodies. The use of neural networks in music learning and composition has therefore attracted researchers, and many approaches have been developed. Some of them include:
- Todd's recurrent neural network [11]: used to learn to reproduce and compose musical sequences. He used a recurrent link on each input layer, so that the actual pitch is a decaying average of the most recent output values, providing a decaying melody memory.
- CONCERT [10]: a recurrent network that can predict note-by-note and can also learn some musical phrases. Mozer [10] uses a representation of pitch, duration and chords that has a psychological and musical basis.
- Eck and Schmidhuber's LSTM recurrent network [4]: used to learn and compose blues music.
- Laden and Keefe's feedforward network [9]: used to classify chords.
- Chen and Miikkulainen's recurrent network [3]: used to learn musical sequences from a specific style. They also used genetic algorithms to evolve the network.
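Before turning to the proposed methods, the supervised, error-correction principle outlined in the introduction can be illustrated with a minimal sketch. This toy example is ours, not taken from any of the systems above: a single linear neuron adjusts its weights from {input, target} pairs supplied by a "supervisor" until it answers correctly for the training set, and then generalizes to an unseen input.

```python
# A toy illustration of supervised, error-correction learning: a single
# linear neuron trained by repeated weight corrections (the delta rule).

def train_delta_rule(samples, lr=0.1, epochs=200):
    """samples: list of (inputs, target) pairs; returns the learned weights."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                      # +1 for a bias weight
    for _ in range(epochs):
        for x, target in samples:
            xb = list(x) + [1.0]             # append the bias input
            y = sum(wi * xi for wi, xi in zip(w, xb))
            err = target - y                 # error signal from the supervisor
            w = [wi + lr * err * xi for wi, xi in zip(w, xb)]
    return w

def predict(w, x):
    """Answer of the trained neuron for an input pattern."""
    xb = list(x) + [1.0]
    return sum(wi * xi for wi, xi in zip(w, xb))

# Learn y = 2*x1 - x2 from four examples; after training the neuron also
# generalizes to inputs that were not in the training set.
data = [([0, 0], 0), ([1, 0], 2), ([0, 1], -1), ([1, 1], 1)]
w = train_delta_rule(data)
```

After training, `predict(w, [2, 1])` is close to 3 even though that pattern was never shown, which is the generalization property discussed above.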
In this paper, we propose novel methodologies for music composition based on both supervised and unsupervised learning. For the supervised approach, we used a recurrent network trained with the standard back-propagation algorithm. For the unsupervised approach, we selected the SOM (Self-Organizing Map) algorithm. We also propose to add a Nature-based inspiration to the composition stage. Among the several issues discussed in the proposed methodology, we remark: network architecture and parameters, music representation, training algorithms, and composition strategies.

The objective of this work is to investigate and study the viability of different network architectures and training algorithms in computer-aided music composition. This paper is organized as follows: Section 2 discusses the data representation; Section 3 presents the BPTT results; Section 4 presents the SOM results; and Section 5 presents the conclusions.

2. PROPOSED MUSIC REPRESENTATION AND NATURE INSPIRATION
Besides musical knowledge, emotions and intentions, composers generally draw on some inspiration when composing a new melody. In this ongoing work, we complement the composition process with a kind of inspiration from Nature. The new melodies are Nature-based: landscape contours are used as input data during the training and composition stages. The input data (the network's inspiration) are encoded from previously selected images containing landscape contours. Figure 1 presents an example of this representation. The landscape contours are extracted from an image (Figure 1(a)) and converted into integer numbers (Figure 1(b)).

Figure 1. Representing the input data.

Musical measures can be viewed as melodic primitives that are connected with other musical measures to form a melody. Therefore, in the training stage, the musical measures should belong to the same musical style. For now, our musical measures consist of notes and their duration attributes, and each measure has a fixed number of time units. For musical measures with four time units, for example, the quarter note is the time unit, and the half and whole notes have two and four time units, respectively.

At first, our focus is on digital music at the pitch and duration level; i.e., we assume that the pitches and durations of notes are available during training. The rhythmic notations are limited to the eighth note, quarter note, half note and whole note.

There are many possibilities for representing the pitch of a given note. It is possible to specify the actual value of each pitch, or to use the relative transitions (intervals) between successive pitches. In the former case, output units corresponding to each of the possible pitches the network could produce would be necessary: for example, one unit for middle C, one unit for C#, etc. In the interval case, there would be output units corresponding to pitch changes: one output unit would represent a pitch change of +2 half steps, another a change of -1 half step, and so on. The pitch-interval representation is suitable, first, because given a fixed number of output units it is not restricted in the pitch range it can cover; second, because the melodies are less dependent on a specific key [5] [9] [11].

It is also possible to represent the pitch of notes using the harmonic complex representation. This approach is motivated by the spectral structure of a musical sound and by the theory of pitch perception. Listeners are able to distinguish the first five to seven harmonics, and it is assumed that the human ear can individually resolve partials that are separated in frequency by more than a critical bandwidth [9] [11].

We chose to represent the pitches by combining integer numbers and musical intervals. Each note has its own integer number (Figure 2). Each musical interval determines a frequency distance from one note to another, in semitones or logarithmic frequencies. Even with a fixed number of neurons, many note intervals can be covered using this representation. The interval between two notes s and t is given by Equation 1:

int(s, t) = t - s    (1)

where s and t are the integer numbers that represent the musical notes.

Figure 2. Representing the musical notes.

The duration of notes could be represented with separate output units, alongside the pitch output units. These units could encode the duration of notes in a localist fashion, with one unit designating a whole note, another unit a quarter note, and so on. Another possibility is to use a distributed representation, i.e., the number of units "on" would represent the duration of the current note in sixteenth notes [11].
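The pitch-interval computation of Equation 1 and a measure encoding built on it can be sketched as follows. The integer note numbers below are illustrative (a simple chromatic numbering); the paper's Figure 2 defines the actual mapping, which is not reproduced here.

```python
# Sketch of the pitch-interval representation (Equation 1) combined with a
# per-time-unit duration flag, as described in Section 2. The note-number
# mapping is an illustrative assumption, not the paper's Figure 2.

NOTE_NUMBERS = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def interval(s: int, t: int) -> int:
    """Equation 1: int(s, t) = t - s, the interval from note s to note t."""
    return t - s

def encode_measure(notes, durations):
    """Encode a measure as (pitch intervals, on/off duration flags).

    `notes` holds one integer note number per time unit; `durations` marks
    with 1 the time units where a new note begins and with 0 the time units
    that continue a previously started note.
    """
    intervals = [interval(notes[i], notes[i + 1]) for i in range(len(notes) - 1)]
    return intervals, list(durations)

# One pitch A held for two time units vs. two repeated one-unit notes of A:
held_note = encode_measure([9, 9], [1, 0])
two_notes = encode_measure([9, 9], [1, 1])
```

Note that the two encodings above differ only in the duration flags, which is exactly the distinction the on/off representation is meant to preserve for repeated pitches.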
We represent the duration of notes with 1 and 0, indicating "on" and "off", respectively. The duration is "on" when a new note begins and "off" when the time unit covers the continuation of a previously started note. Using this representation, it is possible to know whether the network output {A, A} means two notes of pitch A, each lasting one time unit, or one pitch A held for two time units. This distinction is essential; otherwise, the network would be unable to deal with melodies containing repeated notes of the same pitch. As an example, consider Figure 3(a), which shows a musical measure extracted from the song "Marcha Soldado" [13], and Figure 3(b), which illustrates the representation of this musical measure in the network, with the neurons (yi) as output neurons.

Figure 3. Representing the musical measure.

The following two sections describe the proposed connectionist approaches to music composition, with supervised and unsupervised learning. As training samples, we used musical measures extracted from Brazilian folk songs (such as "Mulher Rendeira", "O Cravo e a Rosa", "Peixe Vivo" and others [13]), since they are simple and monophonic.

3. MUSIC COMPOSITION WITH BPTT
The network model used in the experiments for the supervised learning approach (Figure 4) is a BPTT (Back-Propagation Through Time) network [7] with recurrent inputs (xi) and non-recurrent inputs (ii), a hidden layer (zi neurons) and an output layer (yi neurons) that represents the training measures. This network was trained with the standard error back-propagation algorithm [7] [12]. In the training stage, the network should learn the musical measures chosen by the user; in the composition stage, the network should be able to compose new melodies based on the musical measures previously trained.

Figure 4. Network architecture.

Error back-propagation is a supervised learning algorithm based on error correction. Basically, this learning can be divided into two steps [7] [12]:

- Propagation: the input array is applied to the sensorial units and its effect propagates through the entire network, layer by layer. An output set is then produced as the actual network output. In the propagation step, the weights are fixed.
- Back-propagation: the actual network output is compared to the desired target, and the difference between the two produces the error signal. This error signal is back-propagated through the network and used to adjust the weights.

3.1 BPTT EXPERIMENTS AND RESULTS
The learning rate changes dynamically, according to the network performance. The weight updates are online, i.e., the weights are updated for each {input, target output} pair applied to the network. The neurons use sigmoid activation functions in the hidden layer and linear activation functions in the output layer.

The musical measures from the training set and the landscape contour patterns form the input vectors in the training process. When composing, the network initially receives one measure, which may be the first one from the training set, and the same or very similar landscape patterns.

Generally, in the untrained situation, the outputs produced will be very different from the desired target values, and this can interfere with the weight update equations. As the network trains, the outputs get closer and closer to the targets, until they are sufficiently close and the training phase can stop [5] [11].

We tried to optimize this training in two ways. One of them is the teacher forcing technique [5] [11]. Consider the network fully trained; in this case, the outputs will match the targets. The teacher forcing technique disregards the output values obtained by the network and instead feeds the target values back to the input units, since the target values are known during training. This makes the training phase shorter. Figure 5 shows the melody created by the network for 10 musical measures using teacher forcing.

Figure 5. Musical measures generated by the network using teacher forcing.
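One online training step of such a network can be sketched as follows. This is an illustrative toy, not the authors' implementation: the layer sizes, learning rate and single-step recurrence are our assumptions. It combines sigmoid hidden units, linear output units, online error back-propagation, and teacher forcing in which the fed-back input is the previous target jittered by Gaussian noise with variance 0.001, as described in the text.

```python
import numpy as np

# Minimal sketch (assumed sizes, not the paper's network): sigmoid hidden
# layer, linear output layer, online back-propagation, and noisy teacher
# forcing on the recurrent inputs.

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(W_hid, W_out, x_plan, y_prev_target, y_target, lr=0.1):
    """One online back-propagation step; updates the weights in place."""
    # Teacher forcing with Gaussian jitter: feed back the previous *target*
    # plus N(0, 0.001) noise instead of the network's own previous output.
    y_feedback = y_prev_target + rng.normal(0.0, np.sqrt(0.001),
                                            y_prev_target.shape)
    x = np.concatenate([x_plan, y_feedback])   # non-recurrent + recurrent inputs
    z = sigmoid(W_hid @ x)                     # hidden layer (sigmoid)
    y = W_out @ z                              # output layer (linear)
    err = y_target - y                         # error signal
    # Back-propagation: hidden deltas use the sigmoid derivative z * (1 - z).
    delta_h = (W_out.T @ err) * z * (1.0 - z)
    W_out += lr * np.outer(err, z)
    W_hid += lr * np.outer(delta_h, x)
    return y

# Demo: fit one {input, target} pair; the error shrinks despite the jitter.
init = np.random.default_rng(0)
W_hid = init.normal(0.0, 0.1, (4, 5))
W_out = init.normal(0.0, 0.1, (2, 4))
x_plan = np.array([1.0, 0.0, 0.0])         # non-recurrent (contour-style) input
y_prev = np.array([0.2, 0.1])              # previous target measure
y_goal = np.array([0.5, -0.3])             # current target measure
for _ in range(500):
    y = train_step(W_hid, W_out, x_plan, y_prev, y_goal)
```

Because the fed-back targets carry small random variations, the trained weights already account for the imperfect feedback the network will see when composing, which is the point of the Gaussian jitter discussed in the text.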
Figure 6. Musical measures composed by the network using the Gaussian probability function.

However, teacher forcing has some drawbacks [5]. First, it cannot be applied to the hidden units, since their target values are not available. Second, even when the network is fully trained, its output values may never be exactly equal to the targets; so, when the network is used on new examples after learning, the actual output is fed back and contains variations not present during the training stage. To minimize this last drawback, we incorporate a Gaussian probability function in the input units, with mean equal to the unit's desired output and a very small variance (for example, 0.001). This means that the input units do not receive the exact desired output values; instead, they receive random values belonging to a small interval centered on the desired output value of the previous training step. The network thus learns how to deal with variations during training and performs better when composing. Figure 6 shows the melody created by the network for 10 musical measures using the probability function in the input units.

4. MUSIC COMPOSITION WITH SOM
A Self-Organizing Map (SOM) is a category of ANN used in unsupervised learning, i.e., the correct output cannot be defined a priori, and therefore a numerical measure of the magnitude of the error cannot be used in the training stage [7] [12]. A self-organizing map consists of a single-layer feed-forward network whose outputs are arranged in a low-dimensional (usually 2D) grid, as shown in Figure 7. Each input is connected to all output neurons, and attached to every neuron is a weight vector of the same dimension as the input vectors. All neurons are linear. This network is capable of learning and extracting important features contained in the given input space. The fundamental principle behind this network model is competitive learning, in which neurons compete against each other.

Figure 7. SOM network architecture.

Our motivation for SOM unsupervised learning is the possibility of creating an N-dimensional feature space of musical notes and using it to generate new musical measures. Our SOM is trained with measure samples and, as in the BPTT approach, the data extracted from landscape contours are presented as input during the composition stage. The feature map obtained as the output of a SOM network has a series of important properties. One of them states that the obtained feature map, represented by the neuron weights, provides a good approximation to the input space; that is, the final configuration of weights defines a suitable representation for the specified notes. The basic algorithm for the unsupervised training of a SOM network is given below [7] [12].

SOM Unsupervised Learning:
1. Assign uniform random initial weights belonging to the interval [-1, 1] to each neuron in the grid.
2. Choose an input pattern xk as the input vector to all neurons.
3. Identify the winner neuron wi.
4. Update the winner neuron's weights through:
   wi(t+1) = wi(t) + α [xk − wi]
   where 0 < α < 1 is the learning rate.
5. Update the neighborhood of wi using:
   wj(t+1) = wj(t) + α f(r) [xk − wj]
   where f(r) is a monotonically decreasing function of r (the Euclidean distance between wj and the winner neuron wi).
6. Reduce the learning rate (e.g., α(t+1) = α(t) − 0.01).
7. Repeat steps 2 to 6 for all input patterns (k = 1, ..., N).
8. Repeat steps 2 to 7 MAX_ITERATION times.

4.1 SOM EXPERIMENTS AND RESULTS
The SOM neural network implemented is composed of 64 neurons arranged in a 2D mesh topology. The network input is a musical measure vector of four elements, each corresponding to a note pitch. During the training stage, 18 musical measure samples extracted from Brazilian folk songs were used. We use α = 0.5 as the learning rate and 1 / (1 + r) as the decaying function for the neighborhood update, where r is the Euclidean distance between the winner neuron and its neighbor. In the composition stage, a vector of four elements representing the data extracted from landscape contours is presented at the network input; the winner neuron's weights then determine the resulting measure. Figure 8 shows examples of the generated melodies.
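The eight-step training loop above can be sketched as follows, using the Section 4.1 settings (64 neurons in an 8×8 grid, four-element input vectors, α = 0.5, and neighborhood function f(r) = 1/(1 + r)); this is an illustrative sketch of the listed algorithm, not the authors' code. Since f(0) = 1, the winner update (step 4) is the r = 0 case of the neighborhood update (step 5), so the two are applied as one vectorized update.

```python
import numpy as np

# Sketch of the SOM training algorithm listed above, with the Section 4.1
# settings (8x8 grid, 4-element inputs, alpha = 0.5, f(r) = 1/(1 + r)).

rng = np.random.default_rng(0)

def train_som(patterns, grid=8, dim=4, alpha=0.5, max_iteration=20):
    # Step 1: uniform random initial weights in [-1, 1] for each neuron.
    w = rng.uniform(-1.0, 1.0, size=(grid, grid, dim))
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid),
                                  indexing="ij"), axis=-1)
    for _ in range(max_iteration):                    # step 8
        for x in patterns:                            # steps 2 and 7
            # Step 3: the winner is the neuron whose weights are closest to x.
            d = np.linalg.norm(w - x, axis=-1)
            wi = np.unravel_index(np.argmin(d), d.shape)
            # Steps 4-5: pull winner and neighbors toward x; f(0) = 1 covers
            # the winner itself, f decays with grid distance r to the winner.
            r = np.linalg.norm(coords - np.array(wi), axis=-1)
            f = 1.0 / (1.0 + r)
            w += alpha * f[..., None] * (x - w)
        alpha = max(alpha - 0.01, 0.01)               # step 6: reduce alpha
    return w

def compose_measure(w, contour):
    """Composition stage: the winner's weights give the output measure."""
    d = np.linalg.norm(w - contour, axis=-1)
    wi = np.unravel_index(np.argmin(d), d.shape)
    return w[wi]

# Demo: a map trained on a single measure converges to that measure.
p = np.array([0.5, -0.5, 0.25, 0.0])
w = train_som([p])
```

Presenting a landscape-contour vector to `compose_measure` then returns the stored measure that best matches it, which is how the composition stage described above reads a measure off the trained map.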
Figure 8. Melodies created by the SOM.

5. CONCLUSIONS
Independently of the learning approach, the proposed systems are useful for the automatic generation of primary melodies, on which the user can perform posterior analysis. It is also clear that, although landscape contours were considered here, the origin of this inspiration can be generalized.

Comparing the results of the supervised and unsupervised learning approaches, it can be noted that the two methodologies generate different melodies, since the training samples and learning processes are different. The similarity between the approaches should be the tonality of the resulting melodies when the same measures are used as input. Further, it is also interesting to compare the generated melodies in terms of harmony, surprise and global structure.

Future work includes, besides the necessary system improvements, the development of a quality improvement system to optimize the network application phase. Future work also includes a study of the possibility of extending these architectures to compose polyphonic melodies.

6. ACKNOWLEDGMENTS
We would like to thank CAPES for the student scholarships of Débora C. Corrêa and João F. Mari.

7. REFERENCES
[1] BRAGA, A. P.; LUDERMIR, T. B.; CARVALHO, A. P. "Redes Neurais Artificiais – Teoria e Aplicações." Rio de Janeiro - RJ, LTC, 2000.
[2] CARPINTEIRO, O. A. S. (2001) "A neural model to segment musical pieces." In Proceedings of the Second Brazilian Symposium on Computer Music, Fifteenth Congress of the Brazilian Computer Society, p. 114-120.
[3] CHEN, C. C. J.; MIIKKULAINEN, R. (2001) "Creating Melodies with Evolving Recurrent Neural Network." In Proceedings of the International Joint Conference on Neural Networks, IJCNN'01, 2001, p. 2241-2246, Washington - DC.
[4] ECK, D.; SCHMIDHUBER, J. (2002) "A First Look at Music Composition using LSTM Recurrent Neural Networks." Technical Report IDSIA-07-02.
[5] FRANKLIN, J. A. (2006) "Recurrent Neural Networks for Music Computation." INFORMS Journal on Computing, Vol. 18, No. 3, pp. 321-338.
[6] FREEMAN, J. A.; SKAPURA, D. M. Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley, 1991.
[7] HAYKIN, S. "Neural Networks – A Comprehensive Foundation." Prentice Hall, USA, 1999.
[8] KOVÁCS, Z. L. "Redes Neurais Artificiais: Fundamentos e Aplicações." Editora Livraria da Física, São Paulo, 1986.
[9] LADEN, B.; KEEFE, D. H. (1989) "The Representation of Pitch in a Neural Net Model for Chord Classification." Computer Music Journal, Vol. 13, No. 4.
[10] MOZER, M. C. (1994) "Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multiscale processing." Connection Science, 6(2-3), p. 247-280.
[11] TODD, P. M. (1989) "A Connectionist Approach to Algorithmic Composition." Computer Music Journal, Vol. 13, No. 4.
[12] ROJAS, R. Neural Networks: A Systematic Introduction. Springer, 1996.
[13] YOGI, C. "Aprendendo e Brincando com Música. Vol. 1." Editora Fapi, Minas Gerais - MG, 2003.