Introducing Machine Learning To Parameter Estimation
Section 4 reports the implementation details and the experiments conducted, while Section 5 provides the results of these experiments and discusses them. Finally, in Section 6, conclusions are drawn and open avenues for research are suggested.

2. THE PROPOSED METHOD

A physical model solves a set of differential equations that describe a physical system and requires a set of parameters θ to generate a discrete-time audio sequence s(n). The goal of the model is to approximate acoustic signals generated by the physical system in some perceptual sense. If the model provides a mapping from the parameters to the discrete sequence s(n), the problem of estimating the parameters θ that yield a specific audio sequence identified as a target (e.g. an audio sequence sampled from the physical system we are approximating) is equivalent to finding the model inverse. Finding an inverse mapping (or an approximation thereof) for the model is a challenging task, and a first necessary condition is the existence of the inverse for a given s(n). Usually, however, in physical modelling applications, requirements are less strict, and generally it is only expected that audio signals match in perceptual or qualitative terms, rather than on a sample-by-sample basis. This means that, although a signal r(n) may not be obtainable from the model for any θ, a sufficiently close estimate r̂(n) may. Evaluating the distance between the two signals in psychoacoustic terms is a rather complex task and is out of the scope of this work.

Artificial neural networks, and more specifically the recently introduced deep neural networks, are well established tools for a number of inverse modelling problems. Here, we propose the application of a convolutional neural network that, provided with an audio signal of maximum length L in a suitable time-frequency representation, estimates model parameters θ̂ that, fed to the physical model, yield an audio signal ŝ(n) close to s(n).
To achieve this, the inverse of the model must be learned employing deep machine learning techniques. If a supervised training approach is employed, the network must be fed with audio sequences and the related model parameters, also called targets. The production of a dataset D of such tuples, D = {(θi, si(n)), i = 1, ..., M}, allows the network to learn the mapping that connects them. The production of the dataset is often a lengthy task and may require human effort. However, in this application, the model is known and, once implemented, it can be employed to automatically generate a dataset D in order to train the neural network.
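As an illustration, dataset generation could follow the minimal sketch below. The function render_tone standing in for the physical model, the parameter ranges and the random sampling are hypothetical placeholders, not the actual interface of the patented model; in the experiments of Section 4 the parameters are taken from pre-existing hand-crafted stops rather than drawn at random.

```python
import numpy as np

# Hypothetical stand-in for the flue-pipe physical model: given a parameter
# vector theta and a pitch, it returns a synthesized audio sequence s(n).
def render_tone(theta: np.ndarray, midi_pitch: int, fs: int = 31250) -> np.ndarray:
    raise NotImplementedError("placeholder for the physical model")

def generate_dataset(num_examples: int, num_params: int = 58, midi_pitch: int = 69):
    """Draw parameter vectors and synthesize the corresponding tones."""
    rng = np.random.default_rng(0)
    dataset = []
    for _ in range(num_examples):
        theta = rng.uniform(-1.0, 1.0, size=num_params)  # assumed normalized ranges
        s = render_tone(theta, midi_pitch)
        dataset.append((theta, s))                        # (target parameters, audio) tuple
    return dataset
```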
The neural network architecture proposed here allows for end-to-end learning and is based on convolutional layers. Convolutional neural networks have received attention from a large number of research communities and have found their way into commercial applications, especially in the field of image processing, classification, etc. They are also used with audio signals, where, usually, the signal is provided to the CNN in a suitable time-frequency representation, obtained by means of a Short-Time Fourier Transform (STFT) with appropriate properties of time-frequency localization. The architecture of a CNN is composed of several layers of the following form,

Z(m) = h(σ(Q(m))),          (1)
Q(m) = W(m) ∗ Z(m−1),       (2)

with Z(0) ≡ X, where M denotes the total number of layers, W(m), m = 1, ..., M, are the filter weights to be learned, σ(·) is a non-linear sigmoid activation function, Z(m−1) is the output of layer m − 1, called a feature map, Q(m) is the result of the convolution applied to the previous feature map, and h(·) is a pooling function that reduces the feature map dimensionality. After M convolutional layers, one or more fully connected layers are added. The final layer has size p and outputs an estimate of the model parameters θ̂.

Learning is conducted according to an update rule based on the evaluation of a loss function, such as

ℓ(W, e) = ‖θ − θ̂(e)‖²,       (3)

where e is the training epoch. Training is iterated until a convergence criterion is met or a maximum number of epochs has passed. To avoid overfitting and reduce training times, early stopping by validation can be performed, which consists in evaluating, every fixed number of epochs, the loss calculated against a validation set (the validation loss), i.e. a part of the dataset that is not used for training and is, hence, new to the network. Even if the training loss is still improving, the training may be stopped when the validation loss has not improved for some epochs, avoiding overfitting the network.
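A minimal sketch of such an architecture and training loop is given below. The paper's implementation uses Keras with a Theano backend (Section 4); this sketch uses the present-day tf.keras API, and the input shape, layer sizes and batch size only loosely follow one configuration from Table 2. It is an illustrative outline, not the authors' exact network.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_PARAMS = 58              # output size p: one unit per model parameter
INPUT_SHAPE = (128, 256, 1)  # assumed STFT input size: (freq bins, time frames, 1)

# Convolutional front-end (Eqs. 1-2): convolution, non-linearity, pooling.
model = keras.Sequential([
    keras.Input(shape=INPUT_SHAPE),
    layers.Conv2D(16, (3, 3), activation="tanh"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(16, (3, 3), activation="tanh"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(512, activation="tanh"),          # fully connected layer
    layers.Dense(NUM_PARAMS, activation="tanh"),   # parameter estimate, in [-1, 1]
])

# Loss (Eq. 3): squared error between target and estimated parameters.
model.compile(optimizer=keras.optimizers.Adamax(learning_rate=1e-5), loss="mse")

# Early stopping by validation: stop when the validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                           restore_best_weights=True)

# X_train / X_val: STFT magnitudes; y_train / y_val: normalized parameters.
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=40, epochs=200, callbacks=[early_stop])
```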
Finally, once the network is trained, it can be fed with novel audio sequences to estimate physical model parameters that yield a result close to the original. In the present work we employ additional audio sequences generated by the model in order to measure the distance in the parameter space between the parameters θi and θ̂i. If non-labelled audio sequences are employed (e.g. sequences sampled from the real world), it is not straightforward to validate the generated result, which is why in the present work no attempt has been made to evaluate the estimation performance of the network with real-world signals.

3. USE CASE

The method described in the previous section has been applied to a specific use case of interest, i.e. a patented digital pipe organ physical model. A pipe organ is a rather complex system [14, 15], providing a challenging scenario for physical modelling in itself. This specific model, already employed on a family of commercial digital pipe organs, exposes 58 macro-parameters to be estimated for each key, some of which are intertwined in a non-linear fashion and are acoustically non-orthogonal (i.e. they jointly affect some acoustic features of the resulting tone).

We introduce here some key terms for later use. A pipe organ has one or more keyboards (manuals or divisions), each of which can play several stops, i.e. sets of pipes, typically one or more per key, which can be activated or deactivated at once by means of a drawstop. When a stop is activated, air is ready to be conveyed to the embouchure of each pipe, and when a key is pressed, a valve is opened to allow air to flow into the pipe. Each stop has a different timbre and pitch (generally the pitch of the central C is expressed in feet, measuring the pipe length). From our standpoint, the concept of stop is very important, since each key in a stop will sound similar to the neighboring ones in terms of timbre and envelope, and each key will trigger different stops which may have similar pitch but different timbre. In a pipe organ it can be expected that pipes in a stop have consistent construction features (e.g. materials, geometry, etc.), and a physical model that mimics that pipe stop may have correlated features along the keys, but this is not an assumption that can be made, so it is necessary to estimate the parameters for each key in a stop.
Figure 1: Overview of the proposed system including the neural network for parameter estimation and the physical model for sound
synthesis.
The physical model employed in this work is meant to emulate the sound of flue pipes and is described in detail in the related patent. To summarize, it is constituted by three main parts:

1. exciter: models the wind jet oscillation that is created in the embouchure and gives rise to an air pressure fluctuation,

2. resonator: a digital waveguide structure that simulates the bore,

3. noise model: a stochastic component that simulates the air noise modulated by the wind jet.

The parameters involved in the sound design are widely different in range and meaning, and include e.g. digital filter coefficients, non-linear function coefficients, gains, etc. The diverse nature of the parameters requires a normalization step, which is conducted on the whole training set and maps each parameter into the range [-1, 1], in order for the CNN to learn all parameters employing the same arithmetic. A de-normalization step is thus required to remap the estimated parameters to their original ranges.
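A minimal sketch of this step, assuming a simple per-parameter min-max mapping (one plausible realization; the actual scaling used for the commercial model is not specified in the text), could look as follows.

```python
import numpy as np

def fit_param_scaler(theta_train: np.ndarray):
    """Per-parameter min/max computed on the whole training set.
    theta_train has shape (num_examples, num_params); assumes each
    parameter actually varies across the training set."""
    return theta_train.min(axis=0), theta_train.max(axis=0)

def normalize(theta, p_min, p_max):
    """Map each parameter into [-1, 1] so the CNN sees a uniform range."""
    return 2.0 * (theta - p_min) / (p_max - p_min) - 1.0

def denormalize(theta_norm, p_min, p_max):
    """Remap network outputs back to the original parameter ranges."""
    return (theta_norm + 1.0) * (p_max - p_min) / 2.0 + p_min
```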
Figure 1 provides an overview of the overall system for parameter estimation and sound generation, including the proposed neural network and the physical model.

4. IMPLEMENTATION

The CNN and the machine learning framework have been implemented as a Python application employing the Keras¹ libraries with Theano² as a backend, running on an Intel i7 Linux machine equipped with 2 x GTX 970 graphics processing units. The physical model is implemented as both an optimized DSP algorithm and a PC application. The application has been employed in the course of this work and has been modified to allow producing batches of audio sequences and labels for each key in a stop (e.g. to produce the dataset, given some specifications). Each audio sequence contains a few seconds of a tone of specific pitch with given parameters.

¹ http://keras.io
² http://deeplearning.net/software/theano/

A dataset of 30 Principal pipe organ stops, each composed of 74 audio files, one per note, has been created, taking the parameters from a database of pre-existing stops hand-crafted by expert musicians to mimic different historic styles and organ builders. The dataset has been split 90%/10% between the training and validation sets respectively, for a total of 1998 samples for the former and 222 for the latter. The only pre-processing conducted is the normalization of the parameters and the extraction of the STFT. A trade-off has been selected in terms of resolution and hop size to allow good tracking of the harmonic peaks and the attack transient. Figure 2 shows the input STFT for an A4 tone used for training the network.

Figure 2: The STFT for an organ sound in the training set. The tone is an A4.
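As an illustration of this pre-processing step, a log-magnitude STFT could be extracted as sketched below. The window length and hop size are assumptions chosen to balance frequency resolution against attack-transient tracking, since the exact values are not reported in the text; the sampling rate of 31250 Hz is taken from Section 5.

```python
import numpy as np
from scipy.signal import stft

FS = 31250    # sampling frequency of the organ tones (see Section 5)
N_FFT = 1024  # assumed window length (~33 ms): frequency vs. time resolution trade-off
HOP = 256     # assumed hop size: keeps the attack transient visible

def audio_to_stft(x: np.ndarray) -> np.ndarray:
    """Return a log-magnitude STFT suitable as CNN input (freq bins x time frames)."""
    _, _, Zxx = stft(x, fs=FS, nperseg=N_FFT, noverlap=N_FFT - HOP)
    return 20.0 * np.log10(np.abs(Zxx) + 1e-8)  # dB scale, guarded against log(0)
```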
The CNN architecture is composed of up to 6 convolutional layers, with optional batch normalization [16] and max pooling, up to 3 fully connected layers and a final activation layer. For the training, stochastic gradient descent (SGD), Adam and Adamax optimizers have been tested. The training made use of early stopping on the validation set. A coarse random search has been performed first to pinpoint some of the best performing hyperparameters. A finer random search has then been conducted, keeping the best hyperparameters from the previous search constant. Tests have been conducted with other stops not belonging to the training set and averaged over all the keys in the stops.
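A coarse-to-fine random search of this kind can be sketched as follows. The search space and the train_and_evaluate helper are hypothetical, since the text lists only the hyperparameter families that were explored.

```python
import random

# Hypothetical search space, loosely following the families explored in the text.
COARSE_SPACE = {
    "optimizer": ["sgd", "adam", "adamax"],
    "activation": ["A", "B", "C"],            # activation classes, see Section 5
    "minibatch": [25, 40, 50, 740],
    "conv_kernels": [(16, 16), (16, 16, 32, 32), (4, 6, 8, 10)],
    "dense_size": [128, 256, 512, 1024],
}

def sample(space):
    """Draw one random configuration from the search space."""
    return {k: random.choice(v) for k, v in space.items()}

def random_search(space, trials, train_and_evaluate):
    """Return (validation MSE, config) for the best of `trials` random draws."""
    results = [(train_and_evaluate(cfg), cfg)
               for cfg in (sample(space) for _ in range(trials))]
    return min(results, key=lambda r: r[0])

# Coarse search first; the fine search would then re-sample only the remaining
# free hyperparameters while keeping the best coarse choices constant.
# best_mse, best_cfg = random_search(COARSE_SPACE, trials=50, train_and_evaluate=...)
```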
5. RESULTS

The loss used in training, validation and testing is the Mean Square Error (MSE) calculated at the output of the network with respect to the target, before de-normalization. Results are, therefore, evaluated in terms of MSE.
Table 2 reports the 15 best hyperparameter combinations in the fine random search. The following activation function combinations are reported in Table 2:

1. A: employs tanh for all layers,

2. B: employs tanh for all layers besides the last one, which uses a Rectified Linear Unit (ReLU) [17],

3. C: employs ReLU functions for all layers,

4. all other combinations of the two aforementioned activation functions obtained a higher MSE score and are not included here.

Activations class   Minibatch size   Internal layers size          MSE
A                   40               (16, 16, 512, 58)             0.261
B                   50               (4, 6, 8, 10, 512, 58)        0.203
A                   40               (16, 16, 32, 32, 512, 58)     0.164
A                   40               (16, 16, 32, 32, 512, 58)     0.139
A                   25               (16, 16, 32, 32, 1024, 58)    0.161
A                   40               (16, 16, 32, 32, 1024, 58)    0.266
A                   25               (16, 16, 32, 32, 512, 58)     0.156
A                   40               (16, 16, 32, 32, 1024, 58)    0.252
A                   50               (4, 6, 8, 10, 512, 58)        0.166
A                   40               (16, 16, 32, 32, 256, 58)     0.179
A                   25               (2, 2, 4, 4, 128, 58)         0.254
C                   740              (16, 16, 32, 32, 512, 58)     0.252
B                   50               (4, 6, 8, 10, 512, 58)        0.214
C                   740              (16, 16, 32, 32, 512, 58)     0.257
B                   40               (16, 16, 512, 58)             0.179

Table 2: The best 15 results of the fine random hyperparameter search. Activation classes are described in the text. The MSE is evaluated before de-normalization, thus all parameters have the same weight. Please note: all the layers are convolutional with kernel size as indicated, except for the second to last, which is a fully connected layer. The output layer has fixed size equal to the number of model parameters.

Results are provided against a test set of 222 samples from three organ stops. All the reported combinations share the same learning rate (1E-5), momentum max (0.9), pool sizes (2x2 for each convolutional layer), receptive field sizes (3x3 for each convolutional layer) and optimizer (Adamax [18]). These fixed hyperparameters have been selected as the best ones after the coarse random search.

Figure 4 shows the training and validation loss plots for the first 200 training epochs for the first combination in Table 2. The loss is based on the MSE for all parameters before de-normalization. This means all parameters contribute to the MSE with the same weight, which makes the results clearer to evaluate. Indeed, if the MSE were evaluated after de-normalization, parameters with larger excursion ranges would have a larger effect on the loss (e.g. a delay line length versus a digital filter pole coefficient). Early stopping is performed when the minimum validation loss is achieved, to prevent overfitting and reduce training times. In Figure 4, e.g., the validation loss minimum (0.027) occurs at epoch 122, while the training loss minimum (0.001) occurs at epoch 198. Two spectra and their waveforms are shown in Figure 3, showing the similarity of the original tone and the estimated one, both obtained from the flue pipe model.
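The weighting effect of de-normalization mentioned above can be illustrated with a small worked example; the parameter ranges and error values below are made up for illustration only.

```python
import numpy as np

# Two parameters with very different ranges, e.g. a delay-line length in
# samples and a digital filter pole coefficient.
p_min = np.array([0.0, -1.0])
p_max = np.array([2000.0, 1.0])

theta_norm = np.array([0.10, 0.10])       # normalized target parameters
theta_hat_norm = np.array([0.12, 0.30])   # normalized network estimate

def denorm(t):
    """Remap normalized values back to the original parameter ranges."""
    return (t + 1.0) * (p_max - p_min) / 2.0 + p_min

mse_before = np.mean((theta_norm - theta_hat_norm) ** 2)        # both weighted equally
mse_after = np.mean((denorm(theta_norm) - denorm(theta_hat_norm)) ** 2)

print(mse_before)  # small; dominated by whichever normalized error is larger
print(mse_after)   # dominated by the delay-line error because of its large range
```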
Results provided in terms of MSE, unfortunately, are not acoustically motivated: not all errors have the same effect, since parameters affect different perceptual features, thus large errors on some parameters may not be as easily perceived as small errors on other parameters. To the best of the authors' knowledge there is no agreed method in the literature to objectively evaluate the degree of similarity of two musical instrument spectra. Previous works suggested the use of subjective listening tests [19, 20, 21, 22], but an objective way to measure this distance is still to be addressed.

In order to provide the reader with some cues on how to evaluate these results, we draw from the psychoacoustic literature, as an example, the work of Caclin et al. [23], where spectral irregularity is proposed as a salient feature in sound similarity rating. Spectral irregularity is modelled, in their work, as the attenuation of the even harmonics in dB (EHA). The perceptual extremes are chosen to be a tone with a regular spectral decay and a tone with all even harmonics attenuated by 8 dB. The mean squared error calculated between these two tones (HMSE) over the first 20 harmonics (as done in their work) is 32 dB. In our experiments, results vary greatly, depending on the pipe sounds to be modeled by the CNN. As an example, Figure 3 shows time and frequency plots of two experiments. They both present two A4 signals created by the physical model with two different parameter configurations hand-crafted by an expert musician, called respectively "Stentor" and "HW-DE". The peak amplitudes of the harmonics for the tones in Figure 3 are evaluated in terms of HMSE for the first 10, 20 and 35 harmonics in Table 1.³

            HMSE10    HMSE20    HMSE35
Stentor     5.2 dB    9.4 dB    18.9 dB
HW-DE       0.3 dB    10.1 dB   12.3 dB

Table 1: Harmonics MSE (HMSE) for the first 10, 20 and all 35 harmonics for the tones shown in Figure 3.

³ The sampling frequency of the tones is 31250 Hz, thus 35 is the highest harmonic for an A4 tone.
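A computation along the following lines can, in principle, reproduce such HMSE figures. This is a sketch of our reading of the metric (mean squared difference between the dB peak amplitudes of the target and estimated tones over the first K harmonics); the exact peak-picking procedure is not detailed in the text, so the search band and windowing are assumptions.

```python
import numpy as np

F0_A4 = 440.0   # fundamental of the A4 test tones
FS = 31250      # sampling frequency of the tones (see footnote 3)

def harmonic_peaks_db(x: np.ndarray, num_harmonics: int) -> np.ndarray:
    """Peak magnitude (dB) near each harmonic of F0_A4, from one windowed FFT."""
    spectrum = 20.0 * np.log10(np.abs(np.fft.rfft(x * np.hanning(len(x)))) + 1e-12)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / FS)
    peaks = []
    for k in range(1, num_harmonics + 1):
        band = np.abs(freqs - k * F0_A4) < F0_A4 / 2  # nearest-harmonic search band
        peaks.append(spectrum[band].max())
    return np.array(peaks)

def hmse(target: np.ndarray, estimate: np.ndarray, num_harmonics: int) -> float:
    """Mean squared difference of harmonic peak amplitudes (in dB) of two tones."""
    t = harmonic_peaks_db(target, num_harmonics)
    e = harmonic_peaks_db(estimate, num_harmonics)
    return float(np.mean((t - e) ** 2))
```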
The first tone, shown in Figure 3(a), has a spectrum with a good match for the first harmonics, but with some outliers and a generally bad match for harmonics above the 12th. The latter, shown in Figure 3(b), has a good match, especially in its first 10 partials, but the error rises with the higher partials, especially from the 12th up. This is reflected by an HMSE10 of 5.2 dB vs. 0.3 dB and an error on the whole spectrum (HMSE35) of 18.9 dB vs. 12.3 dB. The HMSE20 values for the two tones do not differ significantly, due to the averaging over spectral ranges with different results, but we report them so that they can be compared to the experiments of Caclin et al. The HMSE20 values lie somewhere between the two perceptual extremes, "same" and "different", tending more towards the former. Informal listening tests conducted with expert musicians suggest that the estimated "Stentor" tone does not match the original well, while the "HW-DE" tone does match sufficiently. We hypothesize that the spectral matching of the first harmonics is more relevant in psychoacoustic terms to assess similarity, but we leave this to more systematic studies as future work. The tones are made available to the reader online.⁴

⁴ http://a3lab.dii.univpm.it/research/10-projects/84-ml-phymod
[Figure 3: two panels, (a) and (b), each showing a magnitude spectrum in dB over frequency in Hz (0-16 kHz) and two waveforms over time in samples.]
Figure 3: Spectra and harmonic content for two A4 tones from (a) Principal stop named “Stentor”, and (b) Principal stop named “HW-DE”.
The gray lines and crosses show the spectrum and the harmonic peaks of ŝ(n), while the black line and dots show the spectrum and the
harmonic peaks of s(n). In the waveform plots, the upper ones are obtained with the target parameters, while the lower ones are obtained
with the estimated parameters.
[Figure 4: training and validation MSE loss over the training epochs.]

[...] validation is required with data sampled from a real pipe organ for further assessment and to evaluate the robustness of this method to noise, reverberation and such.
view and update," Journal of New Music Research, vol. 33, no. 3, pp. 283–304, 2004.

[3] Stefano Zambon, Leonardo Gabrielli, and Balazs Bank, "Expressive physical modeling of keyboard instruments: From theory to implementation," in Audio Engineering Society Convention 134. Audio Engineering Society, 2013.

[4] Janne Riionheimo and Vesa Välimäki, "Parameter estimation of a plucked string synthesis model using a genetic algorithm with perceptual fitness calculation," EURASIP Journal on Advances in Signal Processing, vol. 2003, no. 8, 2003.

[5] Vasileios Chatziioannou and Maarten van Walstijn, "Estimation of clarinet reed parameters by inverse modelling," Acta Acustica united with Acustica, vol. 98, no. 4, pp. 629–639, 2012.

[6] Carlo Drioli and Davide Rocchesso, "Learning pseudo-physical models for sound synthesis and transformation," in Systems, Man, and Cybernetics, 1998 IEEE International Conference on. IEEE, 1998, vol. 2, pp. 1085–1090.

[7] Aurelio Uncini, "Sound synthesis by flexible activation function recurrent neural networks," in Italian Workshop on Neural Nets. Springer, 2002, pp. 168–177.

[8] Alvin W. Y. Su and Liang San-Fu, "Synthesis of plucked-string tones by physical modeling with recurrent neural networks," in Multimedia Signal Processing, 1997, IEEE First Workshop on. IEEE, 1997, pp. 71–76.

[9] Alvin Wen-Yu Su and Sheng-Fu Liang, "A new automatic IIR analysis/synthesis technique for plucked-string instruments," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 7, pp. 747–754, 2001.

[10] Ali Taylan Cemgil and Cumhur Erkut, "Calibration of physical models using artificial neural networks with application to plucked string instruments," Proceedings of the Institute of Acoustics, vol. 19, pp. 213–218, 1997.

[11] Katsutoshi Itoyama and Hiroshi G. Okuno, "Parameter estimation of virtual musical instrument synthesizers," in Proc. of the International Computer Music Conference (ICMC), 2014.

[12] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," in 5th International Conference on Learning Representations (ICLR 2017), 2017.

[13] C. Zinato, "Method and electronic device used to synthesise the sound of church organ flue pipes by taking advantage of the physical modeling technique of acoustic instruments," Oct. 28 2008, US Patent 7,442,869.

[14] N. H. Fletcher, "Sound production by organ flue pipes," The Journal of the Acoustical Society of America, vol. 60, no. 4, pp. 926–936, 1976.

[15] Neville H. Fletcher and Suszanne Thwaites, "The physics of organ pipes," Scientific American, vol. 248, no. 1, pp. 94–103, 1983.

[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.

[17] Vinod Nair and Geoffrey E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.

[18] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[19] Simon Wun and Andrew Horner, "Evaluation of weighted principal-component analysis matching for wavetable synthesis," J. Audio Engineering Society, vol. 55, no. 9, pp. 762–774, 2007.

[20] H. M. Lehtonen, H. Penttinen, J. Rauhala, and V. Välimäki, "Analysis and modeling of piano sustain-pedal effects," J. Acoustical Society of America, vol. 122, pp. 1787–1797, 2007.

[21] Brahim Hamadicharef and Emmanuel Ifeachor, "Objective prediction of sound synthesis quality," 115th Convention of the AES, New York, USA, p. 8, October 2003.

[22] L. Gabrielli, S. Squartini, and V. Välimäki, "A subjective validation method for musical instrument emulation," in 131st Audio Eng. Soc. Convention, New York, 2011.

[23] Anne Caclin, Stephen McAdams, Bennett K. Smith, and Suzanne Winsberg, "Acoustic correlates of timbre space dimensions: A confirmatory study using synthetic tones," The Journal of the Acoustical Society of America, vol. 118, no. 1, pp. 471–482, 2005.