
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017

INTRODUCING DEEP MACHINE LEARNING FOR PARAMETER ESTIMATION IN PHYSICAL MODELLING

Leonardo Gabrielli∗, A3LAB, Università Politecnica delle Marche, Ancona, IT, l.gabrielli@univpm.it
Stefano Tomassetti∗, A3LAB, Università Politecnica delle Marche, Ancona, IT, tomassetti.ste@gmail.com
Stefano Squartini, A3LAB, Università Politecnica delle Marche, Ancona, IT, s.squartini@univpm.it
Carlo Zinato, Viscount International SpA, c.zinato@viscount.it

∗ This work is partly supported by Viscount International SpA.

ABSTRACT

One of the most challenging tasks in physically-informed sound synthesis is the estimation of model parameters to produce a desired timbre. Automatic parameter estimation procedures have been developed in the past for some specific parameters or application scenarios but, up to now, no approach has proved applicable to a wide variety of use cases. This paper provides a general solution to the parameter estimation problem based on a supervised convolutional machine learning paradigm. The described approach can be classified as "end-to-end" and thus requires no specific knowledge of the model itself. Furthermore, parameters are learned from data generated by the model, requiring no effort in the preparation and labeling of the training dataset. To provide a qualitative and quantitative analysis of the performance, the method is applied to a patented digital waveguide pipe organ model, yielding very promising results.

1. INTRODUCTION

Almost all sound synthesis techniques require a nontrivial effort in the selection of parameters, to allow for expressiveness and to obtain a specific sound. The choice of parameters depends on tone pitch, control dynamics, interpretation, and aesthetic criteria, with the aim of producing all the nuances required by musicians and their taste. Here, interest is given to so-called physically-informed sound synthesis, a family of algorithms [1, 2] usually inspired by acoustic physical systems or derived by transforming their continuous-time formulation into the digital domain. Such acoustic systems (e.g. strings, bores, etc.) often require simplifying hypotheses to limit the modeling complexity and to separate the acoustic phenomenon into different components. Notwithstanding this, the number of micro-parameters that control the sound and its evolution may be extremely large (see e.g. [3]), and if the effects of the parameters are intertwined, the estimation effort may grow.

In the past, some algorithms have been proposed to estimate some of the parameters of a physical model in an algorithmic fashion (see e.g. [4, 5]). These, however, require specific knowledge of physics, digital signal processing and psychoacoustics in order to provide an estimate in a white-box approach. Furthermore, specific estimation algorithms must be devised for each parameter. To solve these issues, a black-box approach could be undertaken to provide a good estimate of all the parameters at once. The goal of this preliminary work is to support the thesis that adequate machine learning techniques can be identified to satisfactorily estimate a whole set of model parameters without specific physical or model knowledge.

In the past, some works extended the use of early machine learning techniques to the parametrization of nonlinearities in physical models [6, 7], or employed nonlinear recursive digital filters as physical models together with parameter estimation techniques borrowed from the machine learning literature for the estimation of the coefficients [8, 9]. Closer to the spirit of this paper is the work of Cemgil et al. on the calibration of a simple physical model employing artificial neural networks [10]; to the best of our knowledge, however, that work saw no continuation. Recently, another computational intelligence approach for the estimation of a synthesizer's parameters using a multiple linear regression model has been proposed, employing hand-crafted features [11]. To the best of our knowledge, however, no further attempt has been made at the estimation of a physical model's parameters for sound synthesis employing other machine learning approaches. From this point of view, the swift development of deep machine learning techniques, and the exciting results they have obtained in a plethora of application scenarios, including musical representation and regression [12], suggest their application to the problem at hand. Following the recent advances of deep neural networks in audio applications, we propose here an end-to-end approach to the parameter estimation of acoustic physical models for sound synthesis based on Convolutional Neural Networks (CNN). The training can be conducted in a supervised fashion, since the model itself can provide audio and ground-truth parameters in an automated fashion. To evaluate the approach, this concept is applied to a valuable use case, i.e. a commercial flue pipe organ physical model, detailed in [13]. The estimation yields promising results which call for more research work.

The paper outline follows. Section 2 provides a mathematical formulation of the problem and the machine learning techniques employed to provide a general solution to it. Section 3 describes a real-world use case for validation of the proposed techniques.


Section 4 reports the implementation details and the experiments conducted, while Section 5 provides the results of these experiments and discusses them. Finally, in Section 6, conclusions are drawn and open avenues for research are suggested.

2. THE PROPOSED METHOD

A physical model solves a set of differential equations that describe a physical system and requires a set of parameters θ to generate a discrete-time audio sequence s(n). The goal of the model is to approximate acoustic signals generated by the physical system in some perceptual sense. If the model provides a mapping from the parameters to the discrete sequence s(n), the problem of estimating the parameters θ that yield a specific audio sequence identified as a target (e.g. an audio sequence sampled from the physical system we are approximating) is equivalent to finding the model inverse. Finding an inverse mapping (or an approximation thereof) for the model is a challenging task, and a first necessary condition is the existence of the inverse for a given s(n). Usually, however, in physical modelling applications, requirements are less strict: generally it is only expected that audio signals match in perceptual or qualitative terms, rather than on a sample-by-sample basis. This means that, although a signal r(n) may not be obtainable from the model for any θ, a sufficiently close estimate r̂(n) may be. Evaluating the distance between the two signals in psychoacoustic terms is a rather complex task and is out of the scope of this work.

Artificial neural networks, and more specifically the recently introduced deep neural networks, are well-established tools for solving a number of inverse modelling problems. Here, we propose the application of a convolutional neural network that, provided with an audio signal of maximum length L in a suitable time-frequency representation, can estimate model parameters θ̂ that, fed to the physical model, yield an audio signal ŝ(n) close to s(n).

To achieve this, the inverse of the model must be learned employing deep machine learning techniques. If a supervised training approach is employed, the network must be fed with audio sequences and the related model parameters, also called the target. The production of a dataset D of such tuples, D = {(θi, si(n)), i = 1, ..., N}, allows the network to learn the mapping that connects them. The production of the dataset is often a lengthy task and may require human effort. However, in this application, the model is known and, once implemented, it can be employed to automatically generate a dataset D in order to train the neural network.
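Since the model is available in executable form, the dataset generation loop can be fully automated. The following is a minimal sketch of such a loop, assuming a hypothetical `synthesize(theta, pitch)` wrapper around the physical model; note that in this work the parameter vectors come from a database of hand-crafted stops (Section 4), so the uniform sampling shown here is only illustrative.

```python
# Minimal sketch of automated dataset generation (Section 2), assuming a
# hypothetical synthesize(theta, pitch) wrapper around the physical model.
# Uniform sampling of theta is illustrative; this work draws parameters
# from a database of hand-crafted stops instead (Section 4).
import numpy as np

def sample_thetas(num_examples, num_params, seed=0):
    rng = np.random.default_rng(seed)
    # One parameter vector per example, drawn uniformly in [-1, 1].
    return rng.uniform(-1.0, 1.0, size=(num_examples, num_params))

def generate_dataset(synthesize, thetas, pitch):
    """Render a labelled dataset D = {(theta_i, s_i(n))} through the model."""
    return [(theta, synthesize(theta, pitch)) for theta in thetas]
```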
The neural network architecture proposed here allows for end-to-end learning and is based on convolutional layers. Convolutional neural networks have received attention from a large number of research communities and have found their way into commercial applications, especially in the fields of image processing and classification. They are also used with audio signals, where, usually, the signal is provided to the CNN in a suitable time-frequency representation, obtained by means of a Short-Time Fourier Transform (STFT) with appropriate properties of time-frequency localization. The architecture of a CNN is composed of several layers of the following form:

Z^(m) = h(σ(Q^(m))),   (1)
Q^(m) = W^(m) ∗ Z^(m−1),   (2)

with Z^(0) ≡ X, where M denotes the total number of layers, W^(m), m = 1, ..., M are the filter weights to be learned, σ(·) is a non-linear sigmoid activation function, Z^(m−1) is the output of layer m − 1, called a feature map, Q^(m) is the result of the convolution on the previous feature map, and h(·) is a pooling function that reduces the feature map dimensionality. After M convolutional layers, one or more fully connected layers are added. The final layer has size p and outputs an estimate of the model parameters θ̂.

Learning is conducted according to an update rule based on the evaluation of a loss function, such as

ℓ(W, e) = ‖θ − θ̂(e)‖²,   (3)

where e is the training epoch. Training is iterated until a convergence criterion is met or a maximum number of epochs has passed. To avoid overfitting and reduce training times, early stopping by validation can be performed, which consists in periodically evaluating the loss (the validation loss) against a validation set, i.e. a part of the dataset that is not used for training and is, hence, new to the network. Even if the training loss may still be improving, if the validation loss does not improve after some training epochs, the training may stop, avoiding overfitting the network.

Finally, once the network is trained, it can be fed with novel audio sequences to estimate physical model parameters that obtain a result close to the original. In the present work we employ additional audio sequences generated by the model in order to measure the distance in the parameter space between the parameters θi and θ̂i. If non-labelled audio sequences are employed (e.g. sequences sampled from the real world), it is not straightforward to validate the generated result; that is why in the present work no attempt has been made to evaluate the estimation performance of the network with real-world signals.
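As an illustration of Eqs. (1)–(3), the following Keras sketch stacks convolutional layers (convolution, nonlinearity σ, pooling h) followed by a fully connected layer and a final linear layer of size p. Keras 2 syntax is used; channel counts and the dense-layer width are placeholders, not the values selected by the hyperparameter search reported later.

```python
# A sketch of the CNN of Eqs. (1)-(2) in Keras (Keras 2 syntax; the
# original work used Keras with a Theano backend). Layer sizes here are
# illustrative placeholders.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def build_cnn(input_shape, p, conv_channels=(16, 16, 32, 32), dense_units=512):
    model = Sequential()
    for i, ch in enumerate(conv_channels):
        kwargs = {'input_shape': input_shape} if i == 0 else {}
        # Eq. (2): Q(m) = W(m) * Z(m-1); 'tanh' plays the role of sigma
        model.add(Conv2D(ch, (3, 3), activation='tanh', **kwargs))
        # h(.): pooling reduces the feature map dimensionality (Eq. (1))
        model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(dense_units, activation='tanh'))  # fully connected layer
    model.add(Dense(p))  # final layer of size p: the estimate theta_hat
    model.compile(optimizer='adam', loss='mse')  # squared-error loss, Eq. (3)
    return model
```

Once trained, calling `model.predict()` on the time-frequency representation of a novel tone returns θ̂, which can then be fed back to the physical model for resynthesis.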


Figure 1: Overview of the proposed system including the neural network for parameter estimation and the physical model for sound
synthesis.

3. USE CASE

The method described in the previous section has been applied to a specific use case of interest, i.e. a patented digital pipe organ physical model. A pipe organ is a rather complex system [14, 15], providing a challenging scenario for physical modelling itself. This specific model, already employed on a family of commercial digital pipe organs, exposes 58 macro-parameters to be estimated for each key, some of which are intertwined in a nonlinear fashion and are acoustically non-orthogonal (i.e. they jointly affect some acoustic features of the resulting tone).

We introduce here some key terms for later use. A pipe organ has one or more keyboards (manuals or divisions), each of which can play several stops, i.e. sets of pipes, typically one or more per key, which can be activated or deactivated at once by means of a drawstop. When a stop is activated, air is ready to be conveyed to the embouchure of each pipe, and when a key is pressed, a valve is opened to allow air to flow into the pipe. Each stop has a different timbre and pitch (generally the pitch of the central C is expressed in feet, measuring the pipe length). From our standpoint, the concept of a stop is very important, since each key in a stop will sound similar to the neighboring ones in terms of timbre and envelope, and each key will trigger different stops which may have similar pitch but different timbre. In a pipe organ it can be expected that pipes in a stop have consistent construction features (e.g. materials, geometry, etc.), and a physical model that mimics that pipe stop may have correlated features along the keys; but this is not an assumption that can be made, so it is necessary to estimate the parameters for each key in a stop.

The physical model employed in this work is meant to emulate the sound of flue pipes and is described in detail in the related patent. To summarize, it consists of three main parts:

1. exciter: models the wind jet oscillation that is created in the embouchure and gives rise to an air pressure fluctuation;
2. resonator: a digital waveguide structure that simulates the bore;
3. noise model: a stochastic component that simulates the air noise modulated by the wind jet.

The parameters involved in the sound design differ widely in range and meaning, and include e.g. digital filter coefficients, nonlinear function coefficients, gains, etc. The diverse nature of the parameters requires a normalization step, which is conducted on the whole training set and maps each parameter into the range [-1, 1], in order for the CNN to learn all parameters employing the same arithmetic. A de-normalization step is thus required to remap the parameters to their original range (a minimal sketch is given below).

Figure 1 provides an overview of the overall system for parameter estimation and sound generation, including the proposed neural network and the physical model.
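A minimal sketch of the normalization step described above follows. The per-parameter ranges are assumed to be taken from the training set; the de-normalization inverts the mapping before the estimated parameters are handed back to the model. Function names are illustrative.

```python
# Per-parameter min/max normalization to [-1, 1], fitted on the whole
# training set as described above. Names are illustrative.
import numpy as np

def fit_ranges(train_thetas):
    """train_thetas: array of shape (num_examples, num_params)."""
    return train_thetas.min(axis=0), train_thetas.max(axis=0)

def normalize(theta, lo, hi):
    return 2.0 * (theta - lo) / (hi - lo) - 1.0       # [lo, hi] -> [-1, 1]

def denormalize(theta_norm, lo, hi):
    return (theta_norm + 1.0) * 0.5 * (hi - lo) + lo  # [-1, 1] -> [lo, hi]
```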
4. IMPLEMENTATION

The CNN and the machine learning framework have been implemented as a Python application employing the Keras¹ library with Theano² as a backend, running on an Intel i7 Linux machine equipped with two GTX 970 graphics processing units. The physical model is implemented both as an optimized DSP algorithm and as a PC application. The application has been employed in the course of this work and has been modified to allow producing batches of audio sequences and labels for each key in a stop (e.g. to produce the dataset, given some specifications). Each audio sequence contains a few seconds of a tone of specific pitch with given parameters.

A dataset of 30 Principal pipe organ stops, each composed of 74 audio files, one per note, has been created, taking the parameters from a database of pre-existing stops hand-crafted by expert musicians to mimic different historic styles and organ builders. The dataset has been split 90%/10% into training and validation sets respectively, for a total of 1998 samples for the former and 222 for the latter. The only pre-processing conducted is the normalization of the parameters and the extraction of the STFT. A trade-off has been selected in terms of resolution and hop size to allow good tracking of harmonic peaks and the attack transient. Figure 2 shows the input STFT for an A4 tone used for training the network.

Figure 2: The STFT for an organ sound in the training set. The tone is an A4.
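The paper does not report the exact STFT settings, only that window length and hop size trade off the tracking of harmonic peaks against the attack transient. The sketch below is an illustrative front-end using assumed values; only the sampling frequency (31250 Hz, reported in Section 5) is taken from the paper.

```python
# Illustrative STFT front-end for the CNN input; window length and hop
# are assumptions, not the values used in the paper. fs matches the
# tone sampling frequency reported in Section 5 (31250 Hz).
import numpy as np
from scipy.signal import stft

def stft_features(audio, fs=31250, nperseg=1024, hop=256):
    _, _, Z = stft(audio, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    mag_db = 20.0 * np.log10(np.abs(Z) + 1e-6)  # log-magnitude in dB
    return mag_db[..., np.newaxis]              # trailing channel axis for Conv2D
```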
The CNN architecture is composed of up to 6 convolutional layers, with optional batch normalization [16] and max pooling, up to 3 fully connected layers and a final activation layer. For the training, stochastic gradient descent (SGD), Adam and Adamax optimizers have been tested. The training made use of early stopping on the validation set. A coarse random search has been performed first to pinpoint some of the best-performing hyperparameters; a finer random search has later been conducted, keeping the best hyperparameters from the previous search constant. Tests have been conducted with other stops not belonging to the training set and averaged over all the keys in the stops.

¹ http://keras.io
² http://deeplearning.net/software/theano/
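The random search and the early-stopping procedure can be sketched as below, reusing the hypothetical `build_cnn` from Section 2. The candidate values are examples in the spirit of those reported in Table 2, and the fixed settings (Adamax optimizer, learning rate 1E−5) follow Section 5; the patience value and trial count are assumptions.

```python
# Sketch of a random hyperparameter search with early stopping on the
# validation loss. build_cnn is the sketch from Section 2; candidate
# values are illustrative, in the spirit of those reported in Table 2.
import random
from keras.callbacks import EarlyStopping
from keras.optimizers import Adamax

def random_search(x_tr, y_tr, x_val, y_val, p, trials=20, seed=0):
    random.seed(seed)
    best_model, best_val = None, float('inf')
    for _ in range(trials):
        channels = random.choice([(16, 16), (4, 6, 8, 10), (16, 16, 32, 32)])
        dense = random.choice([128, 256, 512, 1024])
        batch = random.choice([25, 40, 50])
        model = build_cnn(x_tr.shape[1:], p, channels, dense)
        model.compile(optimizer=Adamax(lr=1e-5), loss='mse')  # fixed settings
        stop = EarlyStopping(monitor='val_loss', patience=10)  # early stopping
        hist = model.fit(x_tr, y_tr, batch_size=batch, epochs=200,
                         validation_data=(x_val, y_val),
                         callbacks=[stop], verbose=0)
        val = min(hist.history['val_loss'])
        if val < best_val:
            best_model, best_val = model, val
    return best_model, best_val
```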


5. RESULTS

The loss used in training, validation and testing is the Mean Square Error (MSE) calculated at the output of the network with respect to the target, before de-normalization. Results are therefore evaluated in terms of MSE.

Table 2 reports the 15 best hyperparameter combinations from the fine random search. The following activation function combinations are reported in Table 2:

1. A: employs tanh for all layers;
2. B: employs tanh for all layers besides the last one, which uses a Rectified Linear Unit (ReLU) [17];
3. C: employs ReLU functions for all layers;
4. all other combinations of the two aforementioned activation functions obtained higher MSE scores and are not included here.

Results are provided against a test set of 222 samples from three organ stops, and all combinations share the same learning rate (1E−5), momentum (0.9), pool sizes (2x2 for each convolutional layer), receptive field sizes (3x3 for each convolutional layer) and optimizer (Adamax [18]). These fixed parameters have been selected as the best ones after the coarse random search.

Activations class | Minibatch size | Internal layer sizes       | MSE
A                 | 40             | (16, 16, 512, 58)          | 0.261
B                 | 50             | (4, 6, 8, 10, 512, 58)     | 0.203
A                 | 40             | (16, 16, 32, 32, 512, 58)  | 0.164
A                 | 40             | (16, 16, 32, 32, 512, 58)  | 0.139
A                 | 25             | (16, 16, 32, 32, 1024, 58) | 0.161
A                 | 40             | (16, 16, 32, 32, 1024, 58) | 0.266
A                 | 25             | (16, 16, 32, 32, 512, 58)  | 0.156
A                 | 40             | (16, 16, 32, 32, 1024, 58) | 0.252
A                 | 50             | (4, 6, 8, 10, 512, 58)     | 0.166
A                 | 40             | (16, 16, 32, 32, 256, 58)  | 0.179
A                 | 25             | (2, 2, 4, 4, 128, 58)      | 0.254
C                 | 740            | (16, 16, 32, 32, 512, 58)  | 0.252
B                 | 50             | (4, 6, 8, 10, 512, 58)     | 0.214
C                 | 740            | (16, 16, 32, 32, 512, 58)  | 0.257
B                 | 40             | (16, 16, 512, 58)          | 0.179

Table 2: The best 15 results of the fine random hyperparameter search. Activation classes are described in the text. The MSE is evaluated before de-normalization, thus all parameters have the same weight. Note: all layers are convolutional with kernel count as indicated, except for the second-to-last, which is a fully connected layer. The output layer has fixed size equal to the number of model parameters.

Figure 4 shows the training and validation loss plots for the first 200 training epochs for the first combination in Table 2. The loss is based on the MSE for all parameters before de-normalization. This means all parameters contribute to the MSE with the same weight, which makes results clearer to evaluate. Indeed, if the MSE were evaluated after de-normalization, parameters with larger excursion ranges would have a larger effect on the loss (e.g. a delay line length versus a digital filter pole coefficient). Validation early stopping is performed when the minimum validation loss is achieved, to prevent overfitting and reduce training times. In Figure 4, e.g., the validation loss minimum (0.027) occurs at epoch 122, while the training loss minimum (0.001) occurs at epoch 198. Two spectra and their waveforms are shown in Figure 3, showing the similarity of the original tone and the estimated one, both obtained from the flue pipe model.

Results provided in terms of MSE, unfortunately, are not acoustically motivated: not all errors have the same effect, since parameters affect different perceptual features, thus large errors on some parameters may not be as easily perceived as small errors on other parameters. To the best of the authors' knowledge there is no agreed method in the literature to objectively evaluate the degree of similarity of two musical instruments' spectra. Previous works suggested the use of subjective listening tests [19, 20, 21, 22], but an objective way to measure this distance is still to be addressed.

In order to provide the reader with some cues on how to evaluate these results, we draw, as an example, on the work of Caclin et al. in the psychoacoustic literature [23], where spectral irregularity is proposed as a salient feature in sound similarity rating. Spectral irregularity is modelled, in their work, as the attenuation of even harmonics in dB (EHA). The perceptual extremes are chosen to be a tone with a regular spectral decay and a tone with all even harmonics attenuated by 8 dB. The mean squared error calculated between these two tones (HMSE) over the first 20 harmonics (as done in their work) is 32 dB. In our experiments, results vary greatly, depending on the pipe sounds to be modeled by the CNN. As an example, Figure 3 shows time and frequency plots of two experiments. Both present A4 signals created by the physical model with two different parameter configurations hand-crafted by an expert musician, called respectively "Stentor" and "HW-DE". The peak amplitudes of the harmonics for the tones in Figure 3 are evaluated in terms of HMSE for the first 10, 20 and all 35 harmonics in Table 1.³

        | HMSE10 | HMSE20  | HMSE35
Stentor | 5.2 dB | 9.4 dB  | 18.9 dB
HW-DE   | 0.3 dB | 10.1 dB | 12.3 dB

Table 1: Harmonics MSE (HMSE) for the first 10, 20 and all 35 harmonics for the tones shown in Figure 3.

The first tone, shown in Figure 3(a), has a spectrum with a good match for the first harmonics, but with some outliers and a generally bad match for harmonics higher than the 12th. The latter, shown in Figure 3(b), has a good match, especially in its first 10 partials, but the error rises with higher partials, especially from the 12th up. This is reflected by an HMSE10 of 5.2 dB vs. 0.3 dB and an error on the whole spectrum (HMSE35) of 18.9 dB vs. 12.3 dB. HMSE20 values for the two tones do not differ significantly, due to the averaging done over spectral ranges with different results, but we leave them to the reader so that they can be compared to the experiments of Caclin et al. The HMSE20 values are somewhere between the two extremes, "same" and "different", tending more towards the former. Informal listening tests conducted with expert musicians suggest that the estimated "Stentor" tone does not match the original well, while the "HW-DE" matches sufficiently. We hypothesize that the spectral matching of the first harmonics is more relevant in psychoacoustic terms to assess similarity, but we leave this to more systematic studies as future work. The tones are made available to the reader online.⁴

³ The sampling frequency of the tones is 31250 Hz, thus 35 is the highest harmonic for an A4 tone.
⁴ http://a3lab.dii.univpm.it/research/10-projects/84-ml-phymod


[Figure 3: panels (a) and (b), each showing a magnitude spectrum (dB) over 0–16000 Hz and a pair of waveforms over 2000 samples.]

Figure 3: Spectra and harmonic content for two A4 tones from (a) the Principal stop named "Stentor", and (b) the Principal stop named "HW-DE". The gray lines and crosses show the spectrum and the harmonic peaks of ŝ(n), while the black lines and dots show the spectrum and the harmonic peaks of s(n). In the waveform plots, the upper ones are obtained with the target parameters, while the lower ones are obtained with the estimated parameters.

[Figure 4: MSE loss plotted against training epoch for the first 200 epochs.]

Figure 4: Training (solid line) and validation (dashed line) loss for the best combination reported in Table 2. Note that the validation loss minimum (0.027) occurs at epoch 122, while the training loss minimum (0.001) occurs at epoch 198.

6. CONCLUSIONS

In this paper a general and flexible machine learning paradigm is proposed and applied to the problem of estimating the parameters of a physical model for sound synthesis. To validate the idea, a specific use case of a flue pipe physical model has been employed. Results in terms of MSE are good and the tone spectra match well in terms of harmonic content, although results vary. Such results, coming from a real-world application scenario, motivate the authors in believing that a machine learning paradigm can be employed with success for the problem at hand. Nonetheless, this first achievement calls for more research work. First of all, validation is required with data sampled from a real pipe organ, for further assessment and to evaluate the robustness of this method to noise, reverberation and the like.

During the evaluation of the results, it has been discovered that results may vary greatly depending on the stop's acoustic character. A rigorous approach to machine learning requires understanding whether the training set, which is a sampling of the probability distribution of all flue pipe stops obtainable from the model, is representative of that probability distribution. Furthermore, it can be expected that the organ stops that can be obtained from the model are a subset of all organ stops that could physically be built, due to model limitations and simplifying hypotheses. On the other hand, due to its digital implementation, the model could circumvent some physical limitations of real flue pipes, thus yielding stops that are not physically feasible. This calls for a better understanding of how different stops are related to each other in the modelled and the physical realms, in order to understand, before trying the machine learning approach, whether satisfying results can be obtained. As a last remark, these are general issues that apply also to other physical models and musical instruments.

7. REFERENCES


[1] Vesa Välimäki, Jyri Pakarinen, Cumhur Erkut, and Matti Karjalainen, "Discrete-time modelling of musical instruments," Reports on Progress in Physics, vol. 69, no. 1, pp. 1, 2005.

[2] Julius O. Smith, "Virtual acoustic musical instruments: Review and update," Journal of New Music Research, vol. 33, no. 3, pp. 283–304, 2004.

[3] Stefano Zambon, Leonardo Gabrielli, and Balazs Bank, "Expressive physical modeling of keyboard instruments: From theory to implementation," in Audio Engineering Society Convention 134. Audio Engineering Society, 2013.

[4] Janne Riionheimo and Vesa Välimäki, "Parameter estimation of a plucked string synthesis model using a genetic algorithm with perceptual fitness calculation," EURASIP Journal on Advances in Signal Processing, vol. 2003, no. 8, 2003.

[5] Vasileios Chatziioannou and Maarten van Walstijn, "Estimation of clarinet reed parameters by inverse modelling," Acta Acustica united with Acustica, vol. 98, no. 4, pp. 629–639, 2012.

[6] Carlo Drioli and Davide Rocchesso, "Learning pseudo-physical models for sound synthesis and transformation," in 1998 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 1998, vol. 2, pp. 1085–1090.

[7] Aurelio Uncini, "Sound synthesis by flexible activation function recurrent neural networks," in Italian Workshop on Neural Nets. Springer, 2002, pp. 168–177.

[8] Alvin W. Y. Su and San-Fu Liang, "Synthesis of plucked-string tones by physical modeling with recurrent neural networks," in IEEE First Workshop on Multimedia Signal Processing, 1997. IEEE, 1997, pp. 71–76.

[9] Alvin Wen-Yu Su and Sheng-Fu Liang, "A new automatic IIR analysis/synthesis technique for plucked-string instruments," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 7, pp. 747–754, 2001.

[10] Ali Taylan Cemgil and Cumhur Erkut, "Calibration of physical models using artificial neural networks with application to plucked string instruments," Proceedings of the Institute of Acoustics, vol. 19, pp. 213–218, 1997.

[11] Katsutoshi Itoyama and Hiroshi G. Okuno, "Parameter estimation of virtual musical instrument synthesizers," in Proc. of the International Computer Music Conference (ICMC), 2014.

[12] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," in 5th International Conference on Learning Representations (ICLR 2017), 2017.

[13] C. Zinato, "Method and electronic device used to synthesise the sound of church organ flue pipes by taking advantage of the physical modeling technique of acoustic instruments," Oct. 28, 2008, US Patent 7,442,869.

[14] N. H. Fletcher, "Sound production by organ flue pipes," The Journal of the Acoustical Society of America, vol. 60, no. 4, pp. 926–936, 1976.

[15] Neville H. Fletcher and Suszanne Thwaites, "The physics of organ pipes," Scientific American, vol. 248, no. 1, pp. 94–103, 1983.

[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.

[17] Vinod Nair and Geoffrey E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.

[18] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[19] Simon Wun and Andrew Horner, "Evaluation of weighted principal-component analysis matching for wavetable synthesis," Journal of the Audio Engineering Society, vol. 55, no. 9, pp. 762–774, 2007.

[20] H. M. Lehtonen, H. Penttinen, J. Rauhala, and V. Välimäki, "Analysis and modeling of piano sustain-pedal effects," Journal of the Acoustical Society of America, vol. 122, pp. 1787–1797, 2007.

[21] Brahim Hamadicharef and Emmanuel Ifeachor, "Objective prediction of sound synthesis quality," in 115th Convention of the Audio Engineering Society, New York, USA, October 2003.

[22] L. Gabrielli, S. Squartini, and V. Välimäki, "A subjective validation method for musical instrument emulation," in 131st Audio Engineering Society Convention, New York, 2011.

[23] Anne Caclin, Stephen McAdams, Bennett K. Smith, and Suzanne Winsberg, "Acoustic correlates of timbre space dimensions: A confirmatory study using synthetic tones," The Journal of the Acoustical Society of America, vol. 118, no. 1, pp. 471–482, 2005.
