Pitch Detection Algorithms
The aim of this paper is to present a method that improves pitch estimation accuracy and shows
high performance for both synthetic harmonic signals and musical instrument sounds. This method
employs an Artificial Neural Network of a feed-forward type. In addition, a pitch detection
algorithm optimized against octave errors and based on spectral analysis is introduced. The
proposed algorithm is very effective for signals with strong harmonic content as well as for
nearly sinusoidal signals. Experiments were performed on a variety of musical instrument sounds,
and sample results exemplifying the main issues of both engineered algorithms are shown.
1. Introduction
There are two major difficulties that most pitch detection algorithms (PDAs) have to deal with,
namely octave errors and pitch estimation accuracy [1–3]. Octave errors seem to be present in
all pitch tracking algorithms known so far; however, in different algorithms they are caused by
different properties of the input signal. In time-domain based algorithms [4–7], such as AMDF,
modified AMDF [8–10] or normalized cross correlation (NCC) [3, 7, 11], octave errors may be
caused by low energy content of odd harmonics. In some cases AMDF or autocorrelation methods
are performed first and additional information is gathered from the calculated spectrum in
order to decrease the possibility of estimation errors [12, 13], resulting in more accurate
pitch tracking. Such operations usually require a higher computational cost and larger
block sizes than PDAs working in the time domain. In the frequency domain, errors are
caused mostly by low energy content of the lower order harmonics. In cepstral [2] as
well as in autocorrelation of log spectrum (ACOLS) [14] analyses, problems are caused
by high energy content in the higher frequency parts of the signal. Some algorithms
operate directly on a time-frequency representation and are based on analysing trajectories of
sinusoidal components in the spectrogram (sonogram) of the signal [15, 16]. On the other
2 M. DZIUBIŃSKI and B. KOSTEK
hand, in all of the domains mentioned, estimation accuracy is limited by the number of
samples representing the analyzed peaks related to the fundamental frequency.
There is an additional problem related to pitch detection. For example, in the case of
speech signals [1, 17–20], it is very important to determine pitch almost instantaneously,
which means that the processed frames of the signal must be small. This is because voiced
fragments of speech may be very short, with rapidly varying pitch. In the case of musical
signals, voiced (pitched) fragments are relatively long and pitch fluctuations are smaller.
This property of musical signals enables the use of larger segments of the signal in the
pitch estimation procedure. For both application domains, however, an efficient pitch detection
algorithm should estimate pitch periods accurately and smoothly between successive
frames, and produce a pitch contour that has high resolution in the time domain.
The proposed pitch detection algorithm, called Spectrum Peak Analysis (SPA),
is based on analyzing peaks in the frequency domain that represent harmonics of the processed
signal. The general concept relies on the relative ease of determining pitch by observing the
signal spectrum, and especially the intervals between the partials that are present in it.
This holds even if some harmonics are absent or partially obscured by the background noise.
It should, however, be assumed that the energy of the harmonics is greater than the energy of
the background noise. The pitch contour is estimated by block processing, i.e., the signal is
divided into blocks whose widths depend on the pitch estimated for the preceding blocks, and
the overlap can be time-varying. The width of the first block is initialized to 4096 samples
and is decreased for successive blocks if the detected pitch is relatively high and can be
represented with lower spectrum resolution. Similarly, if the estimated pitch decreases in
consecutive blocks, the block width is increased to provide satisfactory spectrum resolution.
Each block is weighted by the Hann window.
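The first processing step described above (Hann weighting of a block, followed by a search for the largest spectral peak) can be sketched as follows. This is only an illustrative sketch: the function name `block_spectrum`, the 440 Hz test tone and the use of NumPy's FFT are assumptions, and the rule for adapting the block width is not reproduced because the text does not specify it.

```python
import numpy as np

def block_spectrum(block, fs):
    """Return (bin frequencies in Hz, magnitude spectrum) of one Hann-weighted block."""
    windowed = block * np.hanning(len(block))
    magnitude = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(block), d=1.0 / fs)
    return freqs, magnitude

fs = 44100
n = np.arange(4096)                       # initial block width stated in the text
block = np.sin(2 * np.pi * 440.0 * n / fs)  # hypothetical test tone
freqs, magnitude = block_spectrum(block, fs)
f_m = freqs[np.argmax(magnitude)]         # frequency of the largest spectral peak
```

With a 4096-sample block at 44100 Hz the bins are about 10.8 Hz apart, so `f_m` can only approximate the true frequency; this is the limited resolution that motivates the interpolation stage discussed later.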
fundamental frequencies, assuming that a chosen harmonic (the largest maximum of the
spectrum signal) can be the 1st, 2nd, ..., or M-th harmonic of the analyzed sound:
$$F_{fund}[i] = \frac{F_M}{i}, \qquad i = 1, \ldots, M \qquad (1)$$
where:
Ffund – vector of possible fundamental frequencies,
FM – frequency of the chosen (largest) harmonic.
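As a small illustration of Eq. (1), the candidate fundamentals can be computed by dividing the frequency of the largest peak by 1, 2, ..., M. The function name and the example values (a 1320 Hz peak, M = 4) are assumptions chosen only for this sketch.

```python
import numpy as np

def candidate_fundamentals(f_m, m):
    """Candidate fundamentals F_M / i for i = 1..M, following Eq. (1)."""
    return f_m / np.arange(1, m + 1)

# Hypothetical example: the largest peak lies at 1320 Hz and M = 4.
f_fund = candidate_fundamentals(1320.0, 4)
```

Here the candidates are 1320, 660, 440 and 330 Hz, i.e. the peak is treated in turn as the first, second, third and fourth harmonic.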
The main concept of the engineered algorithm is testing the set of K harmonics
related to vector Ffund , that are most likely to be peaks representing pitch. The value of
K is limited by FM as follows:
$$K = \mathrm{floor}\left(\frac{F_s}{M}\right) \qquad (2)$$
where:
floor (x) – returns the largest integer value not greater than x,
Fs – sampling frequency.
Based on M, Ffund vector and K, the matrix of frequencies used in analysis can be
formed in the following way:
$$F_{AM}(i, j) = F_{fund}[i] \cdot j, \qquad i = 1, \ldots, M, \quad j = 1, \ldots, K \qquad (3)$$
where:
FAM – matrix containing frequencies of M harmonics sets.
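The matrix of Eq. (3) is an outer product of the candidate fundamentals and the harmonic numbers 1, ..., K. The sketch below assumes NumPy; the function name and the small values of M and K are illustrative only.

```python
import numpy as np

def harmonic_frequency_matrix(f_fund, k):
    """Row i holds the first K harmonic frequencies of candidate fundamental i."""
    j = np.arange(1, k + 1)
    return np.outer(f_fund, j)

f_fund = 1320.0 / np.arange(1, 4)          # hypothetical candidates: 1320, 660, 440 Hz
fam = harmonic_frequency_matrix(f_fund, 3)
```

Each row of `fam` is one harmonic set to be tested; for instance, the last row is 440, 880, 1320 Hz.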
If M is significantly larger than K and most of the energy is carried by higher order harmonics
(the energy of the first K harmonics is significantly smaller than that of, for example,
harmonics K, K + 1, ..., 2 · K, or harmonics of even higher order), it is better to choose the
set of K consecutive harmonics representing the largest amount of energy. Therefore, the
frequency of the first harmonic in each set (each row of FAM) does not have to represent
the fundamental frequency. Starting frequencies of the chosen sets can be calculated in the
following way:
$$H_{maxset}[j] = \sum_{i=1}^{K} E_{H_{(i+j) \cdot F_{fund}}}, \qquad j = 0, \ldots, L - 1 \qquad (4)$$
where:
$H_{maxset}$ – vector containing the energy of K consecutive harmonics for the chosen set,
where $H_{maxset}[k]$ is the sum of the energies of K harmonics for the following frequencies:
$k \cdot F_{fund}, (k + 1) \cdot F_{fund}, \ldots, (k + K) \cdot F_{fund}$,
$E_{H_f}$ – energy of the harmonic with frequency equal to $f$,
$L$ – dimension of the $H_{maxset}$ vector: $L = \mathrm{floor}\left(\frac{F_s}{F_{fund}} - K\right)$.
The starting frequency of each set is based on the index of the maximum value of
$H_{maxset}$: $F_{start}[m] = ind_{max}[m] \cdot F_{fund}[m]$ for $m = 1, \ldots, M$.
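The search for the most energetic run of K consecutive harmonics can be sketched as a sliding sum. This is a minimal sketch under assumptions: the harmonic energies are taken as already measured (the array below is made up), the helper name is hypothetical, and the returned index is taken to be the 1-based number of the first harmonic in the winning set, which is one possible reading of the index convention in the text.

```python
import numpy as np

def best_harmonic_window(harmonic_energy, k):
    """Return the 1-based number of the harmonic that starts the K consecutive
    harmonics with the largest summed energy.

    harmonic_energy[h - 1] is assumed to hold the energy of harmonic h of one
    candidate fundamental.
    """
    sums = np.convolve(harmonic_energy, np.ones(k), mode="valid")  # sliding sums
    return int(np.argmax(sums)) + 1

# Made-up energies of harmonics 1..6 of a 110 Hz candidate; harmonics 3 to 5 dominate.
energy = np.array([0.1, 0.2, 1.0, 0.9, 0.8, 0.1])
start_harmonic = best_harmonic_window(energy, 3)
f_start = start_harmonic * 110.0
```

In this example the set starts at the third harmonic, so the first frequency of the tested set is 330 Hz rather than the 110 Hz fundamental.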
Finally, modified FAM can be formed in the following way:
$$F_{AM}(i, j) = F_{start}[i] + F_{fund}[i] \cdot (j - 1), \qquad i = 1, \ldots, M, \quad j = 1, \ldots, K \qquad (5)$$
where:
Hv [i] – value of a spectrum component for i-th frequency for the analyzed set.
If the analyzed spectrum component is not a local maximum (its left and right neighboring
samples are not both smaller than it), then it is set to 0. Additionally, if local maxima are
found in the neighboring regions of the spectrum, Hv is decreased: the values of the maxima
found are subtracted from Hv.
Neighboring regions of the spectrum surrounding the frequency FHv , representing
Hv , are limited by the following frequencies:
$$F_L = F_{H_v} - \frac{F_{fund}}{2} \qquad (7a)$$
$$F_R = F_{H_v} + \frac{F_{fund}}{2} \qquad (7b)$$
where:
FL , FR – frequency boundaries of spectrum regions surrounding FHv ,
Ffund – assumed fundamental frequency of the analyzed set.
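The local-maximum test and the subtraction of neighboring maxima within the region bounded by Eqs. (7a) and (7b) might be sketched as below. The function names, the toy spectrum and the strict-inequality peak test are assumptions; how the resulting values are combined into V over a whole harmonic set is not shown, since that formula is not available in this text.

```python
import numpy as np

def is_local_max(magnitude, b):
    """True if interior bin b is strictly larger than both of its neighbors."""
    return (0 < b < len(magnitude) - 1
            and magnitude[b] > magnitude[b - 1]
            and magnitude[b] > magnitude[b + 1])

def harmonic_evidence(magnitude, freqs, f_hv, f_fund):
    """Value of one tested component after the corrections described in the text."""
    idx = int(np.argmin(np.abs(freqs - f_hv)))      # bin nearest to the tested frequency
    h_v = magnitude[idx] if is_local_max(magnitude, idx) else 0.0
    f_l, f_r = f_hv - f_fund / 2, f_hv + f_fund / 2  # region limits, Eqs. (7a) and (7b)
    for b in np.flatnonzero((freqs >= f_l) & (freqs <= f_r)):
        if b != idx and is_local_max(magnitude, b):
            h_v -= magnitude[b]                      # penalize competing maxima
    return h_v

# Toy spectrum on a 10 Hz grid: a peak at 440 Hz and a weaker one at 500 Hz.
freqs = np.arange(0.0, 1000.0, 10.0)
magnitude = np.zeros_like(freqs)
magnitude[44] = 1.0
magnitude[50] = 0.4
score = harmonic_evidence(magnitude, freqs, 440.0, 440.0)
```

In this toy case the peak at 440 Hz is a local maximum, and the competing maximum at 500 Hz lies inside the region from 220 Hz to 660 Hz, so its value is subtracted.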
The fundamental frequency related to the largest V is assumed to be the desired
pitch of the analyzed signal. As can be observed in Figs. 1–3, three situations are possible.
In Fig. 1, the analyzed spectrum peak value is not a local
maximum, therefore it is set to 0. In addition, local maxima are detected in the surrounding
regions, which, subtracted from Hv, give a negative value. In this situation
it is highly unlikely that Hv is a harmonic. Figure 2 presents a situation in which Hv is
a local maximum and the surrounding maxima have small values, in contrast to Fig. 3, where
the analyzed regions contain large local maxima. Therefore Fig. 2 represents a peak that is
most likely to be a harmonic.
HIGH ACCURACY AND OCTAVE ERROR . . . 5
Fig. 1. Analysis of a possible harmonic peak and its surrounding region (analyzed fundamental frequency
is not related to peak frequency).
Fig. 2. Analysis of a possible harmonic peak and its surrounding region (analyzed fundamental frequency
is correctly related to peak frequency).
Fig. 3. Analysis of a possible harmonic peak and its surrounding region (analyzed fundamental frequency
is two times larger than pitch).
Since the spectrum peak representing pitch is sampled with limited resolution, interpolation
is required to improve the accuracy of the algorithm. Different linear methods have
been tested in order to find a computationally efficient and suitable interpolation technique;
however, estimating pitch based on a discrete spectrum is not a trivial task.
Problems are caused by other frequency components surrounding the peak related to pitch.
In practice, those disturbances are caused by spectral leakage of the sinusoidal components
of the signal (higher order harmonics), and depend on the frequency distance between those
components and on their energy. Therefore, using simple interpolation methods, such as
polynomials or splines, would result in limited performance. Artificial Neural Networks
(ANN) seem to be suitable for this task and are successfully used to improve
estimation accuracy, which is shown in the following sections.
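For comparison, the kind of simple polynomial interpolation mentioned above can be sketched as a three-point parabolic fit around the peak bin. This is the standard textbook baseline, not the ANN-based method of the paper; the function name and the synthetic test data are assumptions.

```python
import numpy as np

def parabolic_peak_bin(magnitude, k):
    """Fractional bin position of the vertex of a parabola through bins k-1, k, k+1."""
    a, b, c = magnitude[k - 1], magnitude[k], magnitude[k + 1]
    return k + 0.5 * (a - c) / (a - 2.0 * b + c)

# Synthetic check: samples of an exact parabola whose vertex lies at bin 10.3.
x = np.arange(20)
magnitude = 5.0 - (x - 10.3) ** 2
k = int(np.argmax(magnitude))
peak_bin = parabolic_peak_bin(magnitude, k)
# A frequency estimate would then be peak_bin * fs / N for an N-point block.
```

The fit is exact only when the peak really is parabolic; leakage from neighboring components distorts the three samples, which is the limitation the paragraph above points out.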
monic signals were generated to obtain the training input data and target signal. Each
training signal was synthesized according to the following formula:
$$S[n] = \sum_{i=1}^{K} \sin\left(\frac{2 \pi n i F_{pitch}}{F_s}\right) \cdot \frac{R[n]}{i} \qquad (8)$$
where:
R – vector containing uniformly distributed (on the (0, 1) interval) pseudo-random
numbers.
Fpitch – fundamental frequency of the synthesized signal,
Fs – sampling frequency,
K – number of harmonics contained in the signal S. K is defined as follows:
floor(Fs /Fpitch ).
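A literal reading of Eq. (8) can be sketched as follows. The sketch assumes that R is indexed by the sample number n, as printed, and that K = floor(Fs / Fpitch), as defined above; the function name, the random seed and the 441 Hz example are hypothetical.

```python
import numpy as np

def synth_training_signal(f_pitch, fs, length, rng):
    """Synthesize one signal following Eq. (8) as printed."""
    k = int(np.floor(fs / f_pitch))          # number of harmonics, as defined in the text
    n = np.arange(length)
    i = np.arange(1, k + 1)[:, None]         # harmonic numbers as a column vector
    r = rng.uniform(0.0, 1.0, size=length)   # R[n], uniformly distributed pseudo-random numbers
    harmonics = np.sin(2 * np.pi * n * i * f_pitch / fs) / i
    return r * harmonics.sum(axis=0)

rng = np.random.default_rng(0)               # fixed seed, for repeatability only
signal = synth_training_signal(441.0, 44100, 1024, rng)
windowed = signal * np.hanning(1024)         # Hann weighting, as applied to each training signal
```

The 1/i factor makes higher harmonics progressively weaker, which matches the observation that the synthetic signals tend to have harmonics with decreasing energies.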
It can be observed that a synthetic signal is most likely to have harmonics with
decreasing energies, similar to musical instrument sounds. Three training processes
were performed, employing various window sizes (different lengths of training signals):
1024, 2048 and 4096 samples, while the sampling frequency was equal to 44100 Hz. Each signal
was weighted by the Hann window, because the Hann window was also
used in the SPA estimation process. A great number of synthetic signals were generated
to obtain training data for each window size, with fundamental frequencies randomly
chosen from Fmin to 4500 Hz. Fmin is the lowest possible frequency with respect to
d, depending on the window size. The neural network used in the training process was
a feed-forward, back-propagation structure with three layers. The first layer contained three
neurons, the hidden layer four neurons and the output layer one neuron. The hyperbolic
tangent sigmoid transfer function was chosen to activate the first two layers, whilst the
linear identity function was used to activate the last layer. During the training process,
weights and biases were updated according to Levenberg–Marquardt optimization [21].
The trained network was used in the estimation process, resulting in the performance presented
in the following section.
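The layer structure described above (three neurons, four neurons, one neuron, with tanh, tanh and identity activations) can be illustrated with a plain forward pass. This sketch shows shapes only: the number of inputs is not stated in the text and is assumed to be three, the weights are random rather than trained, and Levenberg–Marquardt training is not implemented.

```python
import numpy as np

def forward(x, params):
    """Forward pass: tanh, tanh, then a linear (identity) output layer."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = np.tanh(w1 @ x + b1)      # first layer, three neurons
    h2 = np.tanh(w2 @ h1 + b2)     # hidden layer, four neurons
    return w3 @ h2 + b3            # output layer, one neuron

rng = np.random.default_rng(0)
n_inputs = 3                        # hypothetical: the input size is not given in the text
sizes = [n_inputs, 3, 4, 1]
# Untrained placeholder weights and zero biases, used only to show the layer shapes.
params = [(rng.standard_normal((n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
y = forward(np.array([0.2, 1.0, 0.4]), params)
```

The single output would play the role of the refined estimate; what exactly the inputs and the target represent is left open here because the text does not specify them.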
Figures 4–8 present estimation errors for all tested signals and each algorithm,
showing how the error fluctuates as frequency changes. It can be observed that
time-domain algorithms show a decrease in estimation accuracy when the
signal frequency increases, whereas for frequency-domain algorithms
the situation is the opposite.
Fig. 7. Pitch estimation error of the SPA algorithm (2nd order polynomial interpolation).
Table 2. Pitch estimation results for oboe (articulation: non legato, dynamics: mezzo forte).
Table 3. Pitch estimation results for oboe (articulation: portato, dynamics: mezzo forte).
Table 4. Pitch estimation results for oboe (articulation: double staccato, dynamics: mezzo forte).
Fig. 10. Pitch estimation results for baritone saxophone (articulation: non legato, dynamics: forte,
range: C2# – A4).
Fig. 11. Pitch estimation results for bassoon (articulation: non legato, dynamics: forte, range:
A1# – C5).
Fig. 12. Pitch estimation results for trumpet (articulation: non legato, dynamics: forte, range:
E3 - G5#).
Fig. 13. Pitch estimation results for tuba F (articulation: non legato, dynamics: forte, range: F1 - C4#).
Fig. 14. Pitch estimation results for viola (articulation: non legato, dynamics: forte, range: C3 - A6).
Fig. 15. Pitch estimation results for oboe (articulation: non legato, dynamics: forte, range: A3# - F6).
Fig. 16. Pitch estimation results for oboe (articulation: non legato, dynamics: piano, range: A3# - F6# ).
Fig. 17. Pitch estimation results for oboe (articulation: vibrato, dynamics: mezzo forte, range: A3# – F6).
Fig. 18. Pitch estimation results for oboe (articulation: single staccato, dynamics: mezzo forte, range:
A3# – G6).
As seen from the tables and figures presented, no octave-related errors were made
by the engineered algorithm. Different articulations and dynamics of sounds seemed
not to affect the resistance of the SPA to octave errors. Differences, sometimes
significant, between the estimated pitch and the tone frequency arise because the musicians
were playing solo. Moreover, the instruments were not tuned to exactly the same pitch before
the recordings.
6. Conclusion
The proposed algorithms have been tested on a variety of sounds with different
articulations and dynamics, showing high resistance to octave errors (no octave error was
detected in any of the tested sounds). In addition, the analysis is not limited to harmonic
sounds (although periodicity has to be maintained), which is the case with
other algorithms, such as the CA and ACOLS algorithms. Moreover, the energy
of harmonics does not have to be concentrated around the fundamental frequency, which is
an important issue for both the NCC and AMDF algorithms. The main disadvantage of the
SPA presented is its limited frequency range (lower boundary) for small window sizes.
The NCC algorithm, on the other hand, has an extended lower frequency limit. However,
in the case of fast pitch fluctuations of low-pitched sounds, the overlap can be decreased
significantly while keeping large window sizes, so that the resolution of the calculated
pitch track may be preserved.
In addition, the presented accuracy optimization of the algorithm seems to be very effective,
resulting in very precise pitch estimation. The optimized SPA algorithm gives far more
precise results than classic PDAs. These characteristics may be useful in sound separation
and parameterization processes.
Acknowledgment
The research is sponsored by the Committee for Scientific Research, Warsaw, Grant
No. 4T11D 014 22, and by the Foundation for Polish Science, Poland.
References
[1] W. HESS, Pitch determination of speech signal processing, Springer-Verlag, New York 1983.
[2] A. M. NOLL, Cepstrum pitch determination, J. Acoust. Soc. Am., 14, 293–309 (1967).
[3] L. R. RABINER, On the use of autocorrelation analysis for pitch detection, IEEE Trans. on ASSP,
25, 24–33 (1977).
[4] X. QUIAN, R. KIMARESAN, A variable frame pitch estimator and test results, IEEE Int. Conf. on
Acoustics, Speech, and Signal Processing, 1, Atlanta GA, 228–231, May (1996).
[5] D. TALKIN, A robust algorithm for pitch tracking (RAPT), Speech Coding and Synthesis, pp. 495–518,
Elsevier, 1995.