Journal of Computational and Applied Mathematics 227 (2009) 288–293
Wavelet-based cepstrum calculation
Fabrício Lopes Sanchez a , Sylvio Barbon Júnior b,∗ , Lucimar Sasso Vieira b ,
Rodrigo Capobianco Guido b , Everthon Silva Fonseca c , Paulo Rogério Scalassara c ,
Carlos Dias Maciel c , José Carlos Pereira c , Shi-Huang Chen d
a Department of Bioengineering, University of São Paulo, USP, São Carlos, São Paulo, 13566-590, Brazil
b Institute of Physics of São Carlos - University of São Paulo, USP, São Carlos, São Paulo, 13566-590, Brazil
c Department of Electrical Engineering, School of Engineering at São Carlos - University of São Paulo, USP, São Carlos, São Paulo 13566-590, Brazil
d Department of Computer Science and Information Engineering, Shu-Te University, Kaohsiung County, 824, Taiwan, ROC
Article history: Received 4 April 2006; received in revised form 30 September 2007.
Keywords:
Wavelet transform
Cepstrum
Pitch period
Speech analysis
Abstract
In this paper we present a new wavelet-based algorithm for low-cost computation of the
cepstrum. It can be used for real time precise pitch determination in automatic speech and
speaker recognition systems. Many wavelet families are examined to determine the one
that works best. The results confirm the efficacy and accuracy of the proposed technique
for pitch extraction.
© 2008 Elsevier B.V. All rights reserved.
1. Introduction
Cepstral analysis is fairly well discussed in [1,2] and many other references. With it, a voiced speech segment can be
separated into two main parts: the resonances of the vocal tract, and the harmonic peaks which come from the source
pulmonary excitation ([1], p. 30; [2], p. 79). Cepstral analysis has been used intensively for speech recognition, speaker
recognition, and speaker verification and identification [1,3,4].
The cepstrum, which is the signal derived from the cepstral analysis of a speech utterance, is traditionally obtained by
means of the Discrete Fourier Transform (DFT); see Section 2. This approach gives precise results, but with O(L²) cost, i.e.,
the computational complexity is quadratic in the signal length, L. If the Fast Fourier Transform (FFT) is used,
the complexity drops to O(L · log(L)). The proposed algorithm has a lower order of complexity, since its basic principle
is the convolution of the input signal with a wavelet filter of small support. Furthermore, the accuracy that we obtained
with the proposed algorithm is comparable to the one obtained using the DFT or FFT. For many real time applications, the
proposed approach is a valuable alternative to the existing algorithms based on the DFT or FFT for pitch extraction.
This paper is organized as follows. Section 2 presents a brief review on the cepstrum and its calculation based on the
DFT/FFT. The proposed algorithm is described in Section 3. A study on wavelets and their important characteristics for
precise cepstrum calculation appears in Section 4. Section 5 describes the tests and results, and, lastly, Section 6 presents
the conclusions.
∗ Corresponding address: Universidade de São Paulo, IFSC, Caixa Postal 369, CEP 13560-970, São Carlos, São Paulo, Brazil.
E-mail address: sbarbonjr@yahoo.com.br (S. Barbon Júnior).
doi:10.1016/j.cam.2008.03.016
Fig. 1. [above]: log spectrum of a voiced speech signal; [below]: corresponding cepstrum obtained with the Fourier approach, where the pitch period,
p, appears as a spike.
Table 1
The algorithm for obtaining the cepstrum, based on the Fourier approach

BEGINNING
STEP S1: Input speech utterance, s[n], with length L;
STEP S2: s[n] ← DFT(s[n]);
STEP S3: For (int i = 0; i < L; i++)
             s_i ← log10(|s_i|²);
STEP S4: s[n] ← IDFT(s[n]), where IDFT is the inverse DFT;
STEP S5: For (int i = 0; i < L; i++)
             s_i ← (s_i)²;
END.
2. A brief review: The cepstrum
According to [1], if E is the excitation, and H represents the vocal tract resonance, then, the speech spectrum is S = E · H.
Cepstral analysis converts this product into a sum of spectra as follows:
S = E · H ⇒ log(S) = log(E · H) ⇒ log(S) = log(E) + log(H).
The spectrum of H varies slowly with frequency, because it consists mostly of smooth formant curves, and the spectrum
of E is more irregular, since it relates to the harmonics of the excitation. Therefore, cepstral analysis linearly separates
the contributions given by E and H to form S. One particular application of cepstral analysis is determining the pitch
period, p, which corresponds to the time interval between two consecutive glottal closures that control the flow of air from E.
Another application is the determination of the formant frequencies, which are related to H. Quefrency, in seconds, is the scale
used to measure the cepstrum. Fig. 1 illustrates the explanations above, and the algorithm for obtaining the cepstrum based
on the DFT/FFT appears in Table 1.
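The Fourier-based procedure of Table 1 can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' code: the naive O(L²) DFT mirrors the complexity discussion in Section 1, and the small eps floor inside the logarithm is our addition to avoid taking the log of a zero-energy bin.

```python
import cmath
import math

def dft(x):
    # Naive discrete Fourier transform: O(L^2), as discussed in Section 1.
    L = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / L) for n in range(L))
            for k in range(L)]

def idft(X):
    # Inverse DFT (IDFT), also naive.
    L = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / L) for k in range(L)) / L
            for n in range(L)]

def fourier_cepstrum(s, eps=1e-12):
    # Steps S2-S5 of Table 1: DFT -> log power spectrum -> IDFT -> square.
    S = dft(s)                                              # step S2
    log_power = [math.log10(abs(v) ** 2 + eps) for v in S]  # step S3
    c = idft(log_power)                                     # step S4
    return [abs(v) ** 2 for v in c]                         # step S5

# A pulse train of period 8 produces a cepstral spike at quefrency 8.
pulses = [1.0 if n % 8 == 0 else 0.0 for n in range(64)]
cep = fourier_cepstrum(pulses)
```

In a real-time pitch detector the FFT would replace dft/idft, bringing the cost down to O(L · log L).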
3. The proposed algorithm
The proposed algorithm appears in Table 2, and further explanation follows. In steps S2 and S4, the full Discrete Wavelet-Packet Transform (DWPT) tree [5] is needed. For this algorithm, the DWPT requires the natural frequency ordering (NFO),
instead of the filter bank ordering (FBO). For each level of decomposition, the NFO is obtained by alternating the order in
which the pair of filters, low pass (h[n]) and high pass (g[n]), is applied to decompose each sub-band. A detailed explanation
of this procedure can be found in [5], pp. 107–111. Unlike for the Fourier-based cepstrum, the proposed algorithm does not
require an inverse DWPT (IDWPT).
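The FBO-to-NFO relabeling described above can be expressed compactly: the sub-band occupying position i in natural frequency order is the tree node whose low/high filter path is the binary-reflected Gray code of i. A small sketch (our illustration of the ordering discussed in [5]):

```python
def nfo_permutation(level):
    # Position i in natural frequency order (NFO) corresponds to the node
    # with filter-bank-order (FBO) index gray(i) at this level: alternating
    # the low/high order under high-pass bands yields exactly the
    # binary-reflected Gray code.
    def gray(n):
        return n ^ (n >> 1)
    return [gray(i) for i in range(2 ** level)]
```

At level 2, for example, the FBO bands (LL, LH, HL, HH) appear in NFO as (LL, LH, HH, HL), i.e., the permutation [0, 1, 3, 2].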
Table 2
Pseudo-code for the proposed algorithm
BEGINNING
STEP S1: Input the speech utterance, s[n], with length L, which must be a power of 2;
STEP S2: s[n] ← DWPT(s[n], m), that is, turn s[n] into its full mth-level Discrete
         Wavelet-Packet Transform (DWPT), where m = log(L)/log(2);
STEP S3: For (int i = 0; i < L; i++)
             s_i ← log10((s_i)²);
STEP S4: repeat step S2, i.e., apply the DWPT again to s[n];
STEP S5: For (int i = 0; i < L; i++)
             s_i ← (s_i)²;
END.
After carrying this out, the signal s[n] will become the cepstrum of its original version.
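As a concrete sketch of Table 2, the Python code below builds the full-depth DWPT recursively, swapping the low/high output order along odd-parity paths to obtain the NFO. For brevity it hard-codes the Haar filter pair; as Section 4 argues, a Symmlet or Coiflet pair would replace it in practice. The eps floor in step S3 is our addition to avoid the log of zero.

```python
import math

SQRT2 = math.sqrt(2.0)

def analysis_step(x):
    # One level of filtering and downsampling with the Haar pair
    # (h = low pass, g = high pass); illustrative choice only.
    low = [(x[2 * i] + x[2 * i + 1]) / SQRT2 for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / SQRT2 for i in range(len(x) // 2)]
    return low, high

def dwpt_nfo(x, parity=0):
    # Full wavelet-packet tree down to single-sample leaves, in natural
    # frequency ordering: the low/high output order is swapped whenever an
    # odd number of high-pass filters has been applied along the path,
    # which yields the Gray-code band ordering.
    if len(x) == 1:
        return list(x)
    low, high = analysis_step(x)
    if parity == 0:
        return dwpt_nfo(low, 0) + dwpt_nfo(high, 1)
    return dwpt_nfo(high, 0) + dwpt_nfo(low, 1)

def wavelet_cepstrum(s, eps=1e-12):
    # Steps S2-S5 of Table 2: DWPT -> log power -> DWPT again -> square.
    c = dwpt_nfo(s)                            # step S2
    c = [math.log10(v * v + eps) for v in c]   # step S3
    c = dwpt_nfo(c)                            # step S4
    return [v * v for v in c]                  # step S5
```

Note that no inverse transform appears anywhere, in contrast with the Fourier-based algorithm of Table 1.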
To obtain the period of pitch, p, of a speech utterance, a search can be performed within the range [a–b] to find the highest
value, i.e., the spike that represents the pitch. The values of a and b that represent samples of the cepstrum can be determined
as follows:
a = round(sampling_rate / 400.0),
b = round(sampling_rate / 60.0),

where 60 and 400 are the limits of frequencies, in hertz, for searching for the pitch. The kth sample between a and b that
contains the highest value is related to the pitch frequency, f = 1/p, according to the following:

f = sampling_rate / k  Hz.
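The search over [a–b] can be written directly from the two expressions above. A minimal sketch (the function name is ours):

```python
def pitch_from_cepstrum(cep, sampling_rate):
    # Search the quefrency band corresponding to 60-400 Hz for the
    # highest cepstral spike; k is the index of that spike.
    a = round(sampling_rate / 400.0)
    b = round(sampling_rate / 60.0)
    k = max(range(a, b + 1), key=lambda i: cep[i])
    return sampling_rate / k  # pitch frequency f = 1/p, in hertz

# Example: a spike at sample 128 of a 16 000 Hz cepstrum gives 125.0 Hz.
cep = [0.0] * 300
cep[128] = 5.0
f = pitch_from_cepstrum(cep, 16000)
```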
It is interesting to note that the Fourier-based approach converts the input signal from the time domain to the frequency
domain, scales it using the logarithmic function, and then returns the scaled signal to the time domain. On the other hand,
the proposed approach filters the input signal, scales it, filters again, and then calculates the energy of each sub-band. Since
the input signal is decomposed until the maximum level is reached, the leaves of the decomposition tree contain only one
sample each and the sample with the highest value within the range [a–b] is the one related to the pitch period.
For long signals, computing the Fourier-based cepstrum is expensive compared to the proposed method. The order of
complexity of the proposed approach is N · K · O(L), where N is a constant corresponding to the adopted wavelet filter
length and K is the number of decomposition levels of the wavelet-packet tree.
4. The choice of the wavelet
The proposed algorithm requires the analysis of s[n] twice, according to the steps S2 and S4 of the previous section.
The process by which s[n] is analyzed is Mallat’s pyramidal algorithm [6]. It consists of low-pass and high-pass filtering, by
discrete convolutions [7], followed by down-samplings by 2. A generalization of Mallat’s algorithm is well described in the
literature, and can be found in [5,7]; therefore, this topic is not included here.
During the analysis of s[n], the length N of h[k] and g[k] determines the frequency selectivity, Q, and the time resolution,
R. As N increases, Q increases and R decreases; therefore, a balanced value for N is needed [8]. Let us call this
requirement 1. Another important consideration, say requirement 2, is that (almost) linear phase filters, h[k] and g[k],
are desirable to avoid distortion in the filtered signal, i.e., (almost) symmetrical or anti-symmetrical impulse responses are
preferable. Lastly, we can define requirement 3, that is, a flat frequency response in the pass and in the stop bands of the
filters is desirable to attenuate the Gibbs phenomenon [9].
The proposed algorithm does not require any synthesis of the speech signal, i.e., no inverse DWT (IDWT) is performed.
Therefore, it is sufficient to consider only requirements 1, 2, and 3, which are related to the filters h[n] and g[n], and
disregard particular characteristics of the scaling and wavelet functions [7], φ(n) = Σ_k h_k φ(2n − k) and
ψ(n) = Σ_k g_k φ(2n − k), respectively.
In order to satisfy requirement 1, we have to find N in such a way that it balances the constraints Q and R. Imposing
limits on N, say c and d (c < d), a search can be performed within this range to obtain optimal results in one particular
respect: R or Q.
Requirement 2 can only be satisfied by using finite impulse response (FIR) wavelet filters, since infinite impulse response
(IIR) ones, say Shannon, Meyer, and so on, do not exhibit linear phase [7]. Therefore, the only alternative among the
well-known FIR wavelet filters [7] is the Haar wavelet. Unfortunately, we also have to discard this last wavelet, since the
support of its filters is N = 2, and therefore it fails to satisfy requirement 1. The only alternative is the use of a wavelet
family whose filters exhibit almost linear phase, i.e., have almost symmetrical or anti-symmetrical impulse responses. Fig. 2,
which gives an intuition of the impulse response shapes of such low-pass filters, shows that Symmlet or Coiflet filters are
the most appropriate for satisfying this particular requirement.
Fig. 2. From left to right: impulse response shapes of the wavelet filters Haar, Daubechies, Vaidyanathan, Beylkin, Coiflet, and Symmlet.
Fig. 3. [above]: Fourier-based cepstrum of /a/; [below]: time-aligned wavelet-based cepstrum of the same speech signal; [horizontal axis]: cepstrum signal
samples; [vertical axis]: amplitude. The highest spike close to the end of the horizontal axis corresponds to the pitch.
Lastly, requirement 3 can be satisfied by using Daubechies’ filters, since they have a maximally flat frequency response [6].
This family of filters, however, fails to satisfy requirement 2, because its filters do not have a linear phase response. Therefore, the
only alternative is, again, to use a family of wavelets whose filters present an almost flat response in the pass and stop bands.
Since Symmlets and Coiflets present this characteristic, and they also satisfy requirements 1 and 2, they can be used.
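Requirement 2 can be checked numerically. A simple measure (ours, for illustration) is the normalized distance between a filter's impulse response and its time-reverse, which is zero for an exactly symmetrical filter. The sketch below contrasts the symmetric Haar low-pass filter with the clearly asymmetric Daubechies low-pass filter of support N = 4; Symmlet and Coiflet coefficients, not hard-coded here, would give small but nonzero values.

```python
import math

def symmetry_defect(h):
    # ||h - reverse(h)|| / ||h||: zero for an exactly symmetric impulse
    # response, large for a strongly asymmetric one.
    rev = h[::-1]
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(h, rev)))
    den = math.sqrt(sum(a * a for a in h))
    return num / den

SQRT2 = math.sqrt(2.0)
SQRT3 = math.sqrt(3.0)

haar = [1.0 / SQRT2, 1.0 / SQRT2]  # exactly symmetric impulse response
daubechies4 = [(1 + SQRT3) / (4 * SQRT2), (3 + SQRT3) / (4 * SQRT2),
               (3 - SQRT3) / (4 * SQRT2), (1 - SQRT3) / (4 * SQRT2)]
```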
5. Tests and results
For evaluation of the proposed algorithm, synthetic utterances were used, particularly the following vowel sounds: /a/
as in the word dogma, /e/ as in the word men, and /i/ as in the word ship, with pitch frequencies of, respectively,
116.79 Hz, 125.00 Hz, and 207.79 Hz. The signals were sampled at 16 000 Hz, 16-bit, PCM.
Figs. 3–5 show the cepstra of the utterances above, obtained by the Fourier method described in Section 2 and by the
proposed method. Table 3 contains the pitch frequency for the synthetic /e/ obtained with the proposed method, varying the
wavelet bases and filter lengths. Similar results, in terms of precision, were obtained with the vowels /a/ and /i/, according
to each wavelet used.
All the tests above used L = 2048 samples. With that, the maximum level of the DWPT is j = log(2048)/log(2) = 11, which
gives a frequency resolution s = (16 000/2)/2^11 ≈ 3.9 Hz after the first filtering step (S2) of the proposed algorithm. If L
increases, j increases and s decreases, i.e., a better frequency resolution is obtained. On the other hand, if L decreases, j
decreases and s increases, i.e., a worse frequency resolution is obtained.
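The relation between L, the maximum level j, and the resolution s can be sketched directly (the function name is ours):

```python
import math

def dwpt_resolution(L, sampling_rate):
    # Maximum DWPT level j = log(L)/log(2); splitting the Nyquist band
    # (sampling_rate / 2) into 2^j sub-bands gives the width s of each.
    j = round(math.log(L) / math.log(2))
    s = (sampling_rate / 2.0) / (2 ** j)
    return j, s

j, s = dwpt_resolution(2048, 16000)  # j = 11, s = 3.90625 Hz
```

Doubling L to 4096 adds one level (j = 12) and halves s, at the price of a longer buffering delay in a real time setting.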
Fig. 4. [above]: Fourier-based cepstrum of /e/; [below]: time-aligned wavelet-based cepstrum of the same speech signal; [horizontal axis]: cepstrum signal
samples; [vertical axis]: amplitude. The highest spike close to the end of the horizontal axis corresponds to the pitch.
Fig. 5. [above]: Fourier-based cepstrum of /i/; [below]: time-aligned wavelet-based cepstrum of the same speech signal; [horizontal axis]: cepstrum signal
samples; [vertical axis]: amplitude. The highest spike close to the middle of the horizontal axis corresponds to the pitch.
Table 3
Results obtained for the synthetic vowel /e/, created with a pitch frequency of 125.00 Hz

W               S     P
Haar            2     125.98
Daubechies      4     125.98
Daubechies      8     125.98
Daubechies      12    125.98
Daubechies      24    125.98
Daubechies      36    125.98
Coiflet         6     125.98
Coiflet         12    125.98
Coiflet         18    125.00
Coiflet         24    125.00
Coiflet         30    125.98
Symmlet         8     125.98
Symmlet         16    125.00
Symmlet         24    125.98
Beylkin         18    125.98
Vaidyanathan    24    125.98

[W]: wavelet family; [S]: filter support; [P]: pitch frequency, in hertz, obtained using the proposed algorithm.
6. Conclusions
We presented a new wavelet-based algorithm for cepstrum calculation, which has lower computational complexity than
the traditional method based on the DFT or FFT, and can be used for pitch period extraction in real time. On the basis of our
theoretical assumptions, confirmed in practice, we concluded that a family of wavelets whose filters exhibit almost linear
phase responses, i.e., Symmlets or Coiflets, with a balanced value for N, i.e., 12 ≤ N ≤ 24, are the best candidates. This choice
takes into account the frequency selectivity and the time resolution obtained with the use of such filters. The length, L, of the
input speech signal also influences the results, 2048 being a reasonable choice in terms of both precision and convenience
for small delay in real time applications.
Acknowledgment
We wish to thank the State of São Paulo Research Foundation for the grants given for this work under process number
2005/00015-1.
References
[1] L. Deng, D. O'Shaughnessy, Speech Processing: A Dynamic and Optimization-Oriented Approach, Marcel Dekker Inc., New York, USA, 2003.
[2] J. Coleman, Introducing Speech and Language Processing, Cambridge University Press, Cambridge, UK, 2005.
[3] W. Chou, B.H. Juang, Pattern Recognition in Speech and Language Processing, CRC Press, Boca Raton, USA, 2003.
[4] T.F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, Upper Saddle River, NJ, USA, 2002.
[5] A. Jensen, A. la Cour-Harbo, Ripples in Mathematics: The Discrete Wavelet Transform, Springer, Berlin, Germany, 2001.
[6] P.S. Addison, The Illustrated Wavelet Transform Handbook: Introductory Theory and Applications in Science, Engineering, Medicine and Finance,
Institute of Physics Publishing, Bristol, 2002.
[7] G. Strang, T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, MA, USA, 1997.
[8] R.C. Guido, C.D. Maciel, M. Monteiro, E.S. Fonseca, S. Panchapagesan, J.C. Pereira, L.S. Vieira, S. Barbon Jr., F.L. Sanchez, M.B.A. Guilherme, K.I.C. Sergio,
T.L. Scarpa, P.C. Fantinato, E.J.R. Moura, A study on the best wavelet for audio compression, in: 40th IEEE Asilomar Int. Conference on Signals, Systems
and Computers, Pacific-Grove, CA, USA, vol. 1, 2006, pp. 2115–2118.
[9] R.W. Hamming, Digital Filters, third ed., Dover Publications, New York, USA, 1998.