Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Wavelet-based cepstrum calculation

Journal of Computational and Applied Mathematics, 2009
...Read more
Journal of Computational and Applied Mathematics 227 (2009) 288–293 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam Wavelet-based cepstrum calculation Fabrício Lopes Sanchez a , Sylvio Barbon Júnior b,* , Lucimar Sasso Vieira b , Rodrigo Capobianco Guido b , Everthon Silva Fonseca c , Paulo Rogério Scalassara c , Carlos Dias Maciel c , José Carlos Pereira c , Shi-Huang Chen d a Department of Bioengineering, University of São Paulo, USP, São Carlos, São Paulo, 13566-590, Brazil b Institute of Physics of São Carlos - University of São Paulo, USP, São Carlos, São Paulo, 13566-590, Brazil c Department of Electrical Engineering, School of Engineering at São Carlos - University of São Paulo, USP, São Carlos, São Paulo 13566-590, Brazil d Department of Computer Science and Information Engineering, Shu-Te University, Kaohsiung County, 824, Taiwan, ROC article info Article history: Received 4 April 2006 Received in revised form 30 September 2007 Keywords: Wavelet transform Cepstrum Pitch period Speech analysis abstract In this paper we present a new wavelet-based algorithm for low-cost computation of the cepstrum. It can be used for real time precise pitch determination in automatic speech and speaker recognition systems. Many wavelet families are examined to determine the one that works best. The results confirm the efficacy and accuracy of the proposed technique for pitch extraction. © 2008 Elsevier B.V. All rights reserved. 1. Introduction Cepstral analysis is fairly well discussed in [1,2], and many other references. With it, a voiced speech segment can be separated into two main parts: the resonances of the vocal tract, and the harmonic peaks which come from the source pulmonary excitation ([1], pp. 30; [2], pp. 79). Cepstral analysis has been intensively used for speech recognition, speaker recognition, and speaker verification and identification [1,3,4]. The cepstrum, which is the signal derived from the cepstral analysis of a speech utterance, is traditionally obtained by means of the Discrete Fourier Transform (DFT); see Section 2. This approach gives precise results but with O(L 2 ), i.e., the order of computational complexity is quadratic in relation to the signal length (L). If the Fast Fourier Transform (FFT) is used, the complexity reduces to O(L · log(L)). The proposed algorithm has a lower order of complexity, since its basic principle is the convolution of the input signal with a wavelet filter of small support. Furthermore, the accuracy that we obtained with the proposed algorithm is comparable to the one obtained using the DFT or FFT. For many real time applications, the proposed approach is a valuable alternative to the existing algorithms based on the DFT or FFT for pitch extraction. This paper is organized as follows. Section 2 presents a brief review on the cepstrum and its calculation based on the DFT/FFT. The proposed algorithm is described in Section 3. A study on wavelets and their important characteristics for precise cepstrum calculation appears in Section 4. Section 5 describes the tests and results, and, lastly, Section 6 presents the conclusions. * Corresponding address: Universidade de Sao Paulo, IFSC, Postal 369 - CEP, 13560-970 Sao Carlos, Sao Paulo, Brazil. E-mail address: sbarbonjr@yahoo.com.br (S. Barbon Júnior). 0377-0427/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.cam.2008.03.016
F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293 289 Fig. 1. [above]: log spectrum of a voiced speech signal; [below]: corresponding cepstrum obtained with the Fourier approach, where the period of pitch, p, appears as a spike. Table 1 The algorithm for obtaining the cepstrum, based on the Fourier approach BEGINNING STEP S1: Input speech utterance, s[n], with length L; STEP S2: s[n]← DFT(s[n]); STEP S3: For (int i = 0; i < L; i ++) s i log 10 k(s i )k 2 ; STEP S4: s[n]← IDFT(s[n]), where IDFT is the inverse DFT; STEP S5: For (int i = 0; i < L; i ++) s i (s i ) 2 ; END. 2. A brief review: The cepstrum According to [1], if E is the excitation, and H represents the vocal tract resonance, then, the speech spectrum is S = E · H. Cepstral analysis converts this product into a sum of spectra as follows: S = E · H log(S) = log(E · H) log(S) = log(E) + log(H). The spectrum of H varies slowly with frequency, because it consists mostly of smooth formant curves, and the spectrum of E is more irregular, since it relates to the harmonics of the excitation. Therefore, the cepstral analysis linearly separates the contributions given by E and H to form S. One particular application of cepstral analysis is in determining the period of pitch, p, which corresponds to the time interval between two consecutive glottal closures that control the flow of air from E. Another application is the determination of the formant frequencies, that are related to H. Quefrency, in seconds, is the scale used to measure the cepstrum. Fig. 1 illustrates the explanations above, and the algorithm for obtaining the cepstrum based on the DFT/FFT appears in Table 1. 3. The proposed algorithm The proposed algorithm appears in Table 2 and further explanation follows. In steps S2 and S4, the full Discrete Wavelet- Packet Transform (DWPT) tree [5] is needed. For this algorithm, the DWPT requires the natural frequency ordering (NFO), instead of the filter bank ordering (FBO). For each level of decomposition, the NFO is obtained by alternating the order in which the pair of filters, low pass (h[n]) and high pass (g[n]), is applied to decompose each sub-band. A detailed explanation of this procedure can be found in [5], pp. 107–111. Unlike for the Fourier-based cepstrum, the proposed algorithm does not require an inverse DWPT (IDWPT).
Journal of Computational and Applied Mathematics 227 (2009) 288–293 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam Wavelet-based cepstrum calculation Fabrício Lopes Sanchez a , Sylvio Barbon Júnior b,∗ , Lucimar Sasso Vieira b , Rodrigo Capobianco Guido b , Everthon Silva Fonseca c , Paulo Rogério Scalassara c , Carlos Dias Maciel c , José Carlos Pereira c , Shi-Huang Chen d a Department of Bioengineering, University of São Paulo, USP, São Carlos, São Paulo, 13566-590, Brazil b Institute of Physics of São Carlos - University of São Paulo, USP, São Carlos, São Paulo, 13566-590, Brazil c Department of Electrical Engineering, School of Engineering at São Carlos - University of São Paulo, USP, São Carlos, São Paulo 13566-590, Brazil d Department of Computer Science and Information Engineering, Shu-Te University, Kaohsiung County, 824, Taiwan, ROC article info Article history: Received 4 April 2006 Received in revised form 30 September 2007 Keywords: Wavelet transform Cepstrum Pitch period Speech analysis a b s t r a c t In this paper we present a new wavelet-based algorithm for low-cost computation of the cepstrum. It can be used for real time precise pitch determination in automatic speech and speaker recognition systems. Many wavelet families are examined to determine the one that works best. The results confirm the efficacy and accuracy of the proposed technique for pitch extraction. © 2008 Elsevier B.V. All rights reserved. 1. Introduction Cepstral analysis is fairly well discussed in [1,2], and many other references. With it, a voiced speech segment can be separated into two main parts: the resonances of the vocal tract, and the harmonic peaks which come from the source pulmonary excitation ([1], pp. 30; [2], pp. 79). Cepstral analysis has been intensively used for speech recognition, speaker recognition, and speaker verification and identification [1,3,4]. The cepstrum, which is the signal derived from the cepstral analysis of a speech utterance, is traditionally obtained by means of the Discrete Fourier Transform (DFT); see Section 2. This approach gives precise results but with O(L2 ), i.e., the order of computational complexity is quadratic in relation to the signal length (L). If the Fast Fourier Transform (FFT) is used, the complexity reduces to O(L · log(L)). The proposed algorithm has a lower order of complexity, since its basic principle is the convolution of the input signal with a wavelet filter of small support. Furthermore, the accuracy that we obtained with the proposed algorithm is comparable to the one obtained using the DFT or FFT. For many real time applications, the proposed approach is a valuable alternative to the existing algorithms based on the DFT or FFT for pitch extraction. This paper is organized as follows. Section 2 presents a brief review on the cepstrum and its calculation based on the DFT/FFT. The proposed algorithm is described in Section 3. A study on wavelets and their important characteristics for precise cepstrum calculation appears in Section 4. Section 5 describes the tests and results, and, lastly, Section 6 presents the conclusions. ∗ Corresponding address: Universidade de Sao Paulo, IFSC, Postal 369 - CEP, 13560-970 Sao Carlos, Sao Paulo, Brazil. E-mail address: sbarbonjr@yahoo.com.br (S. Barbon Júnior). 0377-0427/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.cam.2008.03.016 F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293 289 Fig. 1. [above]: log spectrum of a voiced speech signal; [below]: corresponding cepstrum obtained with the Fourier approach, where the period of pitch, p, appears as a spike. Table 1 The algorithm for obtaining the cepstrum, based on the Fourier approach BEGINNING STEP S1: STEP S2: STEP S3: STEP S4: STEP S5: Input speech utterance, s[n], with length L; s[n] ← DFT(s[n]); For (int i = 0; i < L; i + +) si ← log10 k(si )k2 ; s[n] ← IDFT(s[n]), where IDFT is the inverse DFT; For (int i = 0; i < L; i + +) si ← (si )2 ; END. 2. A brief review: The cepstrum According to [1], if E is the excitation, and H represents the vocal tract resonance, then, the speech spectrum is S = E · H. Cepstral analysis converts this product into a sum of spectra as follows: S = E · H ⇒ log(S) = log(E · H) ⇒ log(S) = log(E) + log(H). The spectrum of H varies slowly with frequency, because it consists mostly of smooth formant curves, and the spectrum of E is more irregular, since it relates to the harmonics of the excitation. Therefore, the cepstral analysis linearly separates the contributions given by E and H to form S. One particular application of cepstral analysis is in determining the period of pitch, p, which corresponds to the time interval between two consecutive glottal closures that control the flow of air from E. Another application is the determination of the formant frequencies, that are related to H. Quefrency, in seconds, is the scale used to measure the cepstrum. Fig. 1 illustrates the explanations above, and the algorithm for obtaining the cepstrum based on the DFT/FFT appears in Table 1. 3. The proposed algorithm The proposed algorithm appears in Table 2 and further explanation follows. In steps S2 and S4, the full Discrete WaveletPacket Transform (DWPT) tree [5] is needed. For this algorithm, the DWPT requires the natural frequency ordering (NFO), instead of the filter bank ordering (FBO). For each level of decomposition, the NFO is obtained by alternating the order in which the pair of filters, low pass (h[n]) and high pass (g[n]), is applied to decompose each sub-band. A detailed explanation of this procedure can be found in [5], pp. 107–111. Unlike for the Fourier-based cepstrum, the proposed algorithm does not require an inverse DWPT (IDWPT). 290 F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293 Table 2 Pseudo-code for the proposed algorithm BEGINNING STEP S1: STEP S2: Input the speech utterance, s[n], with length L, that must be a power of 2; s[n] ← DWPT (s[n], m), that is, turn s[n] into its full mth-level Discrete log(L) Wavelet-Packet Transform (DWPT), where m = log(2) ; For (int i = 0; i < L; i + +) si ← log10 (si )2 ; repeat step S2, i.e., apply the DWPT again to s[n]; For (int i = 0; i < L; i + +) si ← (si )2 ; STEP S3: STEP S4: STEP S5: END. After carrying this out, the signal s[n] will become the cepstrum of its original version. To obtain the period of pitch, p, of a speech utterance, a search can be performed within the range [a–b] to find the highest value, i.e., the spike that represents the pitch. The values of a and b that represent samples of the cepstrum can be determined as follows:   sampling_rate , a = round 400.0   sampling_rate b = round , 60.0 where 60 and 400 are the limits of frequencies, in hertz, for searching for the pitch. The kth sample between a and b that contains the highest value is related to the frequency of pitch, f = 1p , according to the following: f = sampling_rate k Hz. It is interesting to note that the Fourier-based approach converts the input signal from the time domain to the frequency domain, scales it using the logarithmic function, and then returns the scaled signal to the time domain. On the other hand, the proposed approach filters the input signal, scales it, filters again, and then calculates the energy of each sub-band. Since the input signal is decomposed until the maximum level is reached, the leaves of the decomposition tree contain only one sample each and the sample with the highest value within the range [a–b] is the one related to the pitch period. The Fourier-based cepstrum of a signal is computationally complex as compared to that for the proposed method, when the signal length is long. The order of complexity of the proposed approach is N · K · O(L), N being a constant that corresponds to the wavelet filter lengths adopted, and K being the number of decomposition levels of the wavelet-packet tree. 4. The choice of the wavelet The proposed algorithm requires the analysis of s[n] twice, according to the steps S2 and S4 of the previous section. The process by which s[n] is analyzed is Mallat’s pyramidal algorithm [6]. It consists of low-pass and high-pass filtering, by discrete convolutions [7], followed by down-samplings by 2. A generalization of Mallat’s algorithm is well described in the literature, and can be found in [5,7]; therefore, this topic is not included here. During the analysis of s[n], the length of h[k] and g[k], N, is responsible for frequency selectivity, Q , and time resolution, R. As N increases, Q increases and R decreases. Therefore, a balanced value for N is needed [8]. Let us define this as being the requirement 1. Another important consideration, say requirement 2, is that (almost) linear phase filters, h[k] and g[k], are desirable to avoid distortion in the filtered signal, i.e., (almost) symmetrical or anti-symmetrical impulse responses are preferable. Lastly, we can define requirement 3, that is, a flat frequency response in the pass and in the stop bands of the filters is desirable to attenuate the Gibbs phenomenon [9]. The proposed algorithm does not require any synthesis of the speech signal, i.e., no inverse DWT (IDWT) is performed. Therefore, it is sufficient to consider only the requirements 1, 2, and 3, which are related to the filters h[n] and g[n], and P P disregard particular characteristics of the scaling and wavelet functions [7], φ(n) = k hk φ(2n−k) and ψ(n) = k gk φ(2n−k), respectively. In order to satisfy the requirement 1, we have to find N in such a way that it balances the constraints Q and R. Imposing limits on N, say c and d, (c < d), a search can be performed within this range to allow optimal results in one particular respect: R or Q . Requirement 2 can only be satisfied by using finite impulse response (FIR) wavelet filters, since infinite impulse response (IIR) ones, say Shannon, Meyer, and so on, do not exhibit linear phase [7]. Therefore, the only alternative among the wellknown FIR wavelet filters [7] is the Haar wavelet. Unfortunately, we also have to discard this last wavelet, since the support of its filters is N = 2, and therefore it fails to satisfy the requirement 1. The only alternative is the use of a wavelet family whose filters exhibit almost linear phase, i.e., have almost symmetrical or anti-symmetrical impulse responses. Fig. 2, which gives an intuition of the impulse response shapes of such low-pass filters, shows that Symmlets or Coiflets filters are the most appropriate for satisfying this particular requirement. F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293 291 Fig. 2. From left to right: impulse response shapes of the wavelet filters Haar, Daubechies, Vaidyanathan, Beylkin, Coiflet, and Symmlet. Fig. 3. [above]: Fourier-based cepstrum of /a/; [below]: time-aligned wavelet-based cepstrum of the same speech signal; [horizontal axis]: cepstrum signal samples; [vertical axis]: amplitude. The highest spike close to the end of the horizontal axis corresponds to the pitch. Lastly, requirement 3 can be satisfied by using Daubechies’ filters, since they have a maximally flat frequency response [6]. This family of filters, however, fail to satisfy requirement 2 because they do not have a linear phase response. Therefore, the only alternative is, again, to use a family of wavelets whose filters present an almost flat response in the pass and stop bands. Since Symmlets and Coiflets present this characteristic, and they also satisfy requirements 1 and 2, they can be used. 5. Tests and results For evaluation of the proposed algorithm, synthetic utterances were used, particularly the following vowels sounds: /a/ as in the word dogma, /e/ as in the word men, and /i/ as in the word ship, with the pitch frequencies being, respectively, 116.79 Hz, 125.00 Hz, and 207.79 Hz. The signals were sampled at 16 000 Hz, 16-bit, PCM. Figs. 3–5 show the cepstrums of the utterances above, obtained by the Fourier method, described in Section 2, and by the proposed method. Table 3 contains the pitch frequency for the synthetic /e/ obtained with the proposed method, varying the wavelet bases and filter lengths. Similar results, in terms of precision, were obtained with the vowels /a/ and /i/, according to each wavelet used. (2048) All the tests above used L = 2048 samples. With that, the maximum level of the DWPT is j = loglog (2) = 11, which gives a 16 000 frequency resolution s = 2211 ≈ 3.9 Hz after the first filtering step (S2) of the proposed algorithm. If L increases, j increases and s decreases, i.e., a better frequency resolution is obtained. On the other hand, if L decreases, j decreases and s increases, i.e., a worse frequency resolution is obtained. 292 F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293 Fig. 4. [above]: Fourier-based cepstrum of /e/; [below]: time-aligned wavelet-based cepstrum of the same speech signal; [horizontal axis]: cepstrum signal samples; [vertical axis]: amplitude. The highest spike close to the end of the horizontal axis corresponds to the pitch. Fig. 5. [above]: Fourier-based cepstrum of /i/; [below]: time-aligned wavelet-based cepstrum of the same speech signal; [horizontal axis]: cepstrum signal samples; [vertical axis]: amplitude. The highest spike close to the middle of the horizontal axis corresponds to the pitch. Table 3 Results obtained for the synthetic vowel /e/, created with pitch period at 125.00 Hz W S P W S P W S P W S P Haar Daubechies Coiflet Symmlet 2 24 18 16 125.98 125.98 125.00 125.00 Daubechies Daubechies Coiflet Symmlet 4 36 24 24 125.98 125.98 125.00 125.98 Daubechies Coiflet Coiflet Beylkin 8 6 30 18 125.98 125.98 125.98 125.98 Daubechies Coiflet Symmlet Vaidyanathan 12 12 8 24 125.98 125.98 125.98 125.98 [W ]: wavelet family; [S]: filter support; [P]: pitch frequency, in hertz, obtained using the proposed algorithm. 6. Conclusions We presented a new wavelet-based algorithm for cepstrum calculation, which has lower computational complexity than the traditional method based on DFT or FFT, and can be used for pitch period extraction in real time. On the basis of our theoretical assumptions, confirmed in practice, we concluded that a family of wavelets whose filters exhibit almost linear phase responses, i.e., Symmlets or Coiflets, with a balanced value for N, i.e., 12 6 N 6 24, are the best candidates. This choice F.L. Sanchez et al. / Journal of Computational and Applied Mathematics 227 (2009) 288–293 293 takes into account the frequency selectivity and the time resolution obtained with the use of such filters. The length, L, of the input speech signal also influences the results, 2048 being a reasonable choice in terms of both precision and convenience for small delay in real time applications. Acknowledgment We wish to thank the State of São Paulo Research Foundation for the grants given for this work under process number 2005/00015-1. References [1] [2] [3] [4] [5] [6] L. Deng, O. O’Shaughnessy, Speech Processing: A Dynamic and Optimization-Oriented Approach, Marcel Dekker Inc., New York, USA, 2003. J. Coleman, Introducing Speech and Language Processing, Cambridge University Press, Cambridge, UK, 2005. W. Chou, B.H. Juang, Pattern Recognition in Speech and Language Processing, CRC Press, Boca Raton, USA, 2003. T.F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, Upper Saddle River, NJ, USA, 2002. A. Jensen, A. Cour-Harbo, Ripples in Mathematics: The Discrete Wavelet Transform, Springer, Berlin, Germany, 2001. P.S. Addison, The Illustrated Wavelet Transform Handbook: Introductory Theory and Applications in Science, Engineering, Medicine and Finance, Institute of Physics Publishing, Edinburg, 2002. [7] G. Strang, T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, MA, USA, 1997. [8] R.C. Guido, C.D. Maciel, M. Monteiro, E.S. Fonseca, S. Panchapagesan, J.C. Pereira, L.S. Vieira, S. Barbon Jr., F.L. Sanchez, M.B.A. Guilherme, K.I.C. Sergio, T.L. Scarpa, P.C. Fantinato, E.J.R. Moura, A study on the best wavelet for audio compression, in: 40th IEEE Asilomar Int. Conference on Signals, Systems and Computers, Pacific-Grove, CA, USA, vol. 1, 2006, pp. 2115–2118. [9] R.W. Hamming, Digital Filters, third ed., Dover Publications, New York, USA, 1998.
Keep reading this paper — and 50 million others — with a free Academia account
Used by leading Academics
Dimitris Askounis
National Technical University of Athens
Sotirios Xydis
National Technical University of Athens
Galal Nadim
Fayoum University
Forrest Brewer
University of California, Santa Barbara