Spectral Energy Based Voice Activity Detection For Real-Time Voice Interface
ABSTRACT
Voice activity detection (VAD) is a core process in speech recognition: every voice region is detected
so that acoustic feature parameters can be extracted from it. This paper proposes an efficient VAD
approach for real-time voice interface systems. Although diverse VAD approaches have been
successfully applied in speech applications, they may operate inefficiently depending on
environmental conditions. In this study, we attempt to enhance the conventional VAD method based on
signal energy in the time and spectral domains. In addition, an efficient end-point detection method is
proposed. We verified the efficiency of the proposed approach through a set of VAD experiments,
comparing its performance with that of several conventional VAD methods, including the zero-crossing rate.
Approaches
2.2.1 Cepstral distance based VAD
In one voice detection method, a cepstral distance based on the Euclidean distance is used [10]. To
extract cepstral features, the Fast Fourier Transform (FFT) is applied to the speech signal, the result
is converted to a logarithmic scale, and the Inverse Fast Fourier Transform (IFFT) is then applied.
The cepstral features are obtained as the result of the IFFT and are extracted by multiplying by a
cepstral window in the cepstrum domain.
This method assumes that speech regions show a larger cepstral distance between signal frames, while
non-speech regions show a relatively smaller cepstral distance between frames. Based on this
property, a frame whose distance is larger than a pre-determined threshold is categorized as a speech
region, whereas a frame whose distance is smaller than the threshold is categorized as a non-speech
region.
The correctness of the threshold greatly affects the accuracy of the speech/non-speech decision,
because the decision criterion depends only on the threshold. For this reason, the cepstral distance
based VAD approach is useful only for limited speech data.

The zero-crossing rate (zcr) of a frame is computed as

$$zcr = \frac{1}{T-1} \sum_{t=1}^{T-1} f\big(x(t)\,x(t-1)\big) \qquad (1)$$

where x(t) is the t-th signal among T signals, and the indicator function f() is 1 if its argument is
lower than zero and 0 otherwise. In other words, if two consecutive signals, x(t) and x(t-1), have
different signs, the function returns 1. The zcr thus measures the frequency of sign changes among
the T signals. In general, speech regions show more frequent sign changes than non-speech regions.
Thus, a frame with a relatively higher zcr value tends to be a speech region.
The zero-crossing rate based VAD is very simple to implement, because the rate can be estimated
directly from sample values in the time domain.
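The two conventional criteria described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes NumPy is available, and the function names (`zero_crossing_rate`, `cepstrum`, `cepstral_distance`, `is_speech_by_cepstral_distance`), the number of retained coefficients (`n_coeffs=12`), the small constant `eps`, and the rectangular cepstral window are all illustrative choices that the paper does not specify.

```python
import numpy as np


def zero_crossing_rate(frame):
    """Equation (1): fraction of consecutive sample pairs with a sign change."""
    x = np.asarray(frame, dtype=float)
    if x.size < 2:
        return 0.0
    # f(x(t) * x(t-1)) is 1 when the product is below zero.
    sign_changes = (x[1:] * x[:-1]) < 0
    return float(np.sum(sign_changes)) / (x.size - 1)


def cepstrum(frame, n_coeffs=12, eps=1e-10):
    """Real cepstrum: FFT -> log magnitude -> inverse FFT.

    n_coeffs and eps are assumed values, not taken from the paper.
    """
    x = np.asarray(frame, dtype=float)
    log_magnitude = np.log(np.abs(np.fft.rfft(x)) + eps)  # eps avoids log(0)
    full = np.fft.irfft(log_magnitude, n=x.size)
    # Assumed rectangular cepstral window: keep the first n_coeffs values.
    return full[:n_coeffs]


def cepstral_distance(frame_a, frame_b, n_coeffs=12):
    """Euclidean distance between the windowed cepstra of two frames."""
    diff = cepstrum(frame_a, n_coeffs) - cepstrum(frame_b, n_coeffs)
    return float(np.linalg.norm(diff))


def is_speech_by_cepstral_distance(prev_frame, frame, threshold):
    """Label a frame as speech when its distance to the previous frame
    exceeds the pre-determined threshold (threshold value is up to the user)."""
    return cepstral_distance(prev_frame, frame) > threshold
```

For example, `zero_crossing_rate([1, -1, 1, -1])` returns 1.0 because every consecutive pair changes sign, while a monotonic frame such as `[1, 2, 3, 4]` returns 0.0. The cepstral decision depends entirely on `threshold`, which reflects the sensitivity to threshold choice noted above.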
Figure 10: DET Curve for the Performance Comparison of VAD Approaches
A curve closer to the origin indicates superior performance, that is, low FAR and FRR. In this
experiment, we compared four VAD approaches: the proposed low spectral energy based approach
(Low power spectrum), the overall spectral energy based approach (Power spectrum), the
conventional time domain energy based approach (Frame energy), and the conventional
zero-crossing rate based approach (Zero-crossing rate).
As shown in this figure, the proposed spectral energy based method showed better performance,
indicating lower FRR and FAR. The two energy based approaches (Power spectrum, Frame energy)
achieved similar performance, but the power spectrum energy provided a slightly better criterion
for VAD. Zero-crossing rate showed the worst performance.

5. CONCLUSION
In this study, we proposed an efficient voice activity detection (VAD) approach for real-time voice
interface systems. The conventional VAD approaches are easy to implement, but tend to be
vulnerable to environmental noises. The proposed approach concentrates on the spectral energy of
the frequency regions in which human voice components exist. Spectral energy is estimated within
a certain range of frequency bins for every frame, and the value is used as a criterion to determine
whether the frame is a speech or a non-speech frame.
For validation of the proposed approach, we conducted several VAD experiments using real-time
input speech signals. The proposed spectral energy-based method exhibited superior VAD
performance compared to the conventional approaches.

6. DISCUSSION (FUTURE WORKS)
Even though the proposed approach was successfully verified using real-time speech data, further
verification under noisy environments is required before it can be employed in real-world voice
interface applications, in which various environmental noises contaminate the input speech signals.
In future work, we will validate our method using various types of noises for further verification
and then apply the method to a speech recognition task.

REFERENCES:
[1] Y. Moon, K. Kim, and D. Shin, “Voice of the
internet of things: an exploration of multiple voice
effects in smart homes”, Distributed, Ambient and
Pervasive Interactions, Vol. 9749, 2016, pp. 270-278.
[2] A.A. Arriany and M.S. Musbah, “Applying voice
recognition technology for smart home networks”,
Proceedings of International Conference on
Engineering & MIS (ICEMIS), 2016.
[3] J. Park, G. Jang, J. Kim, and S. Kim, “Acoustic
interference cancellation for a voice-driven
interface in smart TVs”, IEEE Transactions on
Consumer Electronics, Vol. 59, 2013, pp. 244-249.
[4] M.H. Cohen, J.P. Giangola, and J. Balogh, Voice
user interface design, Addison-Wesley
Professional, 2004.
[5] L.R. Rabiner, and B.H. Juang, Fundamentals of
speech recognition, 1993.
[6] S. Yiming, and W. Rui, “Voice activity detection
based on the improved dual-threshold method”,
Proceedings of IEEE International Conference on
Intelligent Transportation in Big Data and Smart
City (ICITBS), 2015, pp. 996-999.
[7] X. Yang, B. Tan, J. Ding, J. Zhang, and J. Gong,
“Comparative study on voice activity detection
algorithm”, Proceedings of IEEE International
Conference in Electrical and Control Engineering,
2010, pp. 59-602.
[8] J. Ramirez, J.M. Górriz, and J.C. Segura, “Voice
activity detection. fundamentals and speech
recognition system robustness”, Robust Speech
Recognition and Understanding, 2007.
[9] M.H. Moattar, and M.M. Homayounpour, “A simple
but efficient real-time voice activity detection
algorithm”, Proceedings of IEEE European Signal
Processing Conference, 2009.
[10] J.S. Choi, “A detection method of
speech/non-speech sections using scale of cepstrum
distance”, Proceedings of KIIT Conference, 2012,
pp. 489-492.
[11] R.G. Bachu, S. Kopparthi, B. Adapa, and B.D.
Barkana, “Voiced/unvoiced decision for speech
signals based on zero-crossing rate and energy”,
Advanced Techniques in Computing Sciences and
Software Engineering, 2010, pp. 279-282.
[12] Y.K. Lau, and C.K. Chan, “Speech recognition
based on zero crossing rate and energy”, IEEE
Transactions on Acoustics, Speech, and Signal
Processing, Vol. 33, 1985, pp. 320-323.
[13] L.R. Rabiner, and M.R. Sambur, “An
algorithm for determining the endpoints of
isolated utterance”, Bell System Technical
Journal, Vol. 54, 1975, pp. 297-315.
[14] J. Hong, S. Park, S. Jeong, and M. Hahn,
“Robust feature extraction for voice activity
detection in nonstationary noisy
environments”, Journal of The Korean Society
of Speech Sciences, Vol. 5, 2013, pp. 11-16.