Voice Recognition Using MFCC Algorithm
Abstract---The human voice is a vital biometric parameter for authentication and identification. Voice recognition is a biometric technology used to identify a particular person by voice. It provides enhanced security, convenient authentication and considerable cost savings, and it can be performed using many algorithms and speech models. The Mel Frequency Cepstral Coefficients (MFCC) algorithm is generally preferred as a feature extraction technique for voice recognition because it generates coefficients from the user's voice that are unique to every user.
Keywords— Voice, Biometric technology, Feature extraction, Authentication, MFCC
I. INTRODUCTION
Voice combines what people say and how they say it, providing two-factor authentication in a single action. Other biometric identifiers such as fingerprints, handwriting, iris, retina and face scans are also used, but voice identification offers authentication that is both secure and unique. Voice combines two factors, namely personal voice recognition and telephone recognition. Voice recognition systems are inexpensive and easily understood by users. In today's smart world, voice recognition plays a critical role in many areas: voice-based banking, home automation and voice-controlled gadgets are some of its many applications [1].
(Fig. Block diagram of the MFCC algorithm; the output of the chain is the mel-coefficients.)
A. PRE-EMPHASIS
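Pre-emphasis boosts the high-frequency content of the speech signal before framing. A minimal sketch, assuming the conventional first-order filter with coefficient 0.95 (an illustrative choice, not specified in this paper):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.95):
    """High-pass the speech signal: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = pre_emphasis(x)  # the first sample passes through unchanged
```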
B. FRAME BLOCKING
The input speech signal is segmented into frames of 15–20 ms with an overlap of 50% of the frame size. Usually the frame size (in sample points) is a power of two in order to facilitate the use of the FFT; if this is not the case, the frame is zero-padded to the nearest power-of-two length. If the sample rate is 16 kHz and the frame size is 256 sample points, then the frame duration is 256/16000 = 0.016 s = 16 ms. Additionally, with a 50% overlap of 128 points, the frame rate is 16000/(256 − 128) = 125 frames per second. Overlapping preserves continuity between adjacent frames.
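The segmentation described above can be sketched as follows, using the frame size and overlap from the worked example in the text:

```python
import numpy as np

def frame_signal(signal, frame_size=256, overlap=128):
    """Split a signal into overlapping frames; hop = frame_size - overlap."""
    hop = frame_size - overlap
    n_frames = 1 + (len(signal) - frame_size) // hop
    return np.stack([signal[i * hop : i * hop + frame_size]
                     for i in range(n_frames)])

fs = 16000
frames = frame_signal(np.zeros(fs))  # one second of audio at 16 kHz
# hop = 128 samples, so the frame rate is 16000/128 = 125 frames per second
```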
C. HAMMING WINDOW
Each frame is multiplied by a Hamming window in order to preserve the continuity of the first and the last points in the frame. If the signal in a frame is denoted by x(n), n = 0, …, N-1, then the signal after Hamming windowing is
x(n) * w(n) (3)
where w(n) is the Hamming window defined by
w(n) = 0.54 - 0.46 * cos(2πn/(N-1)) (4)
where 0 ≤ n ≤ N-1
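Equations (3) and (4) translate directly into code; the sketch below builds the window for a 256-point frame as in the earlier example:

```python
import numpy as np

N = 256                  # frame size in sample points
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Eq. (4)

frame = np.ones(N)       # stand-in for one speech frame
windowed = frame * w     # Eq. (3): element-wise product x(n) * w(n)
```

Note that the endpoints w(0) = w(N-1) = 0.08 taper the frame edges, which is what keeps the first and last points continuous across frames.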
(Fig. Plot of magnitude vs. frequency in Hz.)
_________________________________________________________________________________________________
© 2014, IJIRAE- All Rights Reserved Page - 159
International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163
Volume 1 Issue 10 (November 2014) www.ijirae.com
The major steps described above (pre-emphasis, frame blocking and Hamming windowing) are followed by the FFT, mel filter bank, logarithm and DCT stages to complete the implementation of the MFCC algorithm.
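The remaining stages of the MFCC chain, from a windowed frame to the mel-coefficients, can be sketched as follows. This is a minimal illustration: the choice of 26 mel filters and 13 coefficients is a common convention, not a parameter taken from this paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale."""
    hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, cen, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, cen):
            fb[i - 1, k] = (k - lo) / max(cen - lo, 1)
        for k in range(cen, hi):
            fb[i - 1, k] = (hi - k) / max(hi - cen, 1)
    return fb

def mfcc(frame, fs=16000, n_filters=26, n_ceps=13):
    """Mel-coefficients for one pre-emphasised, Hamming-windowed frame."""
    power = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
    log_e = np.log(mel_filterbank(n_filters, len(frame), fs) @ power + 1e-10)
    # DCT-II decorrelates the log filter-bank energies
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1)
                   / (2 * n_filters))
    return basis @ log_e

# one 16 ms frame of a 1 kHz tone, windowed as in Eq. (3)
frame = np.hamming(256) * np.sin(2 * np.pi * 1000 * np.arange(256) / 16000)
coeffs = mfcc(frame)
```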
Fig.4(a) Speaker 1 (Male) template Fig.4(b) Speaker 1 (Male) real-time input
Fig.5(a) Speaker 2 (Female) template Fig.5(b) Speaker 2 (Female) real-time input
III. CONCLUSION
It was observed that the MFCCs for every individual user were unique. Certain variations were observed due to differences in the recording environment. The MFCCs of the stored template and of the real-time input are then compared for every user. In the implementation, the Euclidean distance is used to compare the template and the real-time input. In this manner, the MFCC algorithm is used for voice recognition.
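The Euclidean-distance comparison described above can be sketched as follows; the acceptance-threshold logic is an illustrative assumption, as the paper does not state a specific decision rule:

```python
import numpy as np

def euclidean_distance(template, live):
    """Euclidean distance between the template and real-time MFCC arrays."""
    return float(np.linalg.norm(np.asarray(template) - np.asarray(live)))

template = np.array([[12.1, -3.4], [11.8, -3.1]])  # stored mel-coefficients
live = np.array([[12.0, -3.5], [11.9, -3.0]])      # real-time input
score = euclidean_distance(template, live)
# accept the claimed speaker when the score falls below a tuned threshold
```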
ACKNOWLEDGMENT
We take this opportunity to thank our project guide, Ms. Savitha Upadhya, for her guidance and support throughout the course duration. Her efforts to clarify our concepts and to help us code the entire algorithm were valuable for the development of this project. Her role as the project coordinator helped us meet all our deadlines.
We would like to express our gratitude towards our Head of the Department Dr. KTV Reddy, our Principal Dr. Rollin Fernandes and all the professors of the EXTC Department for their support, encouragement and suggestions. We would also like to thank all the lab assistants, because without their assistance in permitting the use of all the laboratory equipment, this project would not have been completed in the stipulated time.
Last but not least, we would like to thank our family members and our classmates for their valuable suggestions and constant motivation.
REFERENCES
[1] http://biometrics.pbworks.com/w/page/14811349/Advantages%20and%20disadvantages%20of%20technologies
[2] http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
[3] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.