Speech To Text Conversion (STT) System Using Hidden Markov Model (HMM)
ISSN 2277-8616
1. Introduction
Humans interact with each other in several ways, such as facial expression, eye contact, gesture and, above all, speech. Speech is the primary mode of communication among human beings and also the most natural and efficient form of exchanging information [1]. The speech-to-text (STT) conversion system is widely used in many application areas. In the educational field, STT or speech recognition systems are especially effective for deaf or speech-impaired students. The recognition of speech is one of the greatest challenges in speech processing. Speech recognition can be defined as the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program [1]. Basically, speech-to-text (STT) conversion systems are distinguished into two types: speaker-dependent and speaker-independent systems [2]. This paper presents a speaker-dependent speech recognition system. Speech recognition is a very complex task because it processes a randomly varying analogue signal, the speech signal. Thus, feature extraction is the main part of a speech recognition system. There are various methods of feature extraction. In recent research, many feature extraction techniques are commonly used, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA), Linear Predictive Coding (LPC), cepstral analysis, Mel-frequency cepstral coefficients (MFCCs), kernel-based feature extraction, wavelet transform and spectral subtraction [3]. In this paper, the Mel Frequency Cepstral Coefficient (MFCC) method is used. It is based on the hearing characteristics of the human ear, using a nonlinear frequency scale to simulate the human auditory system. The Mel frequency scale is widely used to extract features of speech. Mel-frequency cepstral features provide an efficient recognition rate for speech recognition as well as for emotion recognition through speech [4]. Moreover, Vector Quantization (VQ), Artificial Neural Networks (ANN), Hidden Markov Models (HMM), Dynamic Time Warping (DTW) and various other techniques are used by researchers for recognition. Among them, the HMM recognizer is currently dominant in many applications. Nowadays, STT systems are widely used in many control systems, mobile phones, computers and so forth. Therefore, this paper develops a speaker-dependent STT system based on MFCC features and an HMM recognizer.
2. Methodology
A. End Point Detection
Classification of speech into voiced or unvoiced sounds provides a useful basis for subsequent processing. A three-way classification into silence/unvoiced/voiced extends the possible range of further processing to tasks such as stop consonant identification and endpoint detection for isolated utterances [5]. In noisy environments, speech samples containing unwanted signals and background noise are trimmed away by the end point detection method. The end point detection method is based on the short-term log energy and the short-term zero-crossing rate [6]. The logarithmic short-term energy and the zero-crossing rate are calculated by equations (1) and (2):
Elog = Σ_{n=1}^{N} log(s(n)^2)                                        (1)

ZCR = (1/(2N)) Σ_{n=1}^{N-1} |sgn(s(n+1)) - sgn(s(n))|                (2)

with sgn(s(n)) = +1 if s(n) ≥ 0 and -1 if s(n) < 0,

where s(n) is the speech signal, Elog is the logarithmic short-term energy and ZCR is the short-term zero-crossing rate.
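As a minimal sketch of equations (1) and (2), assuming one speech frame is given as a NumPy array (the frame length and the silence/voiced decision thresholds are not stated in the paper and are left out here):

```python
import numpy as np

def short_term_features(frame):
    """Compute log energy (eq. 1) and zero-crossing rate (eq. 2) for one frame."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    # Eq. (1): logarithmic short-term energy; a small offset avoids log(0)
    e_log = np.sum(np.log(frame ** 2 + 1e-12))
    # Eq. (2): sgn(s(n)) is +1 for s(n) >= 0 and -1 otherwise
    sgn = np.where(frame >= 0, 1, -1)
    zcr = np.sum(np.abs(sgn[1:] - sgn[:-1])) / (2.0 * N)
    return e_log, zcr

# A frame that alternates sign every sample crosses zero at every step,
# so its ZCR approaches the maximum value.
e, z = short_term_features([0.5, -0.5, 0.5, -0.5, 0.5, -0.5])
```

Frames whose energy and ZCR fall below chosen thresholds would then be discarded as silence or background noise.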
B. Mel Frequency Cepstral Coefficient (MFCC)
Feature extraction is the most important part of the entire system. The aim of feature extraction is to reduce the data size of the speech signal before pattern classification or recognition. The steps of Mel Frequency Cepstral Coefficient (MFCC) calculation are framing, windowing, Discrete Fourier Transform (DFT), Mel frequency filtering, logarithmic compression and Discrete Cosine Transform (DCT). Fig.1 shows the block diagram of the MFCC process.
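The six steps above can be sketched roughly as follows. This is not the authors' MATLAB implementation; the sampling rate, frame length, hop size, filter count and number of kept coefficients (16 kHz, 400 samples, 160 samples, 26 filters, 13 coefficients) are common choices assumed only for illustration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """Follow the steps of Fig.1: framing, windowing, DFT, mel filtering, log, DCT."""
    # 1. Framing: split the signal into overlapping frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # 2. Windowing: a Hamming window reduces spectral leakage at the frame edges
    frames = frames * np.hamming(frame_len)
    # 3. DFT: keep the magnitude-squared (power) spectrum
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 4. Mel filtering: triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    mel_energies = power @ fbank.T
    # 5. Logarithm: compresses dynamic range, mimicking loudness perception
    log_mel = np.log(mel_energies + 1e-12)
    # 6. DCT-II: decorrelates the filter outputs; keep the first n_ceps coefficients
    n = np.arange(n_filters)
    dct_basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_mel @ dct_basis.T

# One second of a synthetic 440 Hz tone as a stand-in for a recorded word
t = np.arange(16000) / 16000.0
feats = mfcc(np.sin(2 * np.pi * 440 * t))   # one 13-coefficient vector per frame
```

Each frame of speech is thus reduced to a short vector of cepstral coefficients, which is the data-size reduction the section describes.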
IJSTR2015
www.ijstr.org
INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 4, ISSUE 06, JUNE 2015
C. Hidden Markov Model (HMM)
A Hidden Markov Model is characterized by the initial state distribution (π), the state transition probability (A) and the symbol emission probability (B). In an HMM-based speech recognition system, there exist three main problems, called the evaluation, decoding and learning problems. The training and testing algorithms of HMM are discussed in detail in [8]. The probability (likelihood) of the observations given the model determines the expected recognized word. It is calculated by equation (3):
L = P(O|λ) = Σ_{i=1}^{N} α_T(i)                                       (3)

where α_T(i) is the forward variable of state i at the final observation time T.

Fig.1. Block diagram of the MFCC process: Speech → Framing → Windowing → Discrete Fourier Transform (DFT) → Mel Frequency Filtering → Logarithmic → Discrete Cosine Transform (DCT) → MFCC.
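A minimal sketch of how the likelihood of equation (3) is obtained with the forward algorithm. The paper's recognizer scores MFCC vectors against per-word models; the discrete two-state model and all probability values below are toy numbers chosen only to illustrate the computation:

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: L = P(O | model) = sum over i of alpha_T(i) (eq. 3)."""
    alpha = pi * B[:, obs[0]]        # initialisation: alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:                # induction over the observation sequence
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()               # termination: sum of alpha_T(i)

# Toy 2-state model with 3 possible observation symbols (illustrative numbers)
pi = np.array([0.6, 0.4])            # initial state distribution
A = np.array([[0.7, 0.3],            # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],       # symbol emission probabilities
              [0.1, 0.3, 0.6]])
likelihood = forward_likelihood(pi, A, B, [0, 1, 2])
```

In recognition, this likelihood is computed for every word model and the word whose model gives the highest value is chosen as the output.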
3. Implementation
The flowchart of the speech to text conversion system is illustrated in Fig.2. To convert input speech to text output, four main steps are developed using MATLAB. These steps are the speech database, preprocessing, feature extraction and recognition. Firstly, audio files for five words are recorded with the help of a computer; each word is recorded with ten different pronunciations, so a total of fifty audio files are stored in the speech database. Speech signals have more energy at low frequencies than at high frequencies, so the energy of the signal needs to be boosted at high frequencies. Depending on the recording environment, unwanted noise may worsen the recognition rate; this problem is overcome by the end point detection method. After the preprocessing stage is finished, features (coefficients) are extracted from the speech samples using Mel Frequency Cepstral Coefficients (MFCC). Finally, these MFCC coefficients are used as the input of the Hidden Markov Model (HMM) recognizer to classify the desired spoken word. The desired text output can be generated by the HMM method provided that the test audio file is covered by the existing speech database.
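The high-frequency boost described above is usually implemented as a first-order pre-emphasis filter. The paper does not state the filter or its coefficient, so the sketch below assumes the commonly used value 0.97:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1].
    Boosts the high-frequency energy of speech before feature extraction."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# A constant (purely low-frequency) signal is almost entirely suppressed
out = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```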
Fig.2. Flowchart of the speech to text conversion system: Start → Record Speech → Remove Noise → Calculate MFCCs → recognition decision (Yes → Desired Text Output; No → repeat) → End.
4. Simulation Results
Figure: time-domain waveforms (amplitude versus number of samples) of the recorded original signals, e.g. "Apple" and "Banana".
Table 1. Recognition results for three states (N=3)

Train data   Number of tests   Correct tests   Error   Accuracy
Apple        50                31              19      62%
Banana       50                32              18      64%
Computer     50                33              17      66%
Flower       50                31              19      62%
Key          50                30              20      60%
Table 2. Recognition results for four states (N=4)

Train data   Number of tests   Correct tests   Error   Accuracy
Apple        50                37              13      74%
Banana       50                36              14      72%
Computer     50                36              14      72%
Flower       50                39              11      78%
Key          50                35              15      70%
Table 3. Recognition results for five states (N=5)

Train data   Number of tests   Correct tests   Error   Accuracy
Apple        50                43              7       86%
Banana       50                42              8       84%
Computer     50                45              5       90%
Flower       50                46              4       92%
Key          50                43              7       86%

In Table 1, the percentage recognition rate for "apple" and "flower" is 62%. For "banana", the recognition rate is slightly higher at 64%, and "computer" has the best result of 66%, whereas the accuracy for "key" is the lowest at 60%. For four states (N=4), the percentage recognition rate increases to around 70% for all audio files, as shown in Table 2. According to Table 3, five states (N=5) give better accuracy than any other number of states; the recognition rates of the individual spoken words range from 84% to 92%.

Figure: average recognition accuracy versus the number of HMM states: 62.80% (N=3), 73.20% (N=4), 87.60% (N=5).

5. Conclusion

Acknowledgement
The author would like to thank Dr. Hla Myo Tun, Associate Professor and Head of the Department of Electronic Engineering, Mandalay Technological University, for his help, guidance, support and encouragement.

References