https://doi.org/10.1007/s00500-021-06291-2
Abstract
Nowadays, deep neural networks have become the prime approach for enhancing speech signals, as they yield better results than traditional methods. This paper describes the transformation of the enhanced speech signal obtained by applying a deep convolutional neural network (Deep CNN), which can model nonlinear relationships, and compares it with the Wiener filtering method, the best-performing technique for speech enhancement among the traditional methods. Denoising is performed in the frequency domain and the result is converted back to the time domain to analyze performance metrics for speech quality and speech intelligibility. Speech quality is analyzed based on the signal-to-noise ratio (SNR) and the perceptual evaluation of speech quality (PESQ); speech intelligibility is analyzed by the short-time objective intelligibility (STOI) measure. The denoised speech from both methods was evaluated, and the analysis of the results shows that the SNR of the conventional Wiener filtering method is much higher than that of the Deep CNN. However, the PESQ and STOI of the Deep CNN-based enhanced speech outperform the Wiener filtering method, and these performance metrics indicate that the Deep CNN achieves better results overall than the conventional technique.
Keywords Deep convolutional neural network · Noisy speech · Speech enhancement · Speech quality · Intelligibility
… predicting the variations because of the dynamic nature of noisy speech signals. The statistical assumptions that need to be made in the unsupervised models do not improve the performance of the denoised speech. The supervised models are data driven, which eliminates the statistical assumptions made on the clean and noisy speech signals.

Nowadays, enhancement techniques incorporate the taxonomy of artificial intelligence, in which machine learning (Srinivasan et al. 2006) and deep learning techniques (Kolbæk et al. 2017; Wang and Chen 2018; Chai et al. 2019) are widely applied to improve the clarity of speech (intelligibility) and to increase the listening capability (based on quality) so that the speech is well perceived. This is imperative, as listeners are interested in and focused on listening to speech of excellent quality and intelligibility. Denoising is a fundamental strategy implemented in applications that deal with speech signals, such as telecommunication (Rix et al. 2001), speaker recognition in biometrics (Jain et al. 2004), hearing aids (Healy et al. 2017), hands-free communication (Thiergart and Taseska 2014) and many more.

The drawbacks of the unsupervised techniques can be overcome by applying a deep neural network, which involves training the network with massive data in multiple noise conditions. The data-driven approach (Zhao et al. 2018) of the deep neural network makes it more efficient and responsive to untrained conditions and unseen noises. In the recent past, the commonly used techniques for supervised speech enhancement (Nossier et al. 2021) include mapping in the frequency domain or time–frequency masking, where the denoised speech is converted from the frequency domain back to the time domain. These methodologies enable the reconstruction of the speech signal from the frequency domain to the time domain with the phase of the noisy signal (Li et al. 2019).

The remainder of this paper is organised as follows: the recent work carried out in speech enhancement is discussed in Sect. 2. A clear explanation of the proposed Deep CNN system and a comparison with the Wiener filter are given in Sect. 3. Section 4 discusses the dataset used, the features extracted, the algorithm and its description. The results obtained and the conclusion are presented in Sects. 5 and 6, respectively.

2 Related works

Similar works carried out in the speech enhancement area that help in removing the background noise affecting the speech signal include the weighted noise encoder, which enhances the speech signal by considering the power spectrum of clean speech and the SNR to build the Wiener filter in the frequency domain (Xia and Bao 2014). Modeling of the time and frequency correlation dimensions by applying the improved minima controlled recursive averaging (IMCRA), and also incorporating the long short-term memory (LSTM) recurrent neural network (RNN) architecture and the CNN, exhibits good results in terms of the performance metrics (Yuan 2020). Cycle-consistent training (Meng et al. 2018) for enhancement optimizes the clean-to-noisy and noisy-to-clean speech mappings simultaneously.

The different DNN-based speech enhancement methodologies vary based on the neural network architecture, the training target and the selection of training features. Nowadays, the deep learning models that are becoming popular in the field of speech enhancement are the CNN (Zheng et al. 2020; Li et al. 2020), LSTM (Li et al. 2019) and RNN (Xian et al. 2021), which incorporate a transformation function between the spectral features of the noisy speech signal and the clean speech signal. As the CNN is widely used for image processing and recognition, it is a good candidate for the problems caused by the degradation of speech signals due to background noise. The SNR-aware CNN (Fu et al. 2016) for the enhancement process shows that the CNN is well suited to extracting time–frequency features and moves forward in achieving this goal. Loss functions based on the performance metric STOI (Fu et al. 2018; Li et al. 2020) are used for modeling the utterance as a whole.

A CNN implemented to perform the end-to-end speech enhancement task (Du et al. 2017) can estimate the phase of clean speech, which improves the quality and intelligibility of speech. Some speech enhancement methods perform direct enhancement on the raw speech waveforms by mapping (Fu et al. 2017; Pandey and Wang 2019) and are referred to as waveform-based approaches. The fully convolutional neural network (Park and Lee 2017) is one among them, allowing direct mapping and feature selection from a convolutional encoder–decoder model (Lan et al. 2020). The mean absolute error loss for training the CNN is obtained from the magnitudes of the enhanced STFT and the clean STFT (Pandey and Wang 2019). In some cases, a combination of CNN and RNN models (Hsieh et al. 2020) works out to be more suitable for capturing local and sequential correlations (Wang et al. 2021). Another approach uses a sequence-to-sequence model (Kameoka et al. 2020) with an LSTM RNN, where the encoder encodes the input sequence and the decoder decodes the output sequence for voice conversion.

The mapping function created from the noisy and clean speech signals by a nonlinear regression model (Xu et al. 2013) shows that the ability to handle unseen noise is diminished. In the ILMSAF-based speech enhancement, the performance of the network is reduced for the Volvo noise (Li et al. 2016; Sungheetha and Rajesh 2021; Kumar 2021).
As the task is to enhance the speech signal by removing the noise, the CNN is applied for speech enhancement, as it has been observed to give improved results compared to the multi-layer perceptron (Grais and Plumbley 2017).

The CNN is robust and well suited to speech enhancement. Therefore, in the proposed work, the Deep CNN is designed to give outperforming results. The Deep CNN takes the noisy speech signal as the input and converts it into the frequency domain to train the network, because the noise and the clean speech signal can be discriminated only in the frequency domain. The training is performed until the mean squared error between the clean speech signal and the denoised or enhanced speech signal is minimal.

3 Speech enhancement system

In today's scenario, the best of all techniques are the deep learning algorithms, as they can handle a lot of data and design a model by themselves. In this work, the Deep CNN is designed to perform speech enhancement, and a comparative study is done by analyzing its performance against the best conventional technique, i.e., the Wiener filter, as shown in Fig. 1. Therefore, the best conventional Wiener filter and the Deep CNN are taken for comparison. The comparison results show that each technique is best in its own way.

3.1 Model of speech signal

The noisy speech signal is acquired by adding the clean speech signal to the different types of noise, as given in Eq. 1. The task is to retrieve the clean speech signal from the noisy speech signal by eliminating the noise.

c(n): clean speech signal
b(n): noise signal
s(n): noisy speech signal

s(n) = c(n) + b(n)    (1)

3.2 Wiener filtering

The presence of noise is unavoidable in real-world scenarios of speech processing. The most fundamental methodology for noise reduction of a speech signal is the optimal Wiener filter. The Wiener filter acts as a linear filter that can be utilized to separate the clean speech signal from the noisy speech signal by reducing the MSE between the estimated signal and the original signal. Although the Wiener filter can achieve noise reduction, it also has the disadvantage of losing the speech signal's integrity. Therefore, the speech distortion should be managed by adequately manipulating the Wiener filter or by having explicit knowledge of the speech signal. In any speech communication system, the speech signal can be distorted by background noise and reverberation. Therefore, noise reduction methodologies and speech enhancement techniques are needed to obtain the desired speech signal from the corrupted one.

R(\omega) = \frac{C(\omega)}{S(\omega)} = \frac{C(\omega)}{C(\omega) + B(\omega)}    (2)

where C(\omega) is the signal spectrum, B(\omega) the noise power spectrum and S(\omega) the noisy speech spectrum.

R_{Wiener}(\omega) = \frac{C(\omega)}{S(\omega)} = \frac{S(\omega) - B(\omega)}{S(\omega)}    (3)

With \hat{E}_s denoting the estimate of the enhanced signal,

\hat{E}_s(\omega, k) = R_{Wiener}(\omega)\, S(\omega, k)    (4)

\hat{d}[n] = \mathrm{IFFT}\left(\sqrt{\hat{E}_s(\omega, k)}\right)    (5)

By combining the magnitude of the clean speech spectral data with the phase of the noisy speech, the estimate of the enhanced speech is obtained. It is given as

\hat{d}[n] = \hat{d}[n]\,\angle s[n]    (6)

where \hat{d} is the estimate of the enhanced speech.
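As a rough illustration of Eqs. (2)–(6), the sketch below applies a Wiener-type gain in the STFT domain and reconstructs the time-domain signal with the noisy phase. It is a minimal sketch, not the authors' implementation: the noise power spectrum B(\omega) is assumed to be estimated from the first few (noise-only) frames, the gain is applied directly to the STFT magnitude rather than through the square-root form of Eq. (5), and all function names and parameters are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(noisy, fs, frame_ms=10, noise_frames=10):
    """Wiener-type gain (S - B) / S applied per frequency bin, with the
    enhanced magnitude recombined with the noisy phase (cf. Eq. 6).
    The leading `noise_frames` frames are assumed to contain noise only."""
    nperseg = int(fs * frame_ms / 1000)
    _, _, S = stft(noisy, fs=fs, window='hamming', nperseg=nperseg)

    noisy_power = np.abs(S) ** 2                                              # S(w) per frame
    noise_power = noisy_power[:, :noise_frames].mean(axis=1, keepdims=True)   # B(w) estimate

    # R_Wiener(w) = (S(w) - B(w)) / S(w), floored at zero to avoid negative gains
    gain = np.maximum(noisy_power - noise_power, 0.0) / (noisy_power + 1e-12)

    enhanced_stft = gain * np.abs(S) * np.exp(1j * np.angle(S))   # magnitude with noisy phase
    _, enhanced = istft(enhanced_stft, fs=fs, window='hamming', nperseg=nperseg)
    return enhanced[:len(noisy)]
```

In practice the noise estimate would come from a voice activity detector or a minimum-statistics tracker rather than a fixed number of leading frames; the fixed-frame estimate is used here only to keep the sketch self-contained.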
3.3 Speech denoising and enhancement using deep convolutional neural network
The CNN is designed with multiple hidden layers and rectified linear unit (ReLU) activation functions for speech enhancement. The input applied to the Deep CNN system is the frames of the noisy speech signal, and the expected output is the denoised speech signal.

The clean and noisy speech signals are converted to the frequency domain using the STFT. The magnitude spectrum of the clean speech signal is taken as the target, while the noisy speech signal is taken as the predictor and presented to the Deep CNN for denoising, as shown in Fig. 2. The regression network uses the magnitude of the noisy speech signal to reduce the mean square error between the denoised speech signal and the clean speech signal. The output from the Deep CNN gives the denoised signal in the frequency domain. The denoised speech signal is then converted to the time domain using the output magnitude spectrum from the Deep CNN network and the phase of the noisy speech signal.
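To make the regression setup concrete, a minimal PyTorch sketch is given below: a stack of 2-D convolutional layers with ReLU activations maps the noisy magnitude spectrogram to the clean magnitude spectrogram under an MSE loss. The layer count, channel widths, kernel sizes, optimizer and placeholder tensor shapes are assumptions for illustration only; the paper does not specify its architecture in this excerpt.

```python
import torch
import torch.nn as nn

class DenoisingCNN(nn.Module):
    """Sketch of a Deep CNN regressor: noisy STFT magnitude in,
    denoised STFT magnitude out (depth and widths are assumed)."""
    def __init__(self, channels=(1, 18, 30, 8, 1)):
        super().__init__()
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=(9, 5), padding=(4, 2)),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers[:-1])   # no ReLU after the output layer

    def forward(self, noisy_mag):
        # noisy_mag: (batch, 1, freq_bins, time_frames)
        return self.net(noisy_mag)

model = DenoisingCNN()
criterion = nn.MSELoss()          # minimised between denoised and clean magnitudes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# one hypothetical training step on placeholder magnitude spectrograms
noisy_mag = torch.rand(8, 1, 129, 100)
clean_mag = torch.rand(8, 1, 129, 100)
loss = criterion(model(noisy_mag), clean_mag)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```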
4 Algorithm description

The noisy speech signal is generated for feeding the Deep CNN. The noisy data set is created by mixing the clean speech with different noise types, such as washing machine noise, rainbow noise, jet airplane noise and train whistle noise, at different noise levels of 0 dB, 5 dB, 10 dB and 15 dB.

The dataset contains 400 utterances and is split 3:1 for training and testing: the Deep CNN is trained with 300 sentences and tested with 100 sentences. The training set is created by mixing the noise with the clean speech signal at the different noise levels. From the testing set, a noisy speech signal is randomly chosen to check the denoising ability of the network.
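A minimal sketch of how one noisy utterance could be generated at a chosen SNR level (0, 5, 10 or 15 dB) by scaling the noise before adding it to the clean speech, following s(n) = c(n) + b(n) of Eq. 1. The energy-ratio scaling rule and the variable names are assumptions; the paper does not spell out its mixing procedure.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise energy ratio equals `snr_db`
    and add it to `clean` (s(n) = c(n) + b(n))."""
    noise = noise[:len(clean)]                     # trimming/looping of short noise omitted
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g. one training example per noise level (clean_speech and machine_noise are placeholders)
# for snr in (0, 5, 10, 15):
#     noisy = mix_at_snr(clean_speech, machine_noise, snr)
```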
4.2 Feature extraction

The first step is to convert the speech signal from the time domain to the frequency domain using the STFT to extract features. The magnitude STFT vectors of the clean speech and the noisy speech are the input features to the Deep CNN model. The speech signal is divided into 10 ms frames with no frame shift. In converting from the time domain to the frequency domain using the STFT, the Hamming window …
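The surviving description above (10 ms frames, no frame shift, Hamming window, magnitude STFT) might be realised roughly as follows; the scipy-based implementation and the sampling-rate handling are assumptions.

```python
import numpy as np
from scipy.signal import stft

def magnitude_stft(signal, fs, frame_ms=10):
    """Magnitude STFT features: 10 ms Hamming-windowed frames with no
    frame shift (here interpreted as non-overlapping frames)."""
    nperseg = int(fs * frame_ms / 1000)
    _, _, spec = stft(signal, fs=fs, window='hamming',
                      nperseg=nperseg, noverlap=0)
    return np.abs(spec)                  # shape: (freq_bins, time_frames)

# predictors = magnitude_stft(noisy_speech, fs)   # input to the Deep CNN (placeholder names)
# targets    = magnitude_stft(clean_speech, fs)   # regression target
```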
Table 1 PESQ description

PESQ score   Description
4–5          Excellent
3–4          Good
2–3          Fair
1–2          Poor
0–1          Bad

Fig. 4 Performance improvement comparison for noise levels 0 dB, 5 dB, 10 dB and 15 dB: a washing machine noise, b rainbow noise, c train whistle noise, d jet airplane noise

5 Results and discussions

The clean speech signal is added to different noise types, such as washing machine noise, rainbow noise, train whistle noise and jet airplane noise, at different noise levels of 0 dB, 5 dB, 10 dB and 15 dB. The noisy speech signal generated by adding washing machine noise is given as input to the Wiener filter and to the DNN-based speech enhancement system. The SNR of the denoised signal is improved compared to the SNR of the noisy signal.

The noisy signals are taken at the different noise levels of 0 dB, 5 dB, 10 dB and 15 dB for the different noise types, which were added to the clean speech signal to form the noisy speech signal. For analyzing the enhanced speech signal, the performance metrics considered are SNR, PESQ and STOI. The performance metrics are calculated as follows:

• Signal to Noise Ratio (SNR)
Table 2 Comparison of SNR, PESQ and STOI of noisy signal and denoised signal using Wiener filter and Deep CNN

SNR
Noise level (dB)   Washing machine noise          Rainbow noise                  Train whistle noise            Jet airplane noise
                   Noisy     Wiener    Deep CNN   Noisy     Wiener    Deep CNN   Noisy     Wiener    Deep CNN   Noisy     Wiener    Deep CNN
0                  19.4898   34.7837   30.5801    19.3679   37.2929   30.5738    19.2925   39.801    35.7682    19.5242   36.4482   33.2925
5                  23.0452   35.3452   33.4311    22.9899   37.7086   32.5593    23.0805   38.0964   36.0604    23.1936   38.9859   34.2398
10                 25.5831   36.5445   34.8537    25.5454   37.9403   34.6552    25.6373   38.9824   36.0091    25.5068   44.103    35.1301
15                 26.6226   38.3302   35.7433    26.7152   37.4407   35.5169    26.6514   39.346    36.1191    27.0348   42.8054   36.1123

PESQ
Noise level (dB)   Washing machine noise          Rainbow noise                  Train whistle noise            Jet airplane noise
                   Noisy     Wiener    Deep CNN   Noisy     Wiener    Deep CNN   Noisy     Wiener    Deep CNN   Noisy     Wiener    Deep CNN
0                  2.0374    1.8916    2.3326     1.91      1.812     2.0966     1.8663    2.1188    2.6372     2.2741    1.5465    2.3693
5                  2.2703    2.2672    2.4992     2.2918    1.9707    2.3908     2.5498    2.2221    2.7768     2.5966    1.6103    2.6944
10                 2.6184    2.2699    2.6496     2.4612    1.995     2.56       2.7473    2.4913    2.8795     2.6331    1.7719    2.8764
15                 2.6983    2.3078    2.816      2.6623    1.9265    2.6776     2.9022    2.4074    2.9699     2.6785    1.8318    2.7983

STOI
Noise level (dB)   Washing machine noise          Rainbow noise                  Train whistle noise            Jet airplane noise
                   Noisy     Wiener    Deep CNN   Noisy     Wiener    Deep CNN   Noisy     Wiener    Deep CNN   Noisy     Wiener    Deep CNN
0                  0.5166    0.0173    0.5809     0.5252    0.0039    0.5812     0.7598    0.0058    0.7923     0.6334    0.0718    0.6726
5                  0.6569    0.0278    0.6814     0.648     0.0459    0.6609     0.8047    0.0535    0.8164     0.6951    0.0724    0.7074
10                 0.7284    0.0174    0.7501     0.7291    0.0263    0.744      0.7781    0.0783    0.7814     0.7459    0.0214    0.7685
15                 0.7542    0.048     0.8099     0.7331    0.0686    0.7704     0.8256    0.0177    0.8912     0.7463    0.0397    0.7693
Fig. 5 Spectrogram analysis of clean speech, noisy speech and denoised speech signal: a 0 dB, washing machine noise; b 5 dB, rainbow noise; c 10 dB, train whistle noise; d 15 dB, jet airplane noise
• Perceptual Evaluation of Speech Quality (PESQ)

PESQ is standardized by the International Telecommunications Union (ITU). The PESQ value ranges as described in Table 1.

• Short Time Objective Intelligibility (STOI)

STOI is an objective measure that correlates with subjective speech intelligibility; the larger the value, the better the intelligibility. The STOI value ranges between 0 and 1.
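For reference, the three metrics could be computed along the following lines. The SNR formula is the standard energy-ratio definition (the paper's exact formulation is not reproduced in this excerpt), and the PESQ and STOI calls assume the third-party pesq and pystoi Python packages.

```python
import numpy as np
from pesq import pesq      # ITU-T P.862 PESQ (third-party package, assumed available)
from pystoi import stoi    # short-time objective intelligibility (third-party package)

def snr_db(clean, processed):
    """Standard SNR in dB between the clean reference and a processed signal."""
    residual = processed - clean
    return 10 * np.log10(np.sum(clean ** 2) / (np.sum(residual ** 2) + 1e-12))

# fs = 16000                                   # assumed sampling rate
# print(snr_db(clean, denoised))               # quality (dB)
# print(pesq(fs, clean, denoised, 'wb'))       # PESQ, roughly -0.5 to 4.5
# print(stoi(clean, denoised, fs))             # STOI, 0 to 1
```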
The audio of the noisy speech signal was inferior in quality as well as intelligibility. When the signals were fed to the Deep CNN system for speech enhancement, the performance of the denoised speech was well improved in terms of quality, which was clearly observed from the values of SNR and PESQ. The intelligibility was also improved, as analyzed from the STOI scores. Table 2 shows the quality (SNR and PESQ) and intelligibility (STOI) of the noisy signals and the improvement in the denoised signal's performance metrics.

In order to analyze the quality, SNR and PESQ are considered, and to evaluate the clarity of speech, the metric STOI is taken. The subjective quality of the spoken speech signal is analyzed by PESQ. The value of PESQ ranges between −0.5 and 4.5; the higher the PESQ value on the scale, the greater the improvement in quality of the denoised speech. STOI reflects the subjective intelligibility of speech and ranges between 0 and 1, with higher values indicating better intelligibility.

As per the observations from the performance metrics shown in Table 2, the SNR of the denoised signal through Wiener filtering shows good improvement compared to the Deep CNN model for the different noise levels as well as the different noise types. The PESQ value of the Wiener filter is in the poor range (1–2) for the rainbow and jet airplane noise, as per the PESQ scores given in Table 1, but the PESQ value of the Wiener filter for the washing machine noise and train whistle noise is in the fair range (2–3).

For the Deep CNN, the PESQ values for all the noise levels and noise types fall in the fair (2–3) category of the mean opinion score. As the Wiener filter focuses more on the quality of the speech signal, it gives good results in terms of SNR and moderate results for PESQ, but the intelligibility of speech is compromised, which reduces the clarity of the speech signal. The STOI scores show that the Wiener filter is not capable of improving the intelligibility. On the other hand, the Deep CNN shows drastic improvements in the STOI values, which in turn represent the intelligibility of the denoised speech signal.

The consolidated results in Table 2 show the improvement in the performance metrics of the Deep CNN compared to the conventional Wiener filtering algorithm for denoising the speech signal. The Wiener filtering method shows outstanding results on the SNR and the PESQ. It is clearly observed that the Wiener filter has good capability in improving the quality of the speech signal. When the intelligibility of the speech signal is considered, the performance of the Wiener filter is deficient. However, the DNN shows a drastic increase in terms of the clarity of the speech signal.

The denoised signal results shown in Fig. 4 indicate that the SNR of the noisy signal is much more improved by the Wiener filter than by the Deep CNN. However, in terms of the other performance metric representing the quality of speech, i.e., PESQ, the denoised speech signal is much more improved by the Deep CNN than by the Wiener filter. When the intelligibility of the denoised speech is analyzed, it is evident that the STOI scores of the Deep CNN give an excellent improvement in the clarity of speech. The spectrograms of the clean speech, noisy speech and denoised speech for the different types of noise and noise levels are shown in Fig. 5.

6 Conclusion

The proposed single channel speech enhancement system estimates the magnitude of the speech signal in the frequency domain. The Deep CNN-based single channel speech enhancement system is compared with the traditional Wiener filtering method. Evaluation is carried out under multiple noise conditions to analyze the denoising capability of the speech enhancement system, and the results indicate that the Deep CNN-based system outperforms the best-performing traditional technique, Wiener filtering, in terms of quality and intelligibility. The quality of the denoised speech signal based on the SNR shows a drastic improvement for the Wiener-filtered denoised signal. However, the Deep CNN yields excellent results in terms of quality and intelligibility, as analyzed based on the PESQ and STOI scores. Thus, it should be recorded that the Deep CNN outperforms the traditional Wiener filter technique.

Funding No funding.

Declarations

Conflict of interest We don't have any conflict of interest.

Human and animal rights statement Humans/animals are not involved in this research work.

Data availability statements The datasets analyzed during the current study are available from the University of Edinburgh, Centre for Speech Technology Research (CSTR), https://datashare.is.ed.ac.uk/handle/10283/2791.
Xu Y, Du J, Dai L-R, Lee C-H (2013) An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process Lett 21(1):65–68
Yuan W (2020) A time–frequency smoothing neural network for speech enhancement. Speech Commun 124:75–84
Zhao H, Zarar S, Tashev I, Lee C (2018) Convolutional-recurrent neural networks for speech enhancement. In: International conference on acoustics, speech, and signal processing, pp 2401–2405
Zheng N, Shi Y, Rong W, Kang Y (2020) Effects of skip connections in CNN-based architectures for speech enhancement. J Signal Process Syst 92:875–884