
Hindawi

Mathematical Problems in Engineering


Volume 2020, Article ID 3608286, 19 pages
https://doi.org/10.1155/2020/3608286

Research Article
An Improved Speech Segmentation and Clustering Algorithm
Based on SOM and K-Means

Nan Jiang^1 and Ting Liu^2

^1 Criminal Investigation Police University of China, Shenyang 110854, China
^2 Liaoning University, Shenyang 110036, China

Correspondence should be addressed to Ting Liu; liuting_tinka@sina.cn

Received 14 May 2020; Accepted 27 July 2020; Published 12 September 2020

Academic Editor: Thomas Hanne

Copyright © 2020 Nan Jiang and Ting Liu. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
This paper studies the segmentation and clustering of speaker speech. In order to improve the accuracy of speech endpoint detection, the traditional double-threshold method's short-time average zero-crossing rate is replaced by a more discriminative spectral centroid feature, the thresholds are selected from the local maxima of the statistical feature sequence histograms, and a new speech endpoint detection algorithm is proposed. Compared with the traditional double-threshold algorithm, it effectively improves detection accuracy and noise robustness at low SNR. The conventional k-means clustering algorithm needs the number of clusters to be given in advance and is greatly affected by the choice of initial cluster centers; at the same time, the self-organizing neural network algorithm converges slowly and cannot provide accurate clustering information. An improved k-means speaker clustering algorithm based on the self-organizing neural network is therefore proposed. The number of clusters is predicted from the winning situation of the competitive neurons in the trained network, and the weights of those neurons are used as the initial cluster centers of the k-means algorithm. Experimental results on multiperson mixed speech segmentation show that the proposed algorithm effectively improves the accuracy of speech clustering and makes up for the shortcomings of both the k-means algorithm and the self-organizing neural network algorithm.

1. Introduction

Speech segmentation is an essential basic task in speech recognition and speech synthesis, and its quality has a huge impact on follow-up speech recognition. Although manual segmentation and annotation have high accuracy, they are time-consuming and require skilled domain experts to complete, so automatic speech segmentation has become a hot research topic in speech processing [1].

The speaker segmentation and clustering studied in this paper segments continuous audio containing the speech of several speakers into speech segments, so that each segment contains the speech of only one speaker. Then, the speech segments of the same speaker are grouped together and marked with distinctive labels, which determines who is speaking and when. This task is also known as speaker diarization [2–4].

Speaker segmentation and clustering technology, as an important front-end processing technology, can obtain the speaker change information in the audio, which facilitates subsequent speech processing applications such as speech recognition, further machine translation, and grammar analysis. Most current speech processing technology [5–7] handles only audio files containing one person speaking; when the audio contains multiple people speaking, it cannot meet the demand. At present, speaker segmentation and clustering systems have achieved good performance on two-person telephone conversation data, but they still face many challenges in complex scenes such as conferences, television broadcasts, and multiperson dialogue. The existing problems include the following: the number of speakers is uncertain and there is no prior information about it; speakers take turns quickly and the length of each speaker's speech is uncertain; and there are a variety of noises in the speech. How to solve these problems effectively and improve the robustness of the segmentation and clustering system has become an important research direction, which is also the main research content of this paper.

In speech recognition, the result of speech endpoint detection greatly affects the accuracy and speed of speech recognition and segmentation [8]. Accurate endpoint detection can save a lot of computation in the feature extraction of follow-up speech recognition and also make the acoustic model more accurate, so as to improve the accuracy of speech segmentation and recognition. Accurate endpoint detection of speech signals against complex backgrounds is a very important research branch in the field of speech recognition [9].

Endpoint detection locates the speech segments in a section of original sound data and finds the start- and endpoints of each speech segment [10, 11]. By eliminating the influence of channel and background noise, accurately determining the start- and endpoints of each sound segment, removing the silent segments from the speech signal, and concentrating the energy of the whole speech signal on the sound segments instead of being disturbed by background noise and silence, endpoint detection can effectively improve the accuracy of speech segmentation and recognition. The performance, robustness, and processing time of a speech recognition system can be greatly improved by accurate and efficient endpoint detection. Traditional endpoint detection methods are mainly based on speech characteristics such as the short-time energy and the zero-crossing rate [12], but these characteristics are limited to situations with no noise or a high signal-to-noise ratio and lose their effectiveness when the signal-to-noise ratio is low [13, 14].

According to how segmentation and clustering are combined, current mainstream speech segmentation and clustering can be divided into two categories [15]. One is the asynchronous strategy, that is, segmentation first and then clustering; in this strategy, segmentation and clustering are implemented step by step. The other is the synchronous strategy, in which speaker clustering is completed while the speech of different speakers is being segmented. ELISA proposed a typical speaker classification system in the literature [16], which combines two typical methods: one is based on the asynchronous strategy, represented by the CLIPS system, which first automatically cuts the audio into many small segments so that each segment contains only one speaker and then merges the segments of the same speakers through clustering; the other is based on the synchronous strategy, using the hidden Markov model (HMM) to achieve speaker clustering while segmenting, with the LIA system as the representative of this kind of method. These two kinds of systems have their own advantages and disadvantages: the former is relatively simple, but the errors of each clustering step may accumulate; the latter can correct the errors after each clustering step, but it costs a lot of computing time and cannot obtain enough data to train the models.

Speech segmentation is an important part of asynchronous segmentation and clustering, which includes speaker transform point detection and speech segmentation. Transform point detection is the key step of the segmentation module. The commonly used speaker speech segmentation methods are silence-based methods, metric-based methods, and model-based methods.

References [8, 15] proposed improved endpoint detection algorithms based on the combination of the energy and frequency band variance method and on hybrid features, respectively, in 2019. Reference [11] studied a speech endpoint detection method based on a fractal dimension method with an adaptive threshold in 2020. In reference [17], a cepstrum feature is used for endpoint detection, the cepstrum distance instead of the short-time energy is used for threshold judgment, and speech detection based on the hidden Markov model is improved to adapt to noise changes. Reference [18] proposed a strongly noise-immune VAD algorithm based on wavelet analysis and a neural network. The advantage of silence-based algorithms is that the operation is relatively simple and the effect is good when the background noise is not complex, but their limitations are exposed against complex backgrounds, so some more effective algorithms have been proposed.

Reference [19] studies speaker transformation point detection with a variable window length and realizes online detection of transformation points, but its computation is relatively heavy. Delacourt and Wellekens proposed a two-step speech segmentation algorithm, which first uses a fixed window to segment the speech initially and then merges the segmented speech segments; this method has achieved good segmentation results on different databases. The advantage of distance-based speech segmentation is that it does not need any prior knowledge of the speech and its computational cost is low; the disadvantage is that thresholds must be set according to experience, so its robustness and stability are poor, and it easily detects many redundant segmentation points.

The model-based method trains models of different speakers from a corpus and then uses the trained models to classify the speech frame by frame, so as to detect the change points of speakers. Commonly used methods are the universal background model (UBM) [20, 21], the support vector machine (SVM) [22], and deep neural networks (DNNs) [23]. The advantage of the model-based segmentation method is that it has higher accuracy than the distance-based method; the disadvantages are that it requires prior knowledge and its computational cost is very high.

In the literature [24], the Gaussian mixture model is used for class modeling, which achieves high clustering purity. Reference [25] studies a speaker clustering method based on the k-means algorithm, but the clustering results are greatly affected by the choice of initial cluster centers; if the choice is not appropriate, the algorithm may fall into a local optimal solution, and the number of clusters K needs to be given in advance.

To sum up, in order to improve the accuracy of speech endpoint detection, this paper proposes a new speech endpoint detection algorithm, which replaces the traditional double-threshold method's short-time average zero-crossing rate with a better spectral centroid feature, smooths the feature curves with a median filter, and selects the threshold values by counting the local maxima of the feature sequence histograms.
Compared with the traditional double-threshold algorithm, the proposed speech endpoint detection algorithm retains higher detection accuracy and noise immunity at low SNR.

The k-means algorithm has the advantages of convenience, fast calculation, and accurate results, but it needs the number of clusters to be given in advance, and its results are greatly affected by the choice of the initial cluster centers, so it easily falls into a local optimum. The self-organizing neural network (SOM) has the advantages of strong interpretability, strong learning ability, and visualization, but its convergence speed is slow, it cannot provide accurate clustering information, and its clustering accuracy on small sample volumes is poor. In order to seek a better clustering method, the self-organizing neural network is introduced into speaker clustering, and an improved k-means speaker clustering algorithm based on the self-organizing neural network is designed. The network is used to predict the number of clusters and the initial cluster centers of the k-means algorithm: the number of clusters is predicted from the winning situation of the neurons in the competitive layer of the trained network, and the weights of those neurons are used as the initial cluster centers of the k-means algorithm to cluster speakers. The experimental results of multispeaker mixed speech segmentation show that the improved clustering algorithm can not only make up for the shortcomings of the two algorithms but also improve the clustering accuracy.

2. Speaker Speech Segmentation Based on Improved Double-Threshold Endpoint Detection

2.1. Endpoint Detection Principle of the Traditional Double-Threshold Method. The double-threshold endpoint detection method combines the short-time energy and the short-time average zero-crossing rate. Before endpoint detection starts, two thresholds are set for each of the short-time energy and the short-time average zero-crossing rate, and the thresholds are set empirically. The first is a low threshold: its value is small, so it is more sensitive to signal changes and more easily exceeded. The second is a high threshold: its value is large, and the signal must reach a certain strength before it is exceeded. Exceeding the low threshold does not necessarily mean the beginning of speech, since it may be caused by short-term noise; only exceeding the high threshold allows one to basically conclude that the speech signal has begun.

The whole speech signal can be divided into several segments: the silence segment, the transition segment, the voice segment, and the end segment. The basic steps of endpoint detection are as follows:

(1) In the silence segment, if one of the features of short-time energy or zero-crossing rate exceeds the low threshold, the position is marked as the tentative beginning of speech, and the transition segment is entered.

(2) In the transition segment, if the energy or zero-crossing rate characteristics of consecutive frames of speech exceed the high threshold, it is confirmed that the real speech segment has been entered; otherwise, the current state is restored to the silent state.

(3) The endpoint of the speech segment can be detected in the reverse direction according to the above method.

To sum up, the flowchart of double-threshold endpoint detection is shown in Figure 1.

2.2. Defects of the Conventional Double-Threshold Method for Endpoint Detection. Its ability to resist noise is weak. The noise environment is the main factor affecting the detection results, and different SNRs and different noises will affect the accuracy of detection. Some noises contain rich high-frequency components, and correspondingly their zero-crossing rate is relatively high; if the noise is too strong, some silent segments will have a higher zero-crossing rate than vowels and initials. In a low-SNR environment, the detection results are extremely unstable.

The threshold values are usually set by experience. It is extremely imprecise to use a fixed threshold to detect the speech of different speakers or different situations.

Both the short-time energy and the short-time average zero-crossing rate are extracted in the time domain, so although the calculation process is simple, the actual characteristics of speech are not fully expressed.

The double-threshold method is generally used in speech recognition, where it can only detect the beginning of a speech but cannot detect the internal pauses of the speech. When endpoint detection is used for speech segmentation, the time span of the corpus is larger than the short utterances handled in speech recognition, so all the segmentation points in a long audio must be detected. Obviously, the traditional method cannot meet these requirements.

2.3. Improved Design of the Double-Threshold Endpoint Detection Algorithm. In view of the defects of the traditional double-threshold endpoint detection algorithm, the detection method is improved in the following three aspects:

(i) In view of the limitations of the short-time average zero-crossing rate feature, the spectral centroid feature is used to replace it; the spectral centroid is combined with the short-time energy for detection.

(ii) In order to improve the noise robustness of the double-threshold method, the curves of the two features are smoothed by median filtering.

(iii) In order to solve the problem of poor accuracy caused by threshold selection based on experience, an algorithm is proposed to select the thresholds reasonably by analyzing the whole feature sequence.
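As a concrete illustration of the two-threshold state machine described in Section 2.1, the following Python sketch walks frames through the silence and transition states. This is hypothetical code, not from the paper: the per-frame features and the four empirical threshold values are assumed to be given.

```python
# Illustrative sketch of the classical double-threshold endpoint detector
# (Section 2.1). All names are hypothetical; features/thresholds are inputs.

def double_threshold_vad(energy, zcr, e_low, e_high, z_low, z_high, confirm=3):
    """Return the index of the first confirmed speech frame, or None.

    energy, zcr : per-frame short-time energy and zero-crossing rate.
    *_low/*_high : empirically chosen low/high thresholds.
    confirm : number of consecutive frames that must exceed a high
              threshold before the transition segment is confirmed.
    """
    state = "silence"
    start = None
    for i in range(len(energy)):
        if state == "silence":
            # A low-threshold crossing marks only a *tentative* start.
            if energy[i] > e_low or zcr[i] > z_low:
                state, start = "transition", i
        elif state == "transition":
            window = range(i, min(i + confirm, len(energy)))
            if all(energy[j] > e_high or zcr[j] > z_high for j in window):
                return start                      # confirmed: real speech
            if energy[i] <= e_low and zcr[i] <= z_low:
                state, start = "silence", None    # false alarm, fall back
    return None
```

The endpoint of a segment would be found the same way scanning in reverse, as step (3) describes.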
[Figure 1: Flowchart of the traditional double-threshold method VAD.]

2.3.1. Spectral Centroid Characteristics. The spectral centroid is a parameter describing the property of timbre. Different from the short-time energy and the short-time average zero-crossing rate, the spectral centroid is a characteristic parameter extracted in the frequency domain. First, a short-time Fourier transform is applied to the signal, and then time-frequency analysis is performed. After the spectrogram of the signal is obtained, the spectral centroid C_i of the speech of the i-th frame is

  C_i = [ Σ_{k=1}^{N} (k + 1) X_i(k) ] / [ Σ_{k=1}^{N} X_i(k) ].  (1)

In the formula, X_i(k) is the k-th discrete Fourier transform (DFT) coefficient in the spectrogram of the speech of the i-th frame. As can be seen, the spectral centroid represents the center of spectral gravity and is the concentration point of spectral energy; generally speaking, the smaller the spectral centroid, the more concentrated the energy is in the low-frequency range.

The main reasons for selecting the combination of short-time energy and spectral centroid features for endpoint detection are as follows: for simple cases (background noise that is not very strong), the short-time energy of speech segments is usually greater than that of nonspeech segments, and the spectral centroid, as a frequency-domain feature, can reflect the frequency information of the signal more accurately than the short-time average zero-crossing rate in the time domain. If the nonspeech segments contain only simple ambient noise, then the spectral centroid of the noise is usually lower than that of the speech segments.

2.3.2. Median Filter Smoothing Method. After the short-time energy and spectral centroid features are extracted, it is unreliable to set the threshold values on the feature curves directly when detecting speech, because when the signal-to-noise ratio is low, the fluctuation of the feature curves in nonspeech segments is large: a low threshold value will easily lead to misjudgment, while a high threshold value will lead to missed detections. Therefore, it is necessary to reduce the fluctuation of the feature curves in the nonspeech segments, and median filtering can be used to smooth the curves.

Median filtering is a nonlinear smoothing technique based on statistical ordering theory. The basic idea is to find, for any signal element (sound or image), the element closest to its surroundings. The principle is to replace the value of a point in the signal sequence with the median value of the points in its neighborhood, so as to eliminate isolated noise points.

2.3.3. Threshold Selection Algorithm. After median smoothing, the short-time energy and spectral centroid characteristic curves are smooth. The traditional double-threshold method sets the threshold by experience, but the speech characteristics of different people or different situations vary widely, so using the same threshold to filter speech is very inaccurate. Therefore, a new algorithm is designed, which can select the threshold dynamically and reasonably to improve the detection accuracy in the presence of noise.

First, the histogram of the smoothed feature sequence is calculated. The histogram is an accurate graphical representation of the distribution of data and an estimate of the probability distribution of a variable. To establish the histogram, the first step is to partition the range of values, usually at equal intervals, and then count the number of times the data fall in each section.

Taking the spectral centroid characteristic sequence as an example, the minimum and maximum values of the spectral centroid coefficients are first found, the range from the minimum to the maximum is divided evenly into L sections, the frequency of the spectral centroid coefficients appearing in each section is counted, and finally the histogram is drawn. Let the value of item i (i = 1, 2, . . . , L) in the histogram be f(i).

A local maximum value M of the statistical histogram arises where the probability of occurrence of the characteristic sequence is much greater than at the adjacent positions; such a place is very likely the transition from nonspeech to speech. The basic principle is as follows: if a section appears more times than the adjacent sections in the histogram, the characteristic coefficient value corresponding to the center of the section is a local maximum.
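The spectral centroid of equation (1) and the median smoothing of Section 2.3.2 might be sketched as follows. This is an illustrative pure-Python version, not the authors' Matlab code; a practical implementation would use an FFT library, and all function names here are hypothetical.

```python
# Illustrative per-frame spectral centroid (equation (1)) and median
# smoothing (Section 2.3.2). Pure-Python sketch for clarity, not speed.

import cmath
from statistics import median

def dft_magnitudes(frame):
    """Magnitudes |X(k)| of the DFT of one windowed frame (naive O(N^2))."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def spectral_centroid(frame):
    """Equation (1): C_i = sum_k (k+1) X_i(k) / sum_k X_i(k)."""
    x = dft_magnitudes(frame)
    return sum((k + 1) * x[k] for k in range(len(x))) / sum(x)

def median_smooth(seq, radius=2):
    """Replace each value by the median of its neighborhood, which
    suppresses isolated spikes in the feature curve."""
    out = []
    for i in range(len(seq)):
        lo, hi = max(0, i - radius), min(len(seq), i + radius + 1)
        out.append(median(seq[lo:hi]))
    return out
```

A constant (DC-only) frame has all its energy in the lowest bin, so its centroid is the minimum value 1, matching the remark that a small centroid means energy concentrated at low frequencies.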
Figures 2 and 3 show the histograms of the short-time energy and spectral centroid characteristic sequences, respectively.

[Figure 2: Histogram of the short-time energy characteristic sequence.]

[Figure 3: Histogram of the spectral centroid characteristic sequence.]

The specific statistical method is as follows. Set a step length step, and judge in turn from the first item to the (L − step)-th item of the histogram. When i ≤ step, if

  mean(f(1 : i)) < f(i)  &&  mean(f(i + 1 : i + step)) < f(i),  (2)

then the characteristic coefficient corresponding to the center of the i-th section of the histogram is a local maximum. When i > step, if

  mean(f(i − step : i − 1)) < f(i)  &&  mean(f(i + 1 : i + step)) < f(i),  (3)

then the characteristic coefficient corresponding to the center of the i-th section of the histogram is a local maximum. According to the above statistical method, let the number of detected maximum values be n and the threshold value of the characteristic sequence be T. The calculation of T is divided into the following three cases:

(1) n = 0; then,

  T = ( Σ_{k=1}^{N} C_k ) / 4N,  (4)

where C_k is the k-th value of the feature sequence. This expression means that if no local maximum is detected from beginning to end, the threshold value is replaced by one quarter of the average value of the feature sequence, but this case is not common.

(2) n = 1; then,

  T = M,  (5)

where M is the only detected local maximum; this is also not often the case. Usually, more than two local maxima are detected.

(3) n ≥ 2; then,

  T = (W · M_1 + M_2) / (W + 1).  (6)

Arrange all the detected maximum values in descending order of frequency. In equation (6), M_1 and M_2 are the first two maximum values, and W is a user-defined parameter; the higher W is, the closer the threshold value is to the first maximum value M_1.

The thresholds of the short-time energy and spectral centroid characteristics, denoted T_1 and T_2, respectively, are calculated by this method. When both features in a frame of the audio signal are higher than their threshold values, the frame is judged to be a speech frame.
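The histogram-based threshold selection of equations (2)–(6) might be sketched as follows. This is illustrative Python, not the authors' code: the bin count, step, and W defaults are hypothetical, and the boundary handling of equation (2) is slightly simplified by letting the left window shrink near the start.

```python
# Illustrative threshold selection (Section 2.3.3, equations (2)-(6)).
# Hypothetical parameter defaults; a sketch, not the paper's implementation.

def local_maxima(f, step):
    """Indices of histogram bins whose count exceeds the mean of the
    `step` bins on each side (equations (2) and (3), simplified)."""
    mean = lambda xs: sum(xs) / len(xs)
    peaks = []
    for i in range(1, len(f) - step):
        left = f[max(0, i - step):i]
        right = f[i + 1:i + 1 + step]
        if mean(left) < f[i] and mean(right) < f[i]:
            peaks.append(i)
    return peaks

def select_threshold(seq, bins=20, step=2, w=5):
    """Threshold T for one feature sequence per equations (4)-(6)."""
    lo, hi = min(seq), max(seq)
    width = (hi - lo) / bins or 1.0
    f = [0] * bins
    for v in seq:                      # build the L-bin histogram
        f[min(int((v - lo) / width), bins - 1)] += 1
    peaks = local_maxima(f, step)
    if not peaks:                      # eq. (4): quarter of the mean
        return sum(seq) / (4 * len(seq))
    val = lambda i: lo + (i + 0.5) * width   # bin-center feature value
    peaks.sort(key=lambda i: f[i], reverse=True)
    if len(peaks) == 1:                # eq. (5): the single maximum
        return val(peaks[0])
    m1, m2 = val(peaks[0]), val(peaks[1])    # eq. (6): top two maxima
    return (w * m1 + m2) / (w + 1)
```

Running this once on the smoothed energy sequence and once on the smoothed centroid sequence yields T_1 and T_2.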
2.3.4. Speaker Speech Segmentation Based on the Improved Double-Threshold Method. The improved detection process is as follows:

(i) The speech signal is collected, and the time-domain waveform is obtained.

(ii) The speech is divided into frames and windowed, and the short-time Fourier transform is performed to obtain the spectrogram of the signal.

(iii) The short-time energy feature En is extracted in the time domain, and the spectral centroid feature Cn is extracted in the frequency domain.

(iv) The short-time energy feature and the spectral centroid feature are each smoothed by median filtering twice.

(v) The histograms of the above two feature sequences are calculated, the local maxima of the histograms are counted, and the threshold values of the two features are calculated. The threshold value of the short-time energy feature is T1, and that of the spectral centroid feature is T2.

(vi) If the short-time energy feature of a frame is greater than T1 and the spectral centroid feature of the frame is greater than T2, the frame is marked as a speech frame; otherwise, it is marked as a nonspeech frame.

(vii) Postprocessing stage (used according to the situation): extend the two ends of each voice segment by 2 windows, and finally merge the continuous segments into the final voice segments.

The speaker speech segmentation algorithm based on the improved double-threshold method is shown in Figure 4.

[Figure 4: Flowchart of the speaker segmentation algorithm based on the improved double-threshold method VAD.]

The postprocessing stage mainly takes into account the extremely short pauses that sometimes occur in speech; eliminating these pauses and merging the speech can reduce the number of voice segments and the complexity of the results. However, in a few cases, these short pauses may also be a speaker change point, which would lead to wrong merging and affect the next stage of speech clustering. Therefore, the postprocessing method is used when the audio contains only one person's voice, but not when there is a multiperson conversation.

2.4. Comparative Experimental Analysis. The endpoint detection experiment on the speech signal is carried out in Matlab, and the data are recorded with a Newsmy recorder. The experimental sample is a 1.5 s speech whose content is the Chinese pronunciation of "Ni Hao" (hello). The output is a standard Windows WAV audio file named Hello.wav, with sampling frequency fs = 8 kHz, monophonic, using 16-bit encoding. For the original speech, we use the traditional double-threshold method and the improved method to carry out endpoint detection experiments and make a comparative analysis.

The time-domain waveform of the Hello.wav raw audio file is shown in Figure 5.

[Figure 5: Original speech signal.]

Firstly, the speech signal is divided into frames and windowed. The frame length is wlen = 200 (each frame has 200 sampling points), the frame shift is inc = 100, and the window function is the Hanning window. At the sampling rate fs = 8 kHz, the total number of sampling points of the speech sequence is 12001, which is divided into 119 frames, with each frame corresponding to 25 ms. The energy of each frame of speech is calculated, and the short-time energy characteristics of the speech are extracted; Figure 6 shows the short-time energy map of the speech. The short-time average zero-crossing rate of each frame is then calculated, and the zero-crossing rate feature is extracted; Figure 7 shows the short-time average zero-crossing rate characteristics of the speech.
1 and ends at about 1.03 s, which is consistent with the actual


situation. The results show that the original double-
0.8 threshold endpoint detection method can achieve good
detection results in extremely low noise environment.
Amplitude

0.6
2.4.2. Analysis of Endpoint Detection Based on Improved
0.4 Double-Threshold Method. First, the spectral centroid of
each frame is calculated, and the spectral centroid feature is
0.2 extracted; then, the short-time energy and spectral centroids
are smoothed by median filtering twice, and the threshold
0 values of the two features are calculated simultaneously. The
0 0.2 0.4 0.6 0.8 1 1.2 endpoint detection results of the improved double-threshold
Time (s) method are shown in Figure 9:
Figure 6: Short-time energy of original speech signal.
Figure 7: Short-time average zero-crossing rate of original speech signal.

Figures 9(a) and 9(b) show the short-time energy and spectral centroid feature curves, respectively; the solid line is the original feature curve, and the dashed line is the feature curve after the two passes of smoothing filtering. The ordinate of the thick black bar in each figure is the feature threshold selected after calculation: whenever the feature curve rises above the bar, that feature exceeds its threshold, and only when both features exceed their thresholds is the frame judged to be a voice frame.

Figure 9(c) shows the endpoint detection result of the improved algorithm, in which the beginning of each speech segment is marked with a solid line and the end with a dashed line. It can be seen that the method detects two segments of speech, at 0.74–1.06 s and 1.08–1.26 s. In fact, there is a slight pause between the word "you" and the word "good" in the audio. The overall extent of the speech detected by the two methods is basically the same, but the original double-threshold method detects only the overall speech segment, while the improved method can accurately detect the pause in the middle of the speech segment. Therefore, the improved method better meets the needs of speaker segmentation in this paper.
Figure 8: Results of double-threshold method VAD.

2.4.3. Comparison of Detection Accuracy between Two Methods. For the original audio file Hello.wav and copies of it corrupted by different levels of Gaussian white noise, endpoints are detected with both the original double-threshold method and the improved double-threshold method. The endpoint detection accuracy is defined as

Accuracy = (total frame number − number of error frames) / total frame number.  (7)
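Equation (7) amounts to counting frame-level labeling disagreements. A minimal sketch, assuming the per-frame speech/silence masks come from a detector and a reference labeling:

```python
import numpy as np

def endpoint_accuracy(detected: np.ndarray, reference: np.ndarray) -> float:
    """Equation (7): error frames are frames whose speech/silence
    label disagrees with the reference labeling."""
    error_frames = int(np.count_nonzero(detected != reference))
    return (detected.size - error_frames) / detected.size
```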

Using the traditional double-threshold endpoint detection algorithm, the location of the hello speech in the time-domain waveform is detected, and the detection result is shown in Figure 8. In Figure 8, the beginning of the speech is marked with a solid line and the end with a dashed line. It can be seen from the picture that the speech starts at about 0.38 s.

For the speech with different noise levels, the endpoint detection accuracy is calculated, and the accuracy results are shown in Figure 10. From the accuracy in Figure 10, we can see that both detection algorithms can accurately detect the endpoints of speech in the case of silence or very small noise. When different levels of noise are imposed on the audio files, with
8 Mathematical Problems in Engineering

Figure 9: Results of the improved double-threshold method VAD: (a) short-time energy, (b) spectral centroid, and (c) results of improved double-threshold method VAD.

the continuous reduction of the SNR, the noise is continuously enhanced, and the detection accuracy of the traditional double-threshold method drops significantly, while the improved algorithm still maintains high detection accuracy.

Figure 10: Accuracy of endpoint detection based on two algorithms in Gaussian white noise.

3. Improved K-Means Speaker Speech Clustering Based on Self-Organizing Neural Network

Clustering is a typical unsupervised learning technique: the given data have only features, without labels, and are classified according to the internal relationships and similarity among the data. In contrast, in supervised learning the training data contain both labels and features, and the relationship between features and labels can be learned through training, so that the label can be judged when new data are encountered. A comparison of the system components of the two learning modes is shown in Figure 11.

Figures 11(a) and 11(b) show the system composition of a supervised learning mode and an unsupervised learning mode, respectively. It can be seen that, as an unsupervised learning method, clustering does not need the output to be set in advance and involves no human interference; its purpose is to bring similar objects together, regardless of what each class is. In this paper, clustering is applied to the classification of the speakers' speech, so that the speech of the same person is grouped into one class.

Figure 11: Supervised learning and unsupervised learning.

The k-means algorithm and the self-organizing map (SOM) neural network are widely used in cluster analysis. The k-means algorithm is convenient, computes quickly, and gives accurate results, but it needs the number of clusters to be given in advance, and its results are strongly affected by the choice of initial cluster centers, so it easily falls into local optima. The self-organizing neural network offers strong interpretability, strong learning ability, and good visualization, but it converges slowly, cannot provide precise clustering information, and clusters poorly when the sample volume is not large, so on its own it is not well suited to speaker speech clustering.

Therefore, in order to obtain a better clustering method, this paper introduces the self-organizing neural network into speaker clustering and uses it to improve the k-means algorithm: the network predicts the number of clusters and the initial cluster centers for k-means, so as to overcome the shortcomings of both methods and improve clustering accuracy.

3.1. Self-Organizing Neural Network. The self-organizing feature map (SOM) neural network is based on the phenomenon of lateral inhibition in biological neural systems. The basic idea is that, for a specific input pattern, each neuron competes for the opportunity to respond; ultimately only one neuron wins, and the winning neuron represents the classification of the input pattern. Therefore, the self-organizing neural network is naturally associated with clustering.

The structure of a self-organizing neural network is generally a two-layer network, an input layer plus a competition layer, with no hidden layer; sometimes there are lateral connections between the neurons of the competition layer. A typical self-organizing neural network structure is shown in Figure 12.

Input layer: simulates the retina, which perceives external information; it receives the information, plays the role of observation, and transmits the input pattern to the competition layer. The number of neurons in the input layer is generally the number of samples.

Competition layer: simulates the responding cerebral cortex; it is responsible for comparative analysis of the input, looking for rules and performing classification. The output of the competition layer represents the classification of the pattern, and the number of neurons is usually the number of categories.

Figure 12: Typical SOM network model.

Another structure is the two-dimensional form, which resembles the cortex more closely, as shown in Figure 13.

Figure 13: Two-dimensional SOM network model.
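Under this two-layer structure, the competition can be represented by a weight matrix with one row per competition-layer neuron; a minimal sketch of the winner lookup (names and array shapes are ours; the inner-product criterion used here is the one formalized in Section 3.2):

```python
import numpy as np

def find_winner(W: np.ndarray, x: np.ndarray) -> int:
    """Return the index of the winning competition-layer neuron.

    W: (m, d) matrix, one unit-normalized weight row per neuron.
    x: (d,) input pattern; the winner is the neuron whose weight
       vector has the largest inner product with the normalized input.
    """
    x_hat = x / (np.linalg.norm(x) + 1e-12)
    return int(np.argmax(W @ x_hat))
```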

Each neuron in the competition layer links laterally with its nearby neurons in a certain way, forming a plane similar to a checkerboard. In this kind of structure, the neurons of the competition layer are arranged as a two-dimensional node matrix, and the neurons of the input layer and output layer are connected with each other according to the weights.

3.2. Competitive Learning Rule. The self-organizing neural network follows the rule of competitive learning; that is, the winning neurons inhibit the losing neurons. Because the method is unsupervised, the samples contain no desired output pattern, and there is no a priori knowledge of which class an input element should be assigned to, so classification must be carried out according to the similarity between samples; this is the basis of self-organizing neural network clustering.

The basic steps of the competitive learning rule are as follows:

(1) Vector normalization. The input vector X of the self-organizing neural network and the weights W_j (j = 1, 2, ..., m) of each neuron in the competition layer are all normalized to obtain X̂ and Ŵ_j:

X̂ = X / ‖X‖,  Ŵ_j = W_j / ‖W_j‖.  (8)

(2) Find the winning neuron. Comparing X̂ with the weights Ŵ_j of all neurons in the competition layer, the most similar neuron is the winning neuron, with weight Ŵ_{j*}:

‖X̂ − Ŵ_{j*}‖ = min_{j ∈ {1,2,...,m}} ‖X̂ − Ŵ_j‖.  (9)

For normalized vectors, the greatest similarity corresponds to the largest inner product:

Ŵ_{j*}^T X̂ = max_{j ∈ {1,2,...,m}} (Ŵ_j^T X̂).  (10)

This is equivalent to finding the point with the smallest angle on the unit circle.

(3) Network weight adjustment. According to the learning rule, the output of the winning neuron is 1 and the outputs of the other neurons are 0; that is,

y_j(t + 1) = 1 if j = j*, and 0 if j ≠ j*.  (11)

Only the winning neuron has the right to adjust its weight vector:

Ŵ_{j*}(t + 1) = Ŵ_{j*}(t) + ΔW_{j*} = Ŵ_{j*}(t) + η(t)(X̂ − Ŵ_{j*}), for j = j*;
W_j(t + 1) = Ŵ_j(t), for j ≠ j*.  (12)

Here 0 < η(t) < 1 is the learning rate, which generally decreases with time; that is, the degree of adjustment becomes smaller and smaller, and the weights gradually tend toward the cluster centers. After adjustment, the weight vector is no longer a unit vector, so it must be normalized again, and the network is retrained until the learning rate η(t) decays to zero, at which point the algorithm ends.

In the testing phase, the inner product of the given object with the weights of each neuron is calculated, and the object is assigned to the class of the most similar neuron.

The Kohonen algorithm is usually used for the two-dimensional self-organizing neural network structure. This algorithm is an improvement of the competitive learning rule above; the main difference lies in how lateral inhibition enters the weight adjustment. In the competitive learning rule, only the winning neuron adjusts its weight. In the Kohonen algorithm, the influence of the winning neuron on the surrounding neurons runs from near to far, from excitation to inhibition, so the nearby neurons also adjust their weights to varying degrees under its influence. Taking the winning neuron as the center, a neighborhood radius R is set, and this range is called the dominant neighborhood. In the algorithm, the neurons in the winning neighborhood adjust their weights according to their distance from the winning neuron. At the beginning, the radius of the winning neighborhood is set very large, and as the number of training steps increases, it shrinks until it is zero, as shown in Figure 14.

3.3. Design of Improved k-Means Speaker Clustering Algorithm Based on Self-Organizing Neural Network. The operation of the self-organizing neural network is divided into two stages: training and testing. In the training stage, the training-set samples are input; for a specific input, one neuron of the competition layer produces the largest response and wins. The neural network adjusts the weights on the training samples in a self-organizing way and finally makes some neurons in the

Figure 14: Contraction of the superior neighborhood.

competition layer sensitive to the input of a specific pattern class, so that the corresponding weights become the center of each input pattern. Thus, a characteristic map of the class distribution is formed in the competition layer.

The k-means method has two shortcomings: the number of clusters needs to be given in advance, and the algorithm depends heavily on the selection of the initial cluster centers. The self-organizing neural network has the advantages of strong learning ability, strong interpretability, visualization, and so on. However, it also has the limitations of long training time, slow convergence, and unsatisfactory clustering results for small data sets.

In this paper, the self-organizing neural network is introduced into the k-means algorithm. The improved algorithm not only compensates for the slow convergence of the self-organizing neural network but also improves the k-means algorithm:

(1) Predicting the number of clusters. Firstly, a self-organizing neural network is trained on the speech feature set for a short period, and a discrimination method is designed to determine the number of classes K from the winning situation of the neurons in the competition layer of the network.

(2) Finding the initial cluster centers. The weights of the neurons are used as the initial cluster centers, and the k-means algorithm then completes the speech segment clustering. Because the self-organizing neural network only supplies initial values for the k-means algorithm, it is unnecessary to wait for the network to converge completely, which reduces the training time of the network. For the trained network, the more times a neuron in the competition layer wins, the closer it is to an actual cluster center. Therefore, the number K of clusters can be predicted, and the initial cluster centers calculated, from the winning situation of the neurons.

The specific steps of the algorithm are as follows:

(1) Sample input. Based on the improved double-threshold endpoint detection algorithm, the long audio is segmented into n short speech segments, each containing only one person's speech; the MFCC features of each speech segment are then extracted to form a feature set X_i (i = 1, 2, ..., n) as the input of the system.

(2) Training the self-organizing neural network.

(a) Start with a rough estimate of the number of categories K; it is assumed that there are no more than nine speakers in this experiment (k ≤ 9). Nine neurons are set in the competition layer of the network, in a 3 × 3 layout. The number of neurons in the input layer is n.

(b) Initialization: the speech segment feature vectors are normalized to obtain X̂_i (i = 1, 2, ..., n); the competition-layer neuron weights W_j (j = 1, 2, ..., 9) are assigned small random numbers and normalized to obtain Ŵ_j (j = 1, 2, ..., 9). The initial values of the dominant neighborhood N_{j*}(0) and the learning rate η are set, and the training time is set to t = 1. Because the self-organizing neural network is used at the front end of the k-means method in this algorithm, the training time can be reduced: there is no need to wait for the network to converge completely, and only a relatively small number of iterations is required (100 in the experiment).

(c) Finding the winning neuron: for the i-th input object, calculate the inner product of X̂_i with each Ŵ_j and find the neuron with the maximum inner product; this is the winning neuron j*:

Ŵ_{j*}^T X̂_i = max_{j ∈ {1,2,...,m}} (Ŵ_j^T X̂_i),  (13)

where W_j is the neuron weight.

(d) Defining the dominant neighborhood N_{j*}(t): taking j* as the center, determine the dominant neighborhood N_{j*}(t) at time t. Generally the initial neighborhood N_{j*}(0) is large (about 50%–80% of the total nodes), and N_{j*}(t) shrinks as the training time increases.

(e) Adjusting the weights: adjust the weights of all neurons in the dominant neighborhood N_{j*}(t):

W_j(t + 1) = W_j(t) + η(t, N)[x_i − W_j(t)],  i = 1, 2, ..., n,  j ∈ N_{j*}(t),  (14)

where the learning rate η(t, N) is a function of the training time t and of the topological distance N between the winning neuron j* and the neighborhood neuron j. This function generally obeys the rules

t↑ → η↓,  N↑ → η↓.  (15)

For example:

η(t, N) = η(t)e^{−N},  (16)

where η(t) can be taken as a monotonically decreasing function of t, also called an annealing function.

(f) Set t = t + 1; steps (c) to (e) are repeated until η(t) ≤ η_min or the maximum number of training iterations is reached, and then step (3) is entered.

(3) K value decision. The winning counts of the neurons in the competition layer after training are P_j (j = 1, 2, ..., 9). Let k = 0 and j = 1; if

P_j > (4/3) mean[P_1, P_2, ..., P_9],  (17)

then the number of categories is incremented, k = k + 1, and j = j + 1. Judgment continues according to formula (17), and the final number of categories is obtained as k = k_0. The idea here is that the more times a neuron wins, the closer it is to an actual cluster center. Neurons with fewer wins (less than the average number of wins) are considered far away from the cluster centers and are ignored.

(4) Initial cluster center prediction. The self-organizing neural network is retrained; this time, k_0 neurons are set in the competition layer, and everything else remains unchanged. When the network training is finished, the weight W_l (l = 1, 2, ..., k_0) of each neuron is obtained and used as an initial cluster center in the k-means method.

(5) K-means speaker clustering.

(a) The input of the algorithm is as follows: the MFCC feature set X_i (i = 1, 2, ..., n) of the speech segments, the class number k_0, and the initial cluster centers μ_j:

μ_j = W_j  (j = 1, 2, ..., k_0).  (18)

(b) The class partition C is initialized to

C_j = φ,  j = 1, 2, ..., k_0.  (19)

(c) The distance between each sample X_i and each cluster center μ_j is calculated as

d_ij = ‖X_i − μ_j‖₂².  (20)

X_i is assigned to the class λ_i corresponding to the smallest d_ij, and the class division is updated:

C_{λ_i} = C_{λ_i} ∪ {X_i}.  (21)

(d) For j = 1, 2, ..., k_0, the cluster centers are recalculated over all sample points in C_j:

μ_j = (1/N_j) Σ_{i=1}^{N_j} X_i,  (22)

where N_j is the number of samples in category C_j.

(e) Set t = t + 1, and use the error sum-of-squares criterion function to determine whether the algorithm has finished:

J = Σ_{j=1}^{k_0} Σ_{x ∈ C_j} (x − μ_j)².  (23)

If |J(n) − J(n − 1)| < ξ is satisfied, or the number of iterations reaches t = T, the algorithm ends and goes to step (6); otherwise, it goes back to step (b).

(6) Algorithm output. The cluster partition C = {C_1, C_2, ..., C_{k_0}} is output, and the algorithm ends.

To sum up, the flow of the improved k-means speaker clustering algorithm based on the self-organizing neural network is shown in Figure 15.

3.4. Experimental Analysis. The experimental samples are multiperson dialogue recordings several minutes long, recorded with a Newsmy recording pen to simulate a multiperson meeting. The output is a standard Windows WAV audio file with sampling frequency fs = 8 kHz, monophonic, using 16-bit encoding. Crosstalk is avoided during recording, and, in order to ensure the purity of the voices and improve clustering accuracy, the participants speak clearly and avoid coughs and other noises.

The experimental process is shown in Figure 16. The k-means speaker clustering algorithm, the self-organizing neural network speaker clustering algorithm, and the improved k-means speaker clustering algorithm based on the self-organizing neural network are each used to cluster the speech segments, and the effectiveness of the improved algorithm is verified by comparative analysis.

An audio file named Recording 1.wav is selected, which lasts for 3 minutes and contains the voices of two men and one woman. The time-domain waveform of the Recording 1.wav audio file is extracted as shown in Figure 17.

Firstly, the speech signal is preprocessed, including pre-emphasis, framing, and windowing. The frame length is wlen = 200, the frame shift is inc = 100, and the window function is the Hanning window. The duration of the audio sequence is 180 s, and the total number of sampling points is 1419856 at the sampling rate fs = 8 kHz. The sequence is divided into 14197 frames, and each frame corresponds to 25 ms. From the time-domain waveform, it can be seen that the audio has a number of voice segments, with short gaps between them.

The short-time energy and spectral centroid characteristics of each frame of speech are calculated from beginning to end. The audio is segmented based on the improved

double-threshold endpoint detection method, and the segmented speech waveform is shown in Figure 18. As can be seen from Figure 18, the audio is divided into a number of speech segments. In the picture, the speech segments are shown in dark colors, and the silence between the speech segments is shown in light gray. After speech segmentation, a total of 96 short-time speech segments are obtained, each of which contains only one person's speech.

Figure 15: Flowchart of the SOM + k-means speaker clustering algorithm.

Figure 16: Flowchart of speaker clustering experiment.

In the clustering experiment, the MFCC (mel-frequency cepstrum coefficient) is used as the basis for distinguishing different speakers. The average of the MFCC vectors of all frames in a speech segment is used to represent the MFCC feature of the whole segment; that is, the MFCC feature vector is obtained by averaging the feature matrix column by column. For these 96 speech segments, the MFCC feature vectors of each speech segment are extracted, respectively, and the

feature set is synthesized. In data processing, different evaluation indexes usually have different dimensional units, which affects the analysis results. In order to eliminate the influence of the different dimensions and make the features comparable, the feature set must be normalized. The normalized data lie between −1 and 1, are of the same order of magnitude, and are suitable for comprehensive evaluation. The normalized MFCC feature set is used as the input sample of the clustering system.

The MFCC feature set of the audio is shown in Figure 19. The feature set has 12 columns; that is, the dimension of the MFCC is 12. Each row represents one sample, for a total of 96 short-duration speech segments.

Figure 17: Partial audio waveform.

Figure 18: Segmented speech (results of the improved double-threshold method VAD).

3.4.1. K-Means Speaker Clustering. For the feature set X_i (i = 1, 2, ..., 96), the k-means speaker clustering algorithm, the self-organizing neural network speaker clustering algorithm, and the improved k-means speaker clustering algorithm based on the self-organizing neural network are used in clustering experiments.

First, the segmented audio is listened to, and distinguishing labels are attached for the different speakers to facilitate subsequent comparative analysis. Zhang San's speech is denoted by "a," Li Si's speech by "b," and Wang Wu's speech by "c." Table 1 shows the speech categories.

Based on the k-means speaker clustering algorithm, the MFCC feature set X_i (i = 1, 2, ..., 96) is clustered with the K value set to 3 and the cluster centers initialized randomly. When the initial cluster centers are selected properly, the clustering accuracy can reach up to 94.8%. However, over 50 runs of k-means clustering on this sample, there are 12 runs of abnormal clustering due to improper selection of the initial cluster centers, which greatly reduces the average clustering accuracy: the average accuracy of these 50 k-means runs is 84.5%. Table 2 shows the abnormal clustering results obtained when the initial cluster centers are not selected properly; the suffix "×" marks a wrongly clustered item, and the accuracy of this clustering run is 52.1%.

It can be seen that k-means speaker clustering is greatly affected by the selection of the initial cluster centers. The instability of the clustering results directly reduces the average clustering accuracy.

3.4.2. Speaker Clustering Based on Self-Organizing Neural Network. The number of neurons in the input layer is 96, the number of neurons in the competition layer is 3, the number of iterations is 500, and the learning rate is η(t) = 0.1. The self-organizing neural network speaker clustering algorithm described above is used to cluster the MFCC feature set X_i (i = 1, 2, ..., 96) (Table 3).

It can be seen that the accuracy of the self-organizing neural network algorithm is lower than that of the k-means algorithm when the initial cluster centers of the latter are selected appropriately. However, because its clustering results are stable, its average clustering accuracy is higher than that of the k-means algorithm. Therefore, we combine the two algorithms and use the self-organizing neural network to improve the k-means algorithm, so that the clustering results are stable while a higher accuracy is ensured.

3.4.3. Improved k-Means Speaker Clustering Based on Self-Organizing Neural Network. The MFCC feature set X_i (i = 1, 2, ..., 96) is clustered by the improved k-means speaker clustering algorithm based on the self-organizing neural network.

First, the number of categories is predicted: assuming that the number of speakers in the audio is unknown, the self-organizing neural network is used to predict the number of speakers. Let the number of classes be k ≤ 9; nine neurons are set up in the competition layer of the network, in a 3 × 3 layout. The number of neurons in the input layer is 96, and a small number of iterations (100) is set.

After training, the winning count P of each neuron in the competition layer is tallied. Calculating 4/3 of the average number of wins gives 14.22. There are three neurons with more than 14.22 wins, their win counts being 22, 20, and 18, respectively. This indicates that these three neurons are close to the actual cluster centers, while the neurons with fewer than 14.22 wins are far away from the actual cluster centers and can be ignored. Therefore, the predicted number of categories is k = 3. (In order to predict the number of categories accurately, the mode can be taken over multiple discriminations.)

After predicting that the number of speakers is three, the self-organizing neural network is retrained, this time with 3 neurons in the competition layer and everything else unchanged. At the end of the network training, the weight {W_1, W_2, W_3} of each neuron is obtained, as shown in Figure 20.
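The category prediction and SOM-seeded k-means just described can be sketched as follows. This is a minimal sketch only: the SOM training loop itself is omitted, the win counts are assumed to have been tallied as described above, clusters are assumed to stay non-empty, and all names are illustrative.

```python
import numpy as np

def predict_k(win_counts):
    """Number of clusters: neurons whose win count exceeds 4/3 of the
    mean win count (equation (17)) are taken as center candidates."""
    p = np.asarray(win_counts, dtype=float)
    return int(np.sum(p > (4.0 / 3.0) * p.mean()))

def kmeans(X, centers, max_iter=100, xi=1e-6):
    """k-means seeded with the retrained SOM weights (steps (5a)-(5e))."""
    prev_J = np.inf
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # eq. (20)
        labels = d.argmin(axis=1)                                     # eq. (21)
        centers = np.stack([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])            # eq. (22)
        J = d[np.arange(len(X)), labels].sum()                        # eq. (23)
        if abs(prev_J - J) < xi:                                      # stop test
            break
        prev_J = J
    return labels, centers
```

Seeding `centers` with the retrained SOM weights is what removes the random-initialization instability observed in Section 3.4.1.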

Figure 19: MFCC feature set.

Table 1: Speech category table.

Label      1  2  3  4  5  6  7  8  9  10
Categories a  a  b  c  b  a  c  a  a  b
Label      11 12 13 14 15 16 17 18 19 20
Categories a  c  c  c  c  c  b  b  a  c
Label      21 22 23 24 25 26 27 28 29 30
Categories c  b  a  a  a  a  b  a  a  a
Label      31 32 33 34 35 36 37 38 39 40
Categories a  a  b  c  c  c  c  c  c  b
Label      41 42 43 44 45 46 47 48 49 50
Categories a  a  a  a  b  b  a  b  c  a
Label      51 52 53 54 55 56 57 58 59 60
Categories a  b  c  a  a  c  c  a  a  a
Label      61 62 63 64 65 66 67 68 69 70
Categories c  c  c  b  b  b  a  a  a  a
Label      71 72 73 74 75 76 77 78 79 80
Categories c  c  c  c  c  a  a  a  c  c
Label      81 82 83 84 85 86 87 88 89 90
Categories a  c  c  b  c  c  a  a  a  a
Label      91 92 93 94 95 96
Categories a  b  a  a  a  a

Table 2: K-means speaker clustering results with improper initial values.

Label      1   2   3   4   5   6   7   8   9   10
Categories a   a   b   b×  b   a   b×  a   a   b
Label      11  12  13  14  15  16  17  18  19  20
Categories a   b×  b×  c   c   b×  b   b   a   b×
Label      21  22  23  24  25  26  27  28  29  30
Categories b×  b   a   a   a   a   b   a   a   a
Label      31  32  33  34  35  36  37  38  39  40
Categories a   a   b   b×  b×  c   c   b×  b×  b
Label      41  42  43  44  45  46  47  48  49  50
Categories a   a   a   c×  b   b   c×  b   b×  c×
Label      51  52  53  54  55  56  57  58  59  60
Categories c×  b   b×  c×  a   b×  b×  a   c×  c×
Label      61  62  63  64  65  66  67  68  69  70
Categories b×  b×  b×  b   b   b   c×  c×  c×  c×
Label      71  72  73  74  75  76  77  78  79  80
Categories b×  b×  b×  b×  b×  a   ca  a×  b×  b×
Label      81  82  83  84  85  86  87  88  89  90
Categories a   b×  b×  b   b×  b×  c×  c×  a   c×
Label      91  92  93  94  95  96
Categories a   b   a   c×  c×  a
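The per-run accuracies quoted in the text (e.g., 52.1% for the run of Table 2) are the fraction of segments whose cluster label agrees with the listening-based ground truth of Table 1; a minimal sketch:

```python
def clustering_accuracy(predicted, truth):
    """Fraction of speech segments assigned to the correct speaker."""
    assert len(predicted) == len(truth)
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)
```

Note that in practice the anonymous cluster indices produced by k-means must first be mapped to speaker labels (e.g., by majority vote within each cluster) before this comparison is made.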

Table 3: SOM speaker clustering results.

Label      1   2   3   4   5   6   7   8   9   10
Categories a   a   b   c   b   a   b×  a   a   b
Label      11  12  13  14  15  16  17  18  19  20
Categories a   c   c   b×  c   b×  b   b   a   c
Label      21  22  23  24  25  26  27  28  29  30
Categories c   b   a   a   a   a   b   a   a   a
Label      31  32  33  34  35  36  37  38  39  40
Categories a   a   b   c   a×  c   a×  c   c   b
Label      41  42  43  44  45  46  47  48  49  50
Categories a   a   a   b×  b   b   b×  b   c   a
Label      51  52  53  54  55  56  57  58  59  60
Categories a   b   c   a   a   c   c   a   a   a
Label      61  62  63  64  65  66  67  68  69  70
Categories c   a×  a×  b   b   b   a   a   a   a
Label      71  72  73  74  75  76  77  78  79  80
Categories c   c   a×  c   c   a   a   b×  c   c
Label      81  82  83  84  85  86  87  88  89  90
Categories a   c   c   b   c   c   a   a   a   a
Label      91  92  93  94  95  96
Categories a   b   a   a   a   a

1 2 3 4 5 6 7 8 9 10 11 12

1 –0.4599 –0.2687 –0.4762 0.0375 –0.3098 –0.4637 –0.3003 –0.4836 –0.4448 –0.4555 –0.1798 0.0254
2 0.1069 0.5795 0.2901 0.4029 –0.0095 0.2195 0.1038 0.2853 –0.0727 0.2792 0.1259 –0.2594
3 0.3370 0.6511 –0.3395 –0.1476 0.1696 –0.2267 0.4405 0.5404 –0.4869 0.4999 –0.5759 –0.1712

Figure 20: Weights of neurons after training.

Table 4: SOM + k-means speaker clustering results.

Label      1   2   3   4   5   6   7   8   9   10
Categories a   a   b   c   b   a   c   a   a   b
Label      11  12  13  14  15  16  17  18  19  20
Categories a   c   c   c   c   c   b   b   a   c
Label      21  22  23  24  25  26  27  28  29  30
Categories c   b   a   a   a   a   b   a   a   a
Label      31  32  33  34  35  36  37  38  39  40
Categories a   a   b   b×  c   c   c   b×  c   b
Label      41  42  43  44  45  46  47  48  49  50
Categories a   a   a   a   b   b   a   b   c   a
Label      51  52  53  54  55  56  57  58  59  60
Categories c×  b   c   a   a   b×  c   a   a   a
Label      61  62  63  64  65  66  67  68  69  70
Categories c   c   c   b   b   b   a   a   a   c×
Label      71  72  73  74  75  76  77  78  79  80
Categories c   c   c   c   c   a   a   a   c   c
Label      81  82  83  84  85  86  87  88  89  90
Categories a   c   c   b   c   c   a   a   a   a
Label      91  92  93  94  95  96
Categories a   b   a   a   a   a

In Figure 20, the three rows of the matrix correspond to the values of W1, W2, and W3. The weight vectors {W1, W2, W3} are saved as the initial clustering centers of the k-means algorithm; that is, the initial centers are set as μj = Wj (j = 1, 2, 3). The implementation of the k-means speaker clustering algorithm and the result of the experiment are shown in Table 4.

To sum up, for Recording 1.wav, the improved k-means speaker clustering algorithm based on the self-organizing neural network achieves good clustering results. It effectively makes up for the shortcomings of the self-organizing neural network algorithm and the k-means algorithm.
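The seeding step μj = Wj can be sketched as follows. This is a plain-numpy illustration, not the paper's code: the synthetic 12-dimensional data stands in for the segment features, and slightly perturbed cluster centers stand in for the trained SOM weights {W1, W2, W3}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 12-dim segment features from 3 speakers (40 each).
true_centers = rng.normal(size=(3, 12))
X = np.vstack([c + 0.05 * rng.normal(size=(40, 12)) for c in true_centers])

# Stand-in for the trained SOM weight vectors {W1, W2, W3}.
W = true_centers + 0.02 * rng.normal(size=(3, 12))

def kmeans(X, centers, n_iter=20):
    """Standard k-means, seeded with the SOM weights instead of random points."""
    mu = centers.copy()
    for _ in range(n_iter):
        # Assignment step: nearest center for every sample.
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = d.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned samples.
        for j in range(len(mu)):
            if np.any(z == j):
                mu[j] = X[z == j].mean(axis=0)
    return mu, z

mu, z = kmeans(X, W)
```

Because both the cluster count k = 3 and the initial centers come from the trained SOM, neither has to be guessed, which is exactly the weakness of plain k-means that the combined method removes.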

Figure 21: Diagram of SOM + k-means speaker clustering results (waveform amplitude over 0–60 s, segments colored by speaker: ZhangSan, LiSi, WangWu). For clarity, a partial enlargement is shown in Figure 22.

Figure 22: Diagram of SOM + k-means speaker clustering results (local magnification, 28–38 s).

The clustering effect is shown in Figure 21, which distinguishes different speakers by color, making the image intuitive. Zhang San's voice is shown in red, Li Si's in blue, and Wang Wu's in green, while the mute segments are gray.

3.4.4. Comparative Analysis. Nine audio files are selected to verify and analyze the above experimental results; together with the audio file Recording 1.wav, there are 10 recordings in total. The contents of each audio file are as follows:

Recording 1: contains three voices, two men and one woman
Recording 2: contains two voices, two men
Recording 3: contains two voices, two women
Recording 4: contains two voices, one man and one woman
Recording 5: contains three voices, one man and two women
Recording 6: contains three voices, three men
Recording 7: contains three voices, three women
Recording 8: contains four voices, two men and two women
Recording 9: contains four voices, two men and two women
Recording 10: contains four voices, three men and one woman

Based on the improved double-threshold endpoint detection method in this paper, the three algorithms are used to perform speaker clustering experiments. The accuracy of each algorithm is shown in Table 5.

It can be seen from the table that the clustering accuracy of the self-organizing neural network algorithm is low, and the average accuracy of the k-means algorithm is often lower still because of its instability. As the number of speakers in the audio samples increases, or as the gender differences decrease, the clustering accuracy trends downward.
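Scoring a clustering against reference speakers requires matching the anonymous cluster ids to the true speaker identities before counting hits. The paper does not give its scoring procedure; a common approach, sketched below with made-up labels, is to take the best permutation of cluster-to-speaker assignments:

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Best-permutation agreement between cluster ids and reference speakers."""
    speakers = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0
    for perm in permutations(clusters):
        mapping = dict(zip(perm, speakers))  # cluster id -> speaker name
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        best = max(best, hits)
    return best / len(true_labels)

truth = ["a", "a", "b", "b", "c", "c", "c", "a"]
pred  = [ 2,   2,   0,   0,   1,   1,   0,   2 ]
print(clustering_accuracy(truth, pred))  # 7 of 8 segments match -> 0.875
```

Brute-forcing permutations is fine for the 2–4 speakers used here; for many clusters, the Hungarian algorithm (e.g., `scipy.optimize.linear_sum_assignment`) finds the optimal matching without factorial cost.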

Table 5: Comparison of speaker clustering results based on three algorithms.

Recording             K-means speaker   SOM speaker      Improved k-means speaker
                      clustering (%)    clustering (%)   clustering based on SOM (%)
Sound recording 1     84.5              88.5             94.8
Sound recording 2     84.3              89.2             95.1
Sound recording 3     85.2              83.7             94.9
Sound recording 4     86.0              90.2             96.1
Sound recording 5     82.2              85.5             93.6
Sound recording 6     81.0              82.2             90.2
Sound recording 7     82.0              81.5             89.8
Sound recording 8     74.8              77.2             85.5
Sound recording 9     73.8              76.8             86.0
Sound recording 10    73.3              78.0             84.8

However, for the same audio samples, the clustering accuracy of the improved k-means algorithm based on the self-organizing neural network is always higher than that of the other two algorithms.

To sum up, compared with the k-means speaker clustering algorithm, the improved algorithm can not only predict the number of categories but also select the initial clustering centers reasonably, so the clustering results are stable. Compared with the self-organizing neural network speaker clustering algorithm, the improved algorithm reduces the number of network iterations, converges faster, and greatly improves the clustering accuracy. Therefore, the improved k-means speaker clustering algorithm based on the self-organizing neural network outperforms both the self-organizing neural network algorithm and the k-means algorithm.

4. Conclusion

The improved speech endpoint detection algorithm proposed in this paper can effectively eliminate isolated noise points and enhance the antinoise performance of the algorithm. The threshold value is selected from the local maxima of the histogram of the statistical feature sequence, which improves the accuracy of speech detection, enhances noise robustness, and better meets the requirements of speaker segmentation. Through the comparative analysis of the clustering accuracy of the 10 recordings, it can be seen that as the number of speakers in the audio samples increases, the clustering accuracies of the k-means and self-organizing neural network algorithms both decrease to 80%. However, the clustering accuracy of the improved k-means algorithm based on the self-organizing neural network is still maintained at 85%–89%. The improved k-means speaker clustering algorithm based on the self-organizing neural network improves the clustering accuracy: it makes up both for the defects of the self-organizing neural network algorithm, whose convergence is slow and which cannot provide accurate clustering information, and for the defects of the k-means algorithm, in which the number of clusters must be given in advance and the result is greatly affected by the selection of the initial clustering centers.

Data Availability

All of the data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Natural Science Foundation of Liaoning Province (2019-ZD-0168 and 2020-KF-12-11), the Major Training Program of Criminal Investigation Police University of China (3242019010), and the Key Research and Development Projects of the Ministry of Science and Technology (2017YFC0821005).

References

[1] J. Yang, Z. P. Li, and P. Su, "Review of speech segmentation and endpoint detection," Journal of Computer Applications, vol. 40, no. 1, pp. 1–7, 2020.
[2] Q. Fan, "Implementation and Performance Research of Speaker Logging System," Ph.D. thesis, Beijing Normal University, Beijing, China, 2011.
[3] D. Z. Yang, J. M. Xu, J. Liu et al., "Reliable mute model and speech activity detection in speaker logs," Journal of Zhejiang University (Engineering), vol. 50, no. 1, pp. 151–157, 2016.
[4] I. K. Sethi, "Video classification using speaker identification," SPIE, vol. 3022, pp. 218–225, 1997.
[5] F. Zheng, L. T. Li, and H. Zhang, "Voiceprint recognition technology and its application status," Information Security Research, vol. 2, no. 1, pp. 44–57, 2016.
[6] X. K. Li, Y. L. Zheng, N. Yuan et al., "Research on voiceprint recognition method based on deep learning," Journal of Engineering of Heilongjiang University, vol. 9, no. 1, pp. 64–70, 2018.
[7] A. Hannun, C. Case, J. Casper et al., "Deep speech: scaling up end-to-end speech recognition," Computer Science, vol. 17, pp. 1–12, 2014.
[8] H. Z. Chen and Z. J. Zhang, "A speech endpoint detection method based on energy and frequency band variance," Science Technology and Engineering, vol. 19, no. 26, pp. 249–254, 2019.
[9] N. Seman, Z. Abu Bakar, and N. Abu Bakar, "An evaluation of endpoint detection measures for Malay speech recognition of an isolated words," in Proceedings of the 2010 International

Symposium on Information Technology, vol. 10, pp. 1628–1635, Kuala Lumpur, Malaysia, June 2010.
[10] S. Morita, M. Unoki, X. Lu, and M. Akagi, “Robust voice
activity detection based on concept of modulation transfer
function in noisy reverberant environments,” Journal of
Signal Processing Systems, vol. 82, no. 2, pp. 163–173, 2016.
[11] Y. Zheng and S. Gao, “Speech endpoint detection based on
fractal dimension with adaptive threshold,” Journal of
Northeastern University (Natural Science), vol. 41, no. 1,
pp. 7–11, 2020.
[12] W. U. Di, H. Zhao, C. Huang et al., “Speech endpoint de-
tection in low-SNRs environment based on perception
spectrogram structure boundary parameter,” Journal of Signal
Processing Systems, vol. 39, no. 4, pp. 392–399, 2014.
[13] M. Eshaghi and M. R. Karami Mollaei, “Voice activity de-
tection based on using wavelet packet,” Digital Signal Pro-
cessing, vol. 20, no. 4, pp. 1102–1115, 2010.
[14] Y. Y. Lu, N. Zhou, K. Xiao et al., “Improved speech endpoint
detection algorithm in strong noise environment,” Journal of
Computer Applications, vol. 34, no. 5, pp. 1386–1390, 2014.
[15] J. T. Liu and N. Jiang, “Research on speech segmentation and
clustering based on mixed features,” Electro-Optic Technology
Application, vol. 34, no. 5, pp. 37–41, 2019.
[16] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, and
L. Besacier, “Step-by-step and integrated approaches in
broadcast news speaker diarization,” Computer Speech &
Language, vol. 20, no. 2-3, pp. 303–330, 2006.
[17] G. R. Hu and X. D. Wei, “Endpoint detection of noisy speech
based on cepstrum feature,” Journal of Electronics, vol. 28,
no. 10, pp. 95–97, 2000.
[18] L. Li and J. Zhu, “Research on speech endpoint detection
based on wavelet analysis and neural network,” Journal of
Electronic Measurement and Instrument, vol. 27, no. 6,
pp. 528–534, 2013.
[19] P. Delacourt and C. J. Wellekens, “DISTBIC: a speaker-based
segmentation for audio data indexing,” Speech Communica-
tion, vol. 32, no. 1-2, pp. 111–126, 2000.
[20] Z. P. Zhang, L. N. Zhang, and S. He, “Research on continuous
adaptive algorithm based on GMM-UBM speaker model,”
Communication Power Supply Technology, vol. 33, no. 2,
pp. 81–83, 2016.
[21] C. B. Huo, C. J. Zhang, and H. M. Zhao, “Research on speaker
verification system based on GMM-UBM,” Journal of
Liaoning University of Technology: Natural Science Edition,
vol. 3, pp. 149–151, 2012.
[22] B. Fergani, M. Davy, and A. Houacine, “Speaker diarization
using one-class support vector machines,” Speech Commu-
nication, vol. 50, no. 5, pp. 355–365, 2008.
[23] W. X. Zhu, "Research on Speaker Segmentation and Clustering in Multi-Person Conversation Scene," Ph.D. thesis, University of Science and Technology of China, Hefei, China, 2017.
[24] H. Qiu, “Research on Speaker Clustering Based on GMM and
Hierarchical Clustering”, Peking University, Beijing, China,
2004.
[25] J. L. Ma, X. X. Jing, and H. Y. Yang, “Application of principal
component analysis and K-means clustering in speaker rec-
ognition,” Computer Application, vol. 35, no. s1, pp. 127–129,
2015.
