Research Article
An Improved Speech Segmentation and Clustering Algorithm
Based on SOM and K-Means
1 Criminal Investigation Police University of China, Shenyang 110854, China
2 Liaoning University, Shenyang 110036, China
Copyright © 2020 Nan Jiang and Ting Liu. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
This paper studies the segmentation and clustering of speaker speech. In order to improve the accuracy of speech endpoint detection, the traditional double-threshold short-time average zero-crossing rate is replaced by a better spectral centroid feature, the local maxima of the statistical feature sequence histogram are used to select the threshold, and a new speech endpoint detection algorithm is proposed. Compared with the traditional double-threshold algorithm, it effectively improves the detection accuracy and noise immunity at low SNR. The conventional k-means clustering algorithm needs the number of clusters to be given in advance and is greatly affected by the choice of initial cluster centers. At the same time, the self-organizing neural network algorithm converges slowly and cannot provide accurate clustering information. An improved k-means speaker clustering algorithm based on the self-organizing neural network is therefore proposed. The number of clusters is predicted from the winning counts of the competitive neurons in the trained network, and the weights of the neurons are used as the initial cluster centers of the k-means algorithm. The experimental results of multiperson mixed speech segmentation show that the proposed algorithm can effectively improve the accuracy of speech clustering and make up for the shortcomings of the k-means algorithm and the self-organizing neural network algorithm.
speech is uncertain, and there are various kinds of noise in speech. How to solve these problems effectively and improve the robustness of the segmentation and clustering system has become an important research direction, which is also the main research content of this paper.

In the work of speech recognition, the result of speech endpoint detection greatly affects the accuracy and speed of speech recognition segmentation [8]. Accurate endpoint detection can save a great deal of computation in the feature extraction of follow-up speech recognition and also make the acoustic model more accurate, so as to improve the accuracy of speech segmentation and recognition. Accurate endpoint detection of the speech signal against a complex background is a very important research branch in the field of speech recognition [9].

The so-called endpoint detection is to locate the speech segments in a section of original sound data and find the start- and endpoints of each speech segment [10, 11]. Eliminating the influence of channel and background noise, accurately determining the start- and endpoints of the sound segments, removing the silent segments from the speech signal, and concentrating the energy of the whole speech signal on the sound segments, instead of being disturbed by background noise and silent segments, can effectively improve the accuracy of speech segmentation and recognition. The performance, robustness, and processing time of a speech recognition system can be greatly improved by accurate and efficient endpoint detection. The traditional endpoint detection methods are mainly based on characteristics of speech such as short-time energy and zero-crossing rate [12], but these characteristics are limited to situations with no noise or a high signal-to-noise ratio and lose their effect when the signal-to-noise ratio is low [13, 14].

According to the way segmentation and clustering are combined, the current mainstream segmentation and clustering approaches can be divided into two categories [15]. One is the asynchronous strategy, that is, segmentation first and then clustering; in this strategy, segmentation and clustering are implemented step by step. The other is the synchronous strategy, that is, to complete speaker clustering while segmenting the speech of different speakers. ELISA proposed a typical speaker classification system in the literature [16], which combines two typical methods: one is based on the asynchronous strategy, represented by the CLIPS system, which first automatically cuts the audio into many small segments, so that each segment contains only one speaker, and then merges the segments of the same speaker through clustering; the other is based on the synchronous strategy, using the hidden Markov model (HMM) to achieve speaker clustering while segmenting. The LIA system is the representative of this kind of method. These two kinds of systems have their own advantages and disadvantages. The former is relatively simple, but the errors of each clustering step may accumulate. The latter can correct the errors after each clustering step, but it costs a lot of computing time and often cannot obtain sufficiently trained models.

Speech segmentation is an important part of asynchronous segmentation and clustering, and it includes speaker transform point detection and speech segmentation. Transform point detection is the key step of the segmentation module. The commonly used speaker speech segmentation methods are silence-based methods, metric-based methods, and model-based methods.

References [8, 15] proposed improved endpoint detection algorithms based on the combination of the energy and frequency band variance method and on hybrid features, respectively, in 2019. Reference [11] studied a speech endpoint detection method based on the fractal dimension with an adaptive threshold in 2020. In reference [17], the cepstrum feature is used for endpoint detection, the cepstrum distance instead of the short-time energy is used for threshold judgment, and speech detection based on the hidden Markov model is improved to adapt to noise changes. Reference [18] proposed a VAD algorithm with strong noise immunity based on wavelet analysis and a neural network. The advantage of silence-based algorithms is that the operation is relatively simple and the effect is good when the background noise is not complex, but their limitations are exposed against complex backgrounds, so more effective algorithms have been proposed.

Document [19] studies speaker transformation point detection with variable window length and realizes online detection of transformation points, but its computational cost is relatively large. Delacourt and Wellekens proposed a two-step speech segmentation algorithm, which first uses a fixed window to segment the speech initially and then merges the segmented speech segments. For different databases, this method has achieved good segmentation results. The advantage of speech segmentation based on a distance scale is that it does not need any prior knowledge of the speech and the computational cost is low; the disadvantage is that the threshold needs to be set according to experience, so the robustness and stability are poor, and it is easy to detect many redundant segmentation points.

The model method is to train models of different speakers from the corpus and then use the trained models to classify the speech frame by frame, so as to detect the change points of speakers. Commonly used methods are as follows: the universal background model (UBM) [20, 21], support vector machines (SVM) [22], and deep neural networks (DNNs) [23]. The advantage of the model-based segmentation method is that it has higher accuracy than the distance-based method, but the disadvantage is that it requires prior knowledge, and the calculation cost is very high.

In the literature [24], the Gaussian mixture model is used for class modeling, which achieves high clustering purity. Document [25] studies a speaker clustering method based on the k-means algorithm, but the clustering results are greatly affected by the choice of initial cluster centers; if the choice is not appropriate, the algorithm may fall into a local optimal solution, and the number of clusters K needs to be given in advance.

To sum up, in order to improve the accuracy of speech endpoint detection, this paper proposes a new speech endpoint detection algorithm, which replaces the traditional double-threshold short-time average zero-crossing rate with a better spectral centroid feature, smoothes the feature curve with a median filter, and selects the threshold value by counting the local maxima of the feature sequence histogram. Compared with the traditional double-threshold algorithm, the proposed speech endpoint detection algorithm retains higher detection accuracy and noise immunity at low SNR.

The k-means algorithm has the advantages of convenience, fast calculation, and accurate results, but it needs the number of clusters to be given in advance, and the results are greatly affected by the choice of the initial cluster centers, so it easily falls into a local optimum. The self-organizing neural network (SOM) has the advantages of strong interpretability, strong learning ability, and visualization, but its convergence speed is slow, it cannot provide accurate clustering information, and its clustering accuracy on small sample volumes is poor. In order to seek a better clustering means, the self-organizing neural network is introduced into speaker clustering, and an improved k-means speaker clustering algorithm based on the self-organizing neural network is designed. The network is used to predict the number of clusters and the initial cluster centers of the k-means algorithm. The number of clusters is predicted from the winning counts of the neurons in the competition layer of the trained network, and the weights of the neurons are used as the initial cluster centers of the k-means algorithm to cluster speakers. The experimental results of multispeaker mixed speech segmentation show that the improved clustering algorithm can not only make up for the shortcomings of the two algorithms but also improve the clustering accuracy.

2. Speaker Speech Segmentation Based on Improved Double-Threshold Endpoint Detection

2.1. Endpoint Detection Principle of Traditional Double-Threshold Method. The double-threshold endpoint detection method combines the short-time energy and the short-time average zero-crossing rate. Before the start of endpoint detection, two thresholds are set for each of the short-time energy and the short-time average zero-crossing rate, and the thresholds are set empirically. The first is a low threshold: its value is small, so it is more sensitive to signal changes and more easily exceeded. The second is a high threshold: its value is large, and it is exceeded only when the signal reaches a certain strength. Exceeding the low threshold does not necessarily mean the beginning of speech, as it may be caused by short-term noise; only exceeding the high threshold can basically determine the beginning of the speech signal.

The whole speech signal can be divided into several segments: silence segment, transition segment, voice segment, and end segment. The basic steps of endpoint detection are as follows:

(1) In the silence segment, if one of the features of short-time energy or zero-crossing rate exceeds the low threshold, the position is marked as the beginning of detected speech, and the transition segment is entered.
(2) In the transition stage, if the energy or zero-crossing rate characteristics of consecutive frames of speech exceed the high threshold, it is confirmed that the real speech segment has been entered; otherwise, the current state is restored to the silent state.
(3) The endpoint of the speech segment can be detected in reverse according to the above method.

To sum up, the flowchart of double-threshold endpoint detection is shown in Figure 1.

2.2. Defects of Conventional Double-Threshold Method for Endpoint Detection. The ability to resist noise is weak. The noise environment is the main factor affecting the detection results, and different SNRs and different noises will affect the accuracy of detection. Some noises contain rich high-frequency components, and correspondingly their zero-crossing rate is relatively high. If the noise is too strong, the zero-crossing rate in some silent segments will be higher than that of the vowels and initials. In a low-SNR environment, the detection results are extremely unstable.

The threshold value is usually set by experience. It is extremely imprecise to use a fixed threshold to detect different speakers or different speech situations.

Both the short-time energy and the short-time average zero-crossing rate are extracted in the time domain, so the calculation process is simple, but the actual characteristics of speech are not fully expressed.

The double-threshold method is generally used in speech recognition, where it can only detect the beginning of a speech but cannot detect the internal pauses of the speech. Endpoint detection used for speech segmentation faces a corpus whose time span is much larger than the short utterances in speech recognition, so it is necessary to detect all the segmentation points in a long audio. Obviously, the traditional method cannot meet this requirement.

2.3. Improved Design of Double-Threshold Endpoint Detection Algorithm. In view of the defects of the traditional double-threshold endpoint detection algorithm, the detection method is improved in the following three aspects:

(i) In view of the limitation of the short-time average zero-crossing rate feature, the spectral centroid feature is used to replace it, and the spectral centroid is combined with the short-time energy for detection.
(ii) In order to improve the antinoise performance of the double-threshold method, the curves of the two features are smoothed by median filtering.
(iii) In order to solve the problem of poor accuracy caused by threshold selection based on experience, an algorithm is proposed to select the threshold reasonably by analyzing the whole feature sequence.

2.3.1. Spectral Centroid Characteristics. The spectral centroid is a parameter describing the property of timbre. Different from the short-time energy and the short-time average zero-crossing rate, which are time-domain features, the spectral centroid is extracted in the frequency domain.
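Since the paper's defining formula for the spectral centroid is not reproduced in this extract, the sketch below uses the standard definition (the magnitude-weighted mean frequency of each frame's spectrum). The frame length and shift mirror the settings used in the experiments of Section 3.4; everything else is an illustrative assumption, not the authors' Matlab implementation.

```python
import numpy as np

def spectral_centroid(signal, fs, wlen=200, inc=100):
    """Per-frame spectral centroid: sum_k f_k |X_n(k)| / sum_k |X_n(k)|.

    A minimal sketch; the frame length/shift (200/100 samples) follow the
    experimental settings in Section 3.4, and a Hanning window is assumed.
    """
    x = np.asarray(signal, dtype=float)
    window = np.hanning(wlen)
    n_frames = 1 + (len(x) - wlen) // inc
    freqs = np.fft.rfftfreq(wlen, d=1.0 / fs)   # frequency of each FFT bin
    centroids = np.empty(n_frames)
    for n in range(n_frames):
        frame = x[n * inc : n * inc + wlen] * window
        mag = np.abs(np.fft.rfft(frame))        # short-time magnitude spectrum
        centroids[n] = (freqs * mag).sum() / (mag.sum() + 1e-12)
    return centroids
```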
The threshold for each feature is selected from the histogram of its feature sequence: a histogram of the sequence is computed, and its local maxima are detected. When i ≤ step, f(i) is counted as a local maximum if

mean(f(1:i)) < f(i) && mean(f(i + 1:i + step)) < f(i).  (2)

Let n denote the number of local maxima detected in the histogram. The threshold T is then selected according to the following cases:

(1) n = 0; then,

$$T = \frac{\sum_{k=1}^{N} C_k}{4N}, \tag{4}$$

where C_k is the k-th value of the feature sequence and N is its length. This expression means that if no local maximum is detected from beginning to end, the threshold value is replaced by one quarter of the average value of the feature sequence, but this case is not common.

(2) n = 1; then,

$$T = M, \tag{5}$$

where M is the only detected local maximum; this is also not often the case. Usually, more than two local maxima are detected.

(3) n ≥ 2; then,

$$T = \frac{W \cdot M_1 + M_2}{W + 1}. \tag{6}$$

Arrange all the detected maximum values in descending order of frequency. In equation (6), M1 and M2 are the first two maximum values, and W is a user-defined parameter; the higher W is, the closer the threshold value is to the first maximum value M1.

Figure 3: Spectral centroid characteristic sequence histogram.

The thresholds of the short-time energy and spectral centroid characteristics, denoted T1 and T2, respectively, are calculated by this method. When the two features of a frame of the audio signal are both higher than their threshold values, the frame is judged to be a speech signal.
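The histogram-based threshold selection above might be implemented as in the following sketch; the bin count, search step, and weight W are illustrative assumptions rather than values fixed by the paper.

```python
import numpy as np

def histogram_threshold(feature, bins=50, step=3, W=5):
    """Select a detection threshold from the histogram of a feature sequence.

    Implements the case analysis of equations (4)-(6): local maxima of the
    histogram are found with the neighborhood test of equation (2), and the
    threshold is a weighted mixture of the two most frequent maxima.
    bins, step, and W are illustrative choices.
    """
    f, edges = np.histogram(feature, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0

    maxima = []                       # (count, bin center) of each local max
    for i in range(step, bins - step):
        if f[:i].mean() < f[i] and f[i + 1 : i + 1 + step].mean() < f[i]:
            maxima.append((f[i], centers[i]))

    if not maxima:                    # n = 0: fall back to eq. (4)
        return feature.mean() / 4.0
    if len(maxima) == 1:              # n = 1: eq. (5)
        return maxima[0][1]
    maxima.sort(reverse=True)         # descending order of frequency
    M1, M2 = maxima[0][1], maxima[1][1]
    return (W * M1 + M2) / (W + 1)    # n >= 2: eq. (6)
```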
2.3.4. Speaker Speech Segmentation Based on Improved Double-Threshold Method. The improved detection process is as follows:

(i) The speech signal is collected, and its time domain waveform is obtained.
(ii) The speech is divided into frames, windowed, and the short-time Fourier transform is performed to obtain the spectrogram of the signal.
(iii) The short-time energy feature En is extracted in the time domain, and the spectral centroid feature Cn is extracted in the frequency domain.
(iv) The short-time energy feature and the spectral centroid feature are each smoothed by median filtering twice.
(v) The histograms of the above two feature sequences are calculated, respectively; the local maxima of the histograms are counted, and the threshold values of the two features are calculated. The threshold value of the short-time energy feature is T1, and that of the spectral centroid feature is T2.
(vi) If the short-time energy feature of a frame is greater than T1 and the spectral centroid feature of the frame is greater than T2, the frame is marked as a speech frame; otherwise, it is marked as a nonspeech frame.
(vii) Postprocessing stage (used according to the situation): extend the two ends of each voice segment by 2 windows, and finally merge the continuous segments as the final voice segments.

The speaker speech segmentation algorithm based on the improved double-threshold method is shown in Figure 4.

Among them, the postprocessing stage mainly takes into account the extremely short pauses that sometimes occur in speech; eliminating these pauses and merging the speech can reduce the number of voice segments and reduce the complexity of the results. However, in a few cases, these short pauses may also be the change point of the speaker, which would lead to wrong merging and affect the next stage of speech clustering. Therefore, the postprocessing method is used when the audio contains only one person's voice, but not when there is a multiperson conversation.

Figure 4: Speaker segmentation algorithm flowchart based on the improved double-threshold method VAD.
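Combining the pieces, the frame-marking loop of Figure 4 (steps (iii)-(vi)) might be sketched as follows, reusing the `spectral_centroid` and `histogram_threshold` helpers from the earlier sketches. The median-filter order of 5 is an assumption, and the postprocessing of step (vii) is omitted.

```python
import numpy as np
from scipy.signal import medfilt

def short_time_energy(signal, wlen=200, inc=100):
    """Frame-wise short-time energy (time-domain feature)."""
    x = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(x) - wlen) // inc
    return np.array([(x[n * inc : n * inc + wlen] ** 2).sum()
                     for n in range(n_frames)])

def improved_vad(signal, fs, wlen=200, inc=100):
    """Improved double-threshold decision: a frame is speech iff
    En > T1 and Cn > T2 (steps (v)-(vi) above)."""
    En = short_time_energy(signal, wlen, inc)
    Cn = spectral_centroid(signal, fs, wlen, inc)   # sketched in Section 2.3.1
    En = medfilt(medfilt(En, 5), 5)                 # smooth twice (order assumed)
    Cn = medfilt(medfilt(Cn, 5), 5)
    T1 = histogram_threshold(En)                    # sketched above
    T2 = histogram_threshold(Cn)
    return (En > T1) & (Cn > T2)                    # boolean speech-frame mask
```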
2.4. Comparative Experimental Analysis. The endpoint detection experiment on the speech signal is carried out using Matlab software, and the data are recorded with the Newsmy recorder. The experiment sample is a 1.5 s speech, and the content is the Chinese pronunciation of "Ni Hao." The output is a standard Windows WAV audio file named Hello.wav; the sampling frequency is fs = 8 kHz, monophonic, using 16 bit encoding. For the original speech, the short-time energy is extracted, as shown in Figure 6.

Figure 6: Short-time energy of original speech signal.

2.4.1. Analysis of Endpoint Detection Based on Traditional Double-Threshold Method. Using the traditional double-threshold endpoint detection algorithm, the location of the "Hello" speech in the time domain waveform is detected, and the detection result is shown in Figure 8. In Figure 8, the beginning of the speech is marked with a solid line and the end is marked with a dashed line. It can be seen from the picture that the speech starts at about 0.38 s.

2.4.2. Analysis of Endpoint Detection Based on Improved Double-Threshold Method. First, the spectral centroid of each frame is calculated, and the spectral centroid feature is extracted; then, the short-time energy and spectral centroid curves are smoothed by median filtering twice, and the threshold values of the two features are calculated simultaneously. The endpoint detection results of the improved double-threshold method are shown in Figure 9.

Figures 9(a) and 9(b) show the short-time energy and spectral centroid feature images, respectively; the solid line is the original feature curve, and the dashed line is the feature curve after two passes of smoothing filtering. The ordinate corresponding to the black thick bar in each figure is the characteristic threshold value selected after calculation. If the feature curve exceeds the bar, the feature exceeds the threshold, and only when both features exceed their thresholds is a frame judged to be speech.

For speech with different noise levels, the endpoint detection accuracy is calculated, and the accuracy results are shown in Figure 10. From the accuracy in Figure 10, we can see that both detection algorithms can accurately detect the endpoints of speech in the case of silence or very small noise. When different levels of noise are imposed on the audio files, the improved method maintains higher detection accuracy than the traditional one.
Figure 9: Results of the improved double-threshold method VAD: (a) short-time energy, (b) spectral centroid, and (c) results of improved double-threshold method VAD.
3.1. Self-Organization Neural Network. The self-organizing feature map (SOM) neural network is based on the phenomenon of lateral inhibition in the biological neural system. The basic idea is that, for a specific input pattern, each neuron competes for the opportunity to respond; ultimately only one neuron wins, and the winning neuron represents the classification of the input pattern. Therefore, the self-organizing neural network is naturally associated with clustering.

The structure of the self-organizing neural network is generally a two-layer network: an input layer plus a competition layer, with no hidden layer; sometimes there are lateral connections between the neurons in the competition layer. A typical self-organizing neural network structure is shown in Figure 12.

Input layer: simulates the retina that perceives external information; it receives the information, plays the role of observation, and transmits the input pattern to the competition layer. The number of neurons in the input layer is generally the dimension of the input samples.

Competition layer: simulates the responding cerebral cortex, which is responsible for comparative analysis of the input, looking for rules, and classification. The output of the competition layer represents the classification of the pattern, and the number of neurons is usually the number of categories.

Another structure is the two-dimensional form, which resembles the cortex more closely, as shown in Figure 13.

Figure 13: Two-dimensional SOM network model.
Each neuron in the competition layer links laterally with its nearby neurons in a certain way, forming a plane similar to a checkerboard. In this kind of structure, the neurons in the competition layer are arranged as a two-dimensional node matrix, and the neurons of the input layer and output layer are connected with each other according to the weights.

3.2. Competitive Learning Rule. The self-organizing neural network is trained with a competitive learning rule.

(1) Normalization. The input sample X and the weight vectors W_j are first normalized to unit length:

$$\hat{X} = \frac{X}{\|X\|}, \qquad \hat{W}_j = \frac{W_j}{\|W_j\|}. \tag{8}$$

(2) Find the winning neuron. Comparing $\hat{X}$ with the weights $\hat{W}_j$ of all neurons in the competition layer, the most similar neuron is the winning neuron, and its weight is $\hat{W}_{j^*}$:

$$\left\|\hat{X} - \hat{W}_{j^*}\right\| = \min_{j \in \{1,2,\ldots,m\}} \left\|\hat{X} - \hat{W}_j\right\|. \tag{9}$$

The outputs of the competition layer are then

$$y_j(t+1) = \begin{cases} 1, & j = j^*, \\ 0, & j \neq j^*. \end{cases} \tag{11}$$

Only the winning neuron has the right to adjust its weight vector, as follows:

$$\begin{cases} \hat{W}_{j^*}(t+1) = \hat{W}_{j^*}(t) + \Delta W_{j^*} = \hat{W}_{j^*}(t) + \eta(t)\left(\hat{X} - \hat{W}_{j^*}\right), & j = j^*, \\ \hat{W}_j(t+1) = \hat{W}_j(t), & j \neq j^*. \end{cases} \tag{12}$$
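A compact sketch of one competitive-learning step, combining equations (8), (9), and (12), is given below; the decaying learning-rate schedule is left to the caller, since the text only requires 0 < η(t) < 1 with η(t) attenuating toward zero.

```python
import numpy as np

def competitive_update(W, x, eta):
    """One step of the competitive learning rule.

    W   : (m, d) weight matrix, one row per competition-layer neuron
    x   : (d,) input sample
    eta : learning rate, 0 < eta < 1, decaying over time

    Normalizes input and weights (eq. 8), finds the winner (eq. 9), moves
    only the winner toward the input (eq. 12), and renormalizes afterward.
    """
    x_hat = x / np.linalg.norm(x)
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    winner = np.argmin(np.linalg.norm(x_hat - W_hat, axis=1))  # eq. (9)
    W_hat[winner] += eta * (x_hat - W_hat[winner])             # eq. (12)
    W_hat[winner] /= np.linalg.norm(W_hat[winner])             # renormalize
    return W_hat, winner
```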
Among them, 0 < η(t) < 1 is the learning rate, which generally decreases with time; that is, the degree of adjustment becomes smaller and smaller, and the weights gradually tend toward the cluster centers.

After adjustment, the weight vector is no longer a unit vector, so it needs to be normalized again, and the network is trained repeatedly until the learning rate η(t) attenuates to zero, at which point the algorithm ends. In the testing phase, the inner product of a given object with the weights of each neuron is calculated, and the object is assigned to the class of the most similar neuron.

The Kohonen algorithm is usually used for the two-dimensional self-organizing neural network structure. This algorithm is an improvement of the above competitive learning rule. The main difference between the Kohonen algorithm and the competitive learning rule is the way lateral inhibition acts in neuron weight adjustment. In the competitive learning rule, only the winning neuron has the right to adjust its weight. In the Kohonen algorithm, the influence of the winning neuron on the surrounding neurons decreases from near to far, from excitation to inhibition, so the nearby neurons also need to adjust their weights to varying degrees under its influence. Take the winning neuron as the center and set a neighborhood radius R; this range is called the winning neighborhood. In the algorithm, the neurons in the winning neighborhood adjust their weights according to their distance from the winning neuron. At the beginning, the radius of the winning neighborhood is set to be very large, and as the number of training iterations increases, it shrinks until it is zero, as shown in Figure 14.

3.3. Design of Improved k-Means Speaker Clustering Algorithm Based on Self-Organizing Neural Network. The operation of the self-organizing neural network is divided into two stages: training and testing. In the training stage, the training set samples are input; for a specific input, one neuron in the competition layer produces the largest response and wins. The neural network adjusts the weights through the training samples in a self-organizing way and finally makes some neurons in the competition layer responsive to particular classes of samples.
For each competition-layer neuron j, let P_j be the number of times it wins during training. A neuron is counted as a category if

$$P_j > \operatorname{mean}\left(P_1, P_2, \ldots, P_9\right). \tag{17}$$

Then the number of categories is incremented, k = k + 1, and j = j + 1. Continue to judge according to formula (17), and the final number of categories is obtained as k = k0. The idea here is that the more times a neuron wins, the closer it is to an actual clustering center; neurons with fewer wins (less than the average number of wins) are considered to be far from the cluster centers and are ignored.
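This category-count prediction reduces to a one-line rule. The sketch below assumes the initial nine-neuron (3 × 3) competition layer implied by equation (17); the win counts in the example are hypothetical.

```python
import numpy as np

def predict_num_clusters(win_counts):
    """Estimate the number of speakers k0 from SOM neuron win counts.

    Implements the rule of equation (17): every neuron whose win count P_j
    exceeds the mean win count contributes one category; rarely winning
    neurons are treated as far from any cluster center and ignored.
    """
    win_counts = np.asarray(win_counts, dtype=float)
    return int((win_counts > win_counts.mean()).sum())

# Example: win counts of a 3 x 3 competition layer after training
# (hypothetical numbers), yielding k0 = 3.
print(predict_num_clusters([120, 4, 95, 2, 7, 3, 110, 5, 6]))
```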
(4) Initial cluster center prediction. Retrain the self-organizing neural network: this time, k0 neurons are set in the competition layer, and everything else remains unchanged. When the network training is finished, the weight value W_l (l = 1, 2, ..., k0) of each neuron is obtained and used as an initial clustering center in the k-means method.

(5) K-means speaker clustering:

(a) The input of the algorithm is as follows: the MFCC feature set X_i (i = 1, 2, ..., n) of the speech segments, the class number k0, and the initial clustering centers μ_j:

$$\mu_j = W_j, \quad j = 1, 2, \ldots, k_0. \tag{18}$$

(b) The class partition C is initialized to

$$C_j = \varphi, \quad j = 1, 2, \ldots, k_0. \tag{19}$$

(c) The distance between each sample X_i and each cluster center μ_j is calculated as

$$d_{ij} = \left\|X_i - \mu_j\right\|_2^2. \tag{20}$$

X_i is assigned to the class λ_i corresponding to the smallest d_ij, and the class division is updated:

$$C_{\lambda_i} = C_{\lambda_i} \cup \left\{X_i\right\}. \tag{21}$$

(d) For j = 1, 2, ..., k0, recalculate the cluster center from all sample points in C_j:

$$\mu_j = \frac{1}{\left|C_j\right|} \sum_{X \in C_j} X. \tag{22}$$

(e) If |J(n) − J(n − 1)| < ξ is satisfied, or the number of iterations reaches t = T, the algorithm ends and goes to step (6); otherwise, go to step (b).

(6) Algorithm output. Output the cluster partition C = {C1, C2, ..., Ck0}, and the algorithm ends.

To sum up, the flow of the improved k-means speaker clustering algorithm based on the self-organizing neural network is shown in Figure 15.
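Steps (5)-(6) are ordinary k-means with the seeding of equation (18). A minimal sketch under the stopping rule |J(n) − J(n − 1)| < ξ or t = T:

```python
import numpy as np

def kmeans_from_som(X, init_centers, T=100, xi=1e-6):
    """k-means clustering seeded with SOM neuron weights (steps (5)-(6)).

    X            : (n, d) MFCC feature vectors, one per speech segment
    init_centers : (k0, d) trained SOM weights, mu_j = W_j (eq. 18)
    Stops when |J(n) - J(n-1)| < xi or after T iterations.
    """
    centers = init_centers.copy()
    J_prev = np.inf
    for _ in range(T):
        # eq. (20): squared Euclidean distance to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)               # eq. (21): nearest center wins
        J = d[np.arange(len(X)), labels].sum()  # clustering objective J
        if abs(J_prev - J) < xi:                # step (e): convergence test
            break
        J_prev = J
        for j in range(len(centers)):           # step (d): recompute means
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```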
3.4. Experimental Analysis. The experiment sample is a multiperson dialogue audio several minutes long, recorded with the Newsmy recording pen to simulate a multiperson meeting situation. The output is a standard Windows WAV audio file; the sampling frequency is fs = 8 kHz, monophonic, using 16 bit encoding. Recording the audio requires avoiding crosstalk, and in order to ensure the purity of the voices and improve clustering accuracy, the participants speak clearly and do not produce coughs or other noise.

The experiment process is shown in Figure 16. The k-means speaker clustering algorithm, the self-organizing neural network speaker clustering algorithm, and the improved k-means speaker clustering algorithm based on the self-organizing neural network are used to cluster the speech segments, and the effectiveness of the improved algorithm is verified by comparative analysis.

Select an audio file named Recording 1.wav, which lasts for 3 minutes and contains the voices of two men and one woman. Extract the time domain waveform of the Recording 1.wav audio file as shown in Figure 17.

Firstly, the speech signal is preprocessed, including pre-emphasis, framing, and windowing. The frame length is wlen = 200, the frame shift is inc = 100, and the window function is a Hanning window. The duration of the audio sequence is 180 s, and the total number of sampling points is 1419856 at the sampling rate of fs = 8 kHz. The sequence is divided into 14197 frames, and the corresponding time of each frame is 25 ms. Through the time domain waveform, we can see that the audio has a number of voice segments, and there is a short gap between the voice segments.

The short-time energy and spectral centroid characteristics of each frame of speech are calculated from beginning to end. The audio is segmented based on the improved double-threshold endpoint detection method, and the segmented speech waveform is shown in Figure 18. As can be seen from Figure 18, the audio is divided into a number of speech segments. In the picture, the speech segments are shown in dark colors, and the silence between the speech segments is shown in light gray. After speech segmentation, a total of 96 short-time speech segments are obtained, each of which contains only one person's speech.

In the clustering experiment, MFCC (mel-frequency cepstrum coefficient) features are used as the basis to distinguish different speakers. The average of the MFCC vectors of all frames in a speech segment is used to represent the MFCC feature of the whole segment; that is, the MFCC feature vector is obtained by averaging the feature matrix column by column. For these 96 speech segments, the MFCC feature vectors of each speech segment are extracted, respectively; part of the resulting 96 × 12 feature matrix is shown below.
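The segment-level feature described above is just the column-wise mean of the frame-by-frame MFCC matrix. A sketch using librosa (an assumption for illustration; the authors worked in Matlab):

```python
import numpy as np
import librosa  # assumed MFCC implementation, not the paper's toolchain

def segment_mfcc_vector(segment, fs, n_mfcc=12, wlen=200, inc=100):
    """12-dimensional MFCC vector for one speech segment.

    The frame-by-frame MFCC matrix is averaged over frames (column-wise in
    the paper's orientation), giving one vector per segment and matching
    the 12-column feature table below.
    """
    m = librosa.feature.mfcc(y=np.asarray(segment, dtype=float), sr=fs,
                             n_mfcc=n_mfcc, n_fft=wlen,
                             hop_length=inc, win_length=wlen)
    return m.mean(axis=1)  # average across frames -> (n_mfcc,) vector
```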
MFCC feature vectors of the speech segments (rows: segment index; columns: the 12 MFCC dimensions):

Segment 1 2 3 4 5 6 7 8 9 10 11 12
1 –0.2318 –0.6216 –0.8246 0.8197 –1 0.0115 0.0276 –0.7478 –0.3292 –1 –0.0117 –0.3540
2 –0.4504 –0.4722 –0.2641 0.6031 –0.4363 0.0251 –0.0120 –0.3220 –0.2556 –0.3009 –0.0195 –0.5682
3 0.0611 0.4078 0.4520 0.9593 –0.1953 0.2274 0.2702 0.4878 –0.1718 0.5584 –0.1076 –0.1543
4 0.0127 0.6880 –0.1491 –0.3808 –0.0066 0.0899 0.2659 0.5310 –0.4424 0.2297 –0.4673 –0.0847
5 0.1407 0.7320 0.4009 0.6745 0.1440 0.1724 0.0330 0.4351 –0.2905 0.5446 0.2133 –0.2824
6 –0.7343 –0.4021 –0.4176 –0.1656 –0.5628 –0.1922 –0.0962 –0.5104 –0.5630 –0.3270 –0.7263 –0.5573
7 –0.3879 0.7239 –0.1195 –0.3951 –0.3567 0.3155 –0.0020 0.6301 –0.0082 0.4713 –0.0835 0.0816
8 –0.9601 –0.7837 –0.3593 0.0225 –0.2102 0.0413 –0.1888 –0.6212 –0.4318 –0.3511 –0.1205 –0.2959
9 –0.8367 –1 –0.0377 0.4291 –0.6289 –0.1104 –0.4750 –0.9491 –0.3937 0.0958 –0.5412 –0.2736
10 –0.0155 0.2727 0.2820 0.6068 –0.1523 0.4739 0.3752 0.2055 –0.5360 0.1430 0.0281 –0.2739
11 –0.5196 –0.6370 –0.3923 0.1216 –0.5501 –0.0138 –0.2554 –0.5492 –0.3250 –0.4713 0.0153 –0.0793
12 0.2645 0.5404 –0.5999 0.3577 0.3054 –0.3061 0.2822 0.8176 –0.2828 0.6644 –0.4664 –1
13 0.1543 0.6268 –0.0959 –0.0056 –0.0863 –0.1137 0.3584 0.8135 –0.7636 0.2058 –3.0007 –0.5701
14 –0.3113 0.7890 0.1734 –0.0671 –0.6751 –0.1598 0.6133 0.5194 –0.4182 1 –0.3457 –0.5522
15 0.1356 0.7456 –0.1363 –0.1853 0.4774 –0.2183 0.0697 0.8824 –0.9199 0.4200 –0.8365 –0.3673
⋮
88 –0.5569 0.2714 –0.8230 0.2283 –0.0786 –0.7724 –0.2758 –0.2648 –0.6207 –0.3815 –0.2345 0.1627
89 –0.2697 –0.2357 –0.7891 –0.0462 0.3934 –0.3903 –0.1434 –0.5120 –0.0433 –0.8132 0.0258 –0.0983
90 0.0043 –0.0438 –0.8671 0.1035 –0.2169 –0.9042 –0.2299 –0.4181 –0.5621 –0.4401 –0.3949 0.1063
91 0.9374 –0.3438 –0.9019 0.2494 –0.9633 –1 0.4109 0.0358 –0.2700 –0.9540 –0.6541 –0.5618
92 0.4810 0.5058 0.5543 0.4640 –0.0819 0.2065 0.2168 9.4952e 0.1676 0.1380 0.0483 0.0595
93 0.2873 –0.2742 –0.2113 0.5668 0.0048 –0.2358 –0.2286 –0.0165 –0.2277 –0.0424 0.0560 0.0929
94 –0.2994 0.0670 –0.6855 0.0121 –0.2985 –0.6471 0.0806 –0.2228 –0.5288 –0.1501 –0.1191 0.6426
95 –0.6023 0.0626 –0.4080 –0.2971 –0.0838 –0.7251 –0.1777 –0.3717 –0.0202 –0.6585 –0.3076 0.5214
96 0.0239 –0.4341 –0.7470 0.2458 0.0398 –0.7044 –0.0350 –0.4216 0.0658 –0.4074 –0.4117 –0.0400
Clustering results, first table:

Label 1–10: a a b c b a c a a b
Label 11–20: a c c c c c b b a c
Label 21–30: c b a a a a b a a a
Label 31–40: a a b c c c c c c b
Label 41–50: a a a a b b a b c a
Label 51–60: a b c a a c c a a a
Label 61–70: c c c b b b a a a a
Label 71–80: c c c c c a a a c c
Label 81–90: a c c b c c a a a a
Label 91–96: a b a a a a

Clustering results, second table (entries marked × differ from the first table):

Label 1–10: a a b b× b a b× a a b
Label 11–20: a b× b× c c b× b b a b×
Label 21–30: b× b a a a a b a a a
Label 31–40: a a b b× b× c c b× b× b
Label 41–50: a a a c× b b c× b b× c×
Label 51–60: c× b b× c× a b× b× a c× c×
Label 61–70: b× b× b× b b b c× c× c× c×
Label 71–80: b× b× b× b× b× a ca a× b× b×
Label 81–90: a b× b× b b× b× c× c× a c×
Label 91–96: a b a c× c× a
Figure 20 (competition-layer weight matrix; rows W1, W2, W3; columns the 12 MFCC dimensions):

Neuron 1 2 3 4 5 6 7 8 9 10 11 12
1 –0.4599 –0.2687 –0.4762 0.0375 –0.3098 –0.4637 –0.3003 –0.4836 –0.4448 –0.4555 –0.1798 0.0254
2 0.1069 0.5795 0.2901 0.4029 –0.0095 0.2195 0.1038 0.2853 –0.0727 0.2792 0.1259 –0.2594
3 0.3370 0.6511 –0.3395 –0.1476 0.1696 –0.2267 0.4405 0.5404 –0.4869 0.4999 –0.5759 –0.1712
In Figure 20, the three rows of the matrix correspond to the values of W1, W2, and W3. The weight values W1, W2, and W3 are saved as the initial clustering centers of the k-means algorithm. Finally, the initial clustering centers of the k-means algorithm are set as μj = Wj (j = 1, 2, 3). The implementation of the k-means speaker clustering algorithm and the end of the experiment are shown (Table 4).

To sum up, for Recording 1.wav, the improved k-means speaker clustering algorithm based on the self-organizing neural network has achieved good clustering results. It effectively makes up for the shortcomings of the self-organizing neural network algorithm and the k-means algorithm.
Figure 21: Diagram of SOM + k-means speaker clustering results. For clarity, a partial enlargement is shown in Figure 22.

Figure 22: Diagram of SOM + k-means speaker clustering results (local magnification).
The clustering effect is shown in Figure 21, which distinguishes different speakers by different colors, and the image is intuitive. Zhang San's voice is red, Li Si's voice is blue, Wang Wu's voice is green, and the mute segments are gray.

3.4.4. Comparative Analysis. Nine more audio files are selected to verify and analyze the above experimental results; added to the audio file Recording 1.wav, there are a total of 10 recordings. The contents of each audio file are as follows:

Recording 1: contains three voices, two men and one woman
Recording 2: contains two voices, two men
Recording 3: contains two voices, two women
Recording 4: contains two voices, one man and one woman
Recording 5: contains three voices, one man and two women
Recording 6: contains three voices, three men
Recording 7: contains three voices, three women
Recording 8: contains four voices, two men and two women
Recording 9: contains four voices, two men and two women
Recording 10: contains four voices, three men and one woman

Based on the improved double-threshold endpoint detection method in this paper, the three algorithms are used to perform speaker clustering experiments. The accuracy of each algorithm is shown in Table 5.

It can be seen from the table that the clustering accuracy of the self-organizing neural network algorithm is low, and the average accuracy of the k-means algorithm is often lower than that of the self-organizing neural network algorithm because of its instability. With the increase of the number of speakers in the audio samples, or the decrease of gender differences, the clustering accuracy has a downward trend. However, on the same audio samples, the clustering accuracy of the improved k-means algorithm based on the self-organizing neural network is always higher than that of the other two algorithms.
Table 5: Comparison of speaker clustering results based on three algorithms.

Recording | K-means speaker clustering (%) | SOM speaker clustering (%) | Improved k-means speaker clustering based on SOM (%)
Sound recording 1 | 84.5 | 88.5 | 94.8
Sound recording 2 | 84.3 | 89.2 | 95.1
Sound recording 3 | 85.2 | 83.7 | 94.9
Sound recording 4 | 86.0 | 90.2 | 96.1
Sound recording 5 | 82.2 | 85.5 | 93.6
Sound recording 6 | 81.0 | 82.2 | 90.2
Sound recording 7 | 82.0 | 81.5 | 89.8
Sound recording 8 | 74.8 | 77.2 | 85.5
Sound recording 9 | 73.8 | 76.8 | 86.0
Sound recording 10 | 73.3 | 78.0 | 84.8
To sum up, compared with the k-means speaker clustering algorithm, the improved algorithm can not only predict the number of categories but also select the initial clustering centers reasonably, so that the clustering results are stable. Compared with the self-organizing neural network speaker clustering algorithm, the improved algorithm reduces the number of iterations of the network, converges faster, and greatly improves the clustering accuracy. Therefore, the improved k-means speaker clustering algorithm based on the self-organizing neural network is better than both the self-organizing neural network algorithm and the k-means algorithm.

4. Conclusion

The improved speech endpoint detection algorithm proposed in this paper can effectively eliminate isolated noise points and enhance the noise immunity of the algorithm. The threshold value is selected by the local maxima of the histogram of the statistical feature sequence, which improves the accuracy of speech detection, enhances the ability to resist noise, and better meets the requirements of speaker segmentation. Through the comparative analysis of the clustering accuracy of 10 recordings, it can be seen that, with the increase of the number of speakers in the audio samples, the clustering accuracies of the k-means and self-organizing neural network algorithms both decrease to 80% or below. However, the clustering accuracy of the improved k-means algorithm based on the self-organizing neural network is still maintained at 85%–89%. The improved k-means speaker clustering algorithm based on the self-organizing neural network improves the clustering accuracy; it not only makes up for the defects of the self-organizing neural network algorithm, whose convergence is slow and which cannot provide accurate clustering information, but also makes up for the defects of the k-means algorithm, in which the number of clusters needs to be given in advance and the result is greatly affected by the selection of the initial clustering centers.

Data Availability

All of the data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Natural Science Foundation of Liaoning Province (2019-ZD-0168 and 2020-KF-12-11), the Major Training Program of Criminal Investigation Police University of China (3242019010), and the Key Research and Development Projects of the Ministry of Science and Technology (2017YFC0821005).

References

[1] J. Yang, Z. P. Li, and P. Su, "Review of speech segmentation and endpoint detection," Journal of Computer Applications, vol. 40, no. 1, pp. 1–7, 2020.
[2] Q. Fan, Implementation and Performance Research of Speaker Logging System, Ph.D. thesis, Beijing Normal University, Beijing, China, 2011.
[3] D. Z. Yang, J. M. Xu, J. Liu et al., "Reliable mute model and speech activity detection in speaker logs," Journal of Zhejiang University (Engineering), vol. 50, no. 1, pp. 151–157, 2016.
[4] I. K. Sethi, "Video classification using speaker identification," SPIE, vol. 3022, pp. 218–225, 1997.
[5] F. Zheng, L. T. Li, and H. Zhang, "Voiceprint recognition technology and its application status," Information Security Research, vol. 2, no. 1, pp. 44–57, 2016.
[6] X. K. Li, Y. L. Zheng, N. Yuan et al., "Research on voiceprint recognition method based on deep learning," Journal of Engineering of Heilongjiang University, vol. 9, no. 1, pp. 64–70, 2018.
[7] A. Hannun, C. Case, J. Casper et al., "Deep speech: scaling up end-to-end speech recognition," Computer Science, vol. 17, pp. 1–12, 2014.
[8] H. Z. Chen and Z. J. Zhang, "A speech endpoint detection method based on energy and frequency band variance," Science Technology and Engineering, vol. 19, no. 26, pp. 249–254, 2019.
[9] N. Seman, Z. Abu Bakar, and N. Abu Bakar, "An evaluation of endpoint detection measures for Malay speech recognition of an isolated words," in Proceedings of the 2010 International
Mathematical Problems in Engineering 19