Abstract
The objective of this paper is to develop an unsupervised method for segmenting speech signals into phoneme-like units. The proposed algorithm is based on the observation that feature vectors from the same segment exhibit a higher degree of similarity than feature vectors from different segments. The kernel-Gram matrix of an utterance is formed by computing the similarity between every pair of feature vectors in the Gaussian kernel space. The kernel-Gram matrix exhibits square patches along the principal diagonal, corresponding to different phoneme-like segments in the speech signal. The proposed method automatically detects both the number of segments and their boundaries. It does not assume any prior information about the input utterances, such as the exact distribution of segment lengths or the correct number of segments in an utterance. The proposed method outperforms state-of-the-art blind segmentation algorithms on the Zero Resource 2015 and TIMIT databases.
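The block structure described above can be illustrated with a minimal sketch. It assumes synthetic two-dimensional features in place of real speech features, a hand-picked kernel width `sigma`, and a naive boundary rule (a drop in similarity between adjacent frames) chosen only for illustration. The function name `gaussian_gram`, the threshold of 0.5, and the toy data are all assumptions and are not taken from the paper, whose actual boundary detection procedure is not described in this abstract.

```python
import numpy as np


def gaussian_gram(features, sigma=1.0):
    """Return K with K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(features ** 2, axis=1)
    # Pairwise squared Euclidean distances via the expansion of ||a - b||^2.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (features @ features.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


# Toy "utterance" (assumption): three segments of 20, 30 and 25 frames,
# each clustered tightly around its own centre in a 2-D feature space.
rng = np.random.default_rng(0)
lengths = [20, 30, 25]
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
frames = np.vstack(
    [c + 0.1 * rng.standard_normal((n, 2)) for c, n in zip(centers, lengths)]
)

K = gaussian_gram(frames, sigma=1.0)

# Frames within a segment are similar (values near 1), so K shows square
# patches along the principal diagonal; across segments the values are near 0.
within = K[:20, :20].mean()
across = K[:20, 20:50].mean()

# Naive illustrative rule (not the paper's method): place a boundary wherever
# the similarity between adjacent frames falls below a fixed threshold.
adjacent = np.diag(K, k=1)
boundaries = (np.flatnonzero(adjacent < 0.5) + 1).tolist()

print("mean similarity within first segment:", within)
print("mean similarity across first two segments:", across)
print("detected boundary frames:", boundaries)
```

In this toy setting the number of segments is not supplied anywhere: it follows from the number of detected boundaries, which mirrors the claim that the segment count and the boundaries are both read off the matrix. Real speech features would need a more robust detection rule than a fixed threshold.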
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bhati, S., Nayak, S., Sri Rama Murty, K. (2018). Unsupervised Segmentation of Speech Signals Using Kernel-Gram Matrices. In: Rameshan, R., Arora, C., Dutta Roy, S. (eds) Computer Vision, Pattern Recognition, Image Processing, and Graphics. NCVPRIPG 2017. Communications in Computer and Information Science, vol 841. Springer, Singapore. https://doi.org/10.1007/978-981-13-0020-2_13
DOI: https://doi.org/10.1007/978-981-13-0020-2_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0019-6
Online ISBN: 978-981-13-0020-2
eBook Packages: Computer Science, Computer Science (R0)