Abstract
Speech segmentation at both the word and phoneme levels is crucial for various speech processing tasks: it extracts meaningful units from an utterance and thus enables the generation of discrete elements. In this work, we propose a model-agnostic framework that performs word boundary detection in a supervised manner, employing a label augmentation technique and an output-frame selection strategy. We trained and tested on the Buckeye dataset and tested only on TIMIT, using state-of-the-art encoder models, including pre-trained solutions (Wav2Vec 2.0 and HuBERT) as well as convolutional and convolutional-recurrent networks. Our method, with the HuBERT encoder, surpasses other state-of-the-art architectures, whether trained in supervised or self-supervised settings on the same datasets. Specifically, we achieved F-values of 0.8427 on the Buckeye dataset and 0.7436 on the TIMIT dataset, along with R-values of 0.8489 and 0.7807, respectively. These results establish a new state of the art for both datasets. Beyond the immediate task, our approach offers a robust and efficient preprocessing method for future research in audio tokenization.
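As a rough illustration of the frame-classification setup described above, the sketch below pairs a pre-trained HuBERT encoder with a per-frame boundary head and includes the R-value metric used for evaluation. It is a minimal sketch, not the authors' implementation: the class name, the head design, and the HuggingFace checkpoint are assumptions, and the label augmentation and output-frame selection steps are omitted.

# Minimal sketch, assuming a HuggingFace HuBERT checkpoint; illustrative only.
import math

import torch
import torch.nn as nn
from transformers import HubertModel


class BoundaryFrameClassifier(nn.Module):
    """Per-frame binary classifier: boundary vs. non-boundary frames."""

    def __init__(self, encoder_name: str = "facebook/hubert-base-ls960"):
        super().__init__()
        self.encoder = HubertModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz; HuBERT emits one frame per ~20 ms,
        # so each pair of logits corresponds to one candidate boundary position.
        feats = self.encoder(waveform).last_hidden_state   # (batch, frames, hidden)
        return self.head(feats)                            # (batch, frames, 2)


def r_value(precision: float, recall: float) -> float:
    """Segmentation R-value (Räsänen et al., 2009), from precision and recall in [0, 1]."""
    over_segmentation = recall / precision - 1.0
    r1 = math.sqrt((1.0 - recall) ** 2 + over_segmentation ** 2)
    r2 = (-over_segmentation + recall - 1.0) / math.sqrt(2.0)
    return 1.0 - (abs(r1) + abs(r2)) / 2.0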
References
Agarwal, A., Jain, A., Prakash, N., Agrawal, S.: Word boundary detection in continuous speech based on suprasegmental features for Hindi language. In: 2nd International Conference on Signal Processing Systems (2010)
Ajmera, J., McCowan, I., Bourlard, H.: Robust speaker change detection. IEEE Signal Process. Lett. 11(8), 649–651 (2004)
Almpanidis, G., Kotropoulos, C.: Phonemic segmentation using the generalised gamma distribution and small sample Bayesian information criterion. Speech Commun. 50(1), 38–55 (2008). https://doi.org/10.1016/j.specom.2007.06.005
Aversano, G., Esposito, A., Marinaro, M.: A new text-independent method for phoneme segmentation. In: IEEE MWSCAS. vol. 2 (2001)
Baevski, A., Schneider, S., Auli, M.: vq-wav2vec: self-supervised learning of discrete speech representations. In: ICLR (2020)
Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020)
Bhati, S., Nayak, S., Murty, K.S.R.: Unsupervised speech signal to symbol transformation for zero resource speech applications. In: Interspeech (2017). https://doi.org/10.21437/Interspeech.2017-1476
Bhati, S., Villalba, J., Żelasko, P., Moro-Velazquez, L., Dehak, N.: Unsupervised speech segmentation and variable rate representation learning using segmental contrastive predictive coding. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 2002–2014 (2022)
Bhati, S., Villalba, J., Żelasko, P., Moro-Velazquez, L., Dehak, N.: Segmental contrastive predictive coding for unsupervised word segmentation. In: Interspeech (2021). https://doi.org/10.21437/Interspeech.2021-1874
Chorowski, J., Weiss, R.J., Bengio, S., Van Den Oord, A.: Unsupervised speech representation learning using Wavenet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2041–2053 (2019)
Dusan, S., Rabiner, L.: On the relation between maximum spectral transition positions and phone boundaries. In: Interspeech (2006). https://doi.org/10.21437/Interspeech.2006-230
Franke, J., Mueller, M., Hamlaoui, F., Stueker, S., Waibel, A.: Phoneme boundary detection using deep bidirectional LSTMs. In: Speech Communication; 12. ITG Symposium (2016)
Fuchs, T.S., Hoshen, Y., Keshet, J.: Unsupervised word segmentation using K nearest neighbors. In: Interspeech, pp. 4646–4650 (2022). https://doi.org/10.21437/Interspeech.2022-11474
Fuchs, T.S., Hoshen, Y.: Unsupervised word segmentation using temporal gradient pseudo-labels. In: IEEE ICASSP (2023)
Garofolo, J.S.: TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium (1993)
Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: HuBERT: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3451–3460 (2021)
Jankowski, C., Kalyanswamy, A., Basson, S., Spitz, J.: NTIMIT: a phonetically balanced, continuous speech, telephone bandwidth speech database. In: IEEE ICASSP, pp. 109–112 (1990)
Kamper, H., Jansen, A., Goldwater, S.: Unsupervised word segmentation and lexicon discovery using acoustic word embeddings. IEEE/ACM Trans. Audio Speech Lang. Process. 24(4), 669–679 (2016)
Kamper, H., Jansen, A., Goldwater, S.: A segmental framework for fully-unsupervised large-vocabulary speech recognition. Comput. Speech Lang. 46, 154–174 (2017)
Kamper, H., Livescu, K., Goldwater, S.: An embedded segmental k-means model for unsupervised segmentation and clustering of speech. In: IEEE ASRU (2017)
Kamper, H., van Niekerk, B.: Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks. In: Interspeech (2021). https://doi.org/10.21437/Interspeech.2021-50
Keshet, J., Shalev-Shwartz, S., Singer, Y., Chazan, D.: Phoneme alignment based on discriminative learning. In: Interspeech (2005). https://doi.org/10.21437/Interspeech.2005-129
Kreuk, F., Keshet, J., Adi, Y.: Self-supervised contrastive learning for unsupervised phoneme segmentation. In: Interspeech (2020). https://doi.org/10.21437/Interspeech.2020-2398
Kreuk, F., Sheena, Y., Keshet, J., Adi, Y.: Phoneme boundary detection using learnable segmental features. In: IEEE ICASSP (2020)
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: trainable text-speech alignment using Kaldi. In: Interspeech (2017). https://doi.org/10.21437/Interspeech.2017-1386
Michel, P., Rasanen, O., Thiollière, R., Dupoux, E.: Blind phoneme segmentation with temporal prediction errors. In: ACL Student Research Workshop, pp. 62–68 (2017)
Naganoor, V., Jagadish, A.K., Chemmangat, K.: Word boundary estimation for continuous speech using higher order statistical features. In: IEEE TENCON (2016)
Payne, B., Ng, S., Shantz, K., Federmeier, K.: Event-related brain potentials in multilingual language processing: the N's and P's. In: Psychology of Learning and Motivation, pp. 75–118. Academic Press (2020). https://doi.org/10.1016/bs.plm.2020.03.003
Petek, B., Andersen, O., Dalsgaard, P.: On the robust automatic segmentation of spontaneous speech. In: IEEE ICSLP. vol. 2 (1996)
Pitt, M.A., Johnson, K., Hume, E., Kiesling, S., Raymond, W.: The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability. Speech Commun. 45(1), 89–95 (2005)
Räsänen, O.J., Laine, U.K., Altosaar, T.: An improved speech segmentation quality measure: the R-value. In: Interspeech (2009)
Salamon, J., MacConnell, D., Cartwright, M., Li, P., Bello, J.P.: Scaper: a library for soundscape synthesis and augmentation. In: IEEE WASPAA (2017)
Shezi, N., Reddy, S.: Word boundary estimation of isiZulu continuous speech. In: IEEE PICC, pp. 1–6 (2020)
Strgar, L., Harwath, D.: Phoneme segmentation using self-supervised speech models. In: IEEE SLT (2023)
Venkatesh, S., et al.: Artificially synthesising data for audio classification and segmentation to improve speech and music detection in radio broadcast. In: IEEE ICASSP (2021)
Venkatesh, S., Moffat, D., Miranda, E.R.: Investigating the effects of training set synthesis for audio segmentation of radio broadcast. Electronics 10(7), 827 (2021)
Venkatesh, S., Moffat, D., Miranda, E.R.: You only hear once: a YOLO-like algorithm for audio segmentation and sound event detection. Appl. Sci. 12(7), 3293 (2022). https://doi.org/10.3390/app12073293
Wang, Y.H., Chung, C.T., Lee, H.Y.: Gate activation signal analysis for gated recurrent neural networks and its correlation with phoneme boundaries. In: Interspeech (2017). https://doi.org/10.21437/Interspeech.2017-877
Acknowledgements
Simone Carnemolla and Salvatore Calcagno acknowledge financial support from: PNRR MUR project PE0000013-FAIR.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Carnemolla, S., Calcagno, S., Palazzo, S., Giordano, D. (2025). Back to Supervision: Boosting Word Boundary Detection Through Frame Classification. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15333. Springer, Cham. https://doi.org/10.1007/978-3-031-80136-5_9
DOI: https://doi.org/10.1007/978-3-031-80136-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-80135-8
Online ISBN: 978-3-031-80136-5