Abstract
Speech segmentation at both the word and phoneme levels is crucial for various speech processing tasks: it extracts meaningful units from an utterance and thus enables the generation of discrete elements. In this work, we propose a model-agnostic framework that performs word boundary detection in a supervised manner, employing a label augmentation technique and an output-frame selection strategy. We trained and tested on the Buckeye dataset and tested only on TIMIT, using state-of-the-art encoder models, including pre-trained solutions (Wav2Vec 2.0 and HuBERT) as well as convolutional and convolutional-recurrent networks. Our method, with the HuBERT encoder, surpasses other state-of-the-art architectures, whether trained in supervised or self-supervised settings on the same datasets. Specifically, we achieved F-values of 0.8427 on the Buckeye dataset and 0.7436 on the TIMIT dataset, along with R-values of 0.8489 and 0.7807, respectively. These results establish a new state of the art for both datasets. Beyond the immediate task, our approach offers a robust and efficient preprocessing method for future research in audio tokenization.
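As a rough illustration of the frame-classification setup described above, the sketch below pairs a pre-trained HuBERT encoder with a per-frame boundary head and includes the R-value metric used for evaluation. It is a minimal sketch, not the authors' implementation: the class name, the head design, and the HuggingFace checkpoint are assumptions, and the label augmentation and output-frame selection steps are omitted.

# Minimal sketch, assuming a HuggingFace HuBERT checkpoint; illustrative only.
import math

import torch
import torch.nn as nn
from transformers import HubertModel


class BoundaryFrameClassifier(nn.Module):
    """Per-frame binary classifier: boundary vs. non-boundary frames."""

    def __init__(self, encoder_name: str = "facebook/hubert-base-ls960"):
        super().__init__()
        self.encoder = HubertModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz; HuBERT emits one frame per ~20 ms,
        # so each pair of logits corresponds to one candidate boundary position.
        feats = self.encoder(waveform).last_hidden_state   # (batch, frames, hidden)
        return self.head(feats)                            # (batch, frames, 2)


def r_value(precision: float, recall: float) -> float:
    """Segmentation R-value (Räsänen et al., 2009), from precision and recall in [0, 1]."""
    over_segmentation = recall / precision - 1.0
    r1 = math.sqrt((1.0 - recall) ** 2 + over_segmentation ** 2)
    r2 = (-over_segmentation + recall - 1.0) / math.sqrt(2.0)
    return 1.0 - (abs(r1) + abs(r2)) / 2.0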
References
Agarwal, A., Jain, A., Prakash, N., Agrawal, S.: Word boundary detection in continuous speech based on suprasegmental features for Hindi language. In: 2nd International Conference on Signal Processing Systems (2010)
Ajmera, J., McCowan, I., Bourlard, H.: Robust speaker change detection. IEEE Signal Process. Lett. 11(8), 649–651 (2004)
Almpanidis, G., Kotropoulos, C.: Phonemic segmentation using the generalised gamma distribution and small sample Bayesian information criterion. Speech Commun. 50(1), 38–55 (2008). https://doi.org/10.1016/j.specom.2007.06.005
Aversano, G., Esposito, A., Marinaro, M.: A new text-independent method for phoneme segmentation. In: IEEE MWSCAS. vol. 2 (2001)
Baevski, A., Schneider, S., Auli, M.: vq-wav2vec: self-supervised learning of discrete speech representations. In: ICLR (2020)
Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020)
Bhati, S., Nayak, S., Murty, K.S.R.: Unsupervised speech signal to symbol transformation for zero resource speech applications. In: Interspeech (2017). https://doi.org/10.21437/Interspeech.2017-1476
Bhati, S., Villalba, J., Żelasko, P., Moro-Velazquez, L., Dehak, N.: Unsupervised speech segmentation and variable rate representation learning using segmental contrastive predictive coding. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 2002–2014 (2022)
Bhati, S., Villalba, J., Żelasko, P., Moro-Velazquez, L., Dehak, N.: Segmental contrastive predictive coding for unsupervised word segmentation. In: Interspeech (2021). https://doi.org/10.21437/Interspeech.2021-1874
Chorowski, J., Weiss, R.J., Bengio, S., Van Den Oord, A.: Unsupervised speech representation learning using Wavenet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2041–2053 (2019)
Dusan, S., Rabiner, L.: On the relation between maximum spectral transition positions and phone boundaries. In: Interspeech (2006). https://doi.org/10.21437/Interspeech.2006-230
Franke, J., Mueller, M., Hamlaoui, F., Stueker, S., Waibel, A.: Phoneme boundary detection using deep bidirectional LSTMs. In: Speech Communication; 12. ITG Symposium (2016)
Fuchs, T.S., Hoshen, Y., Keshet, J.: Unsupervised word segmentation using K nearest neighbors. In: Interspeech, pp. 4646–4650 (2022). https://doi.org/10.21437/Interspeech.2022-11474
Fuchs, T.S., Hoshen, Y.: Unsupervised word segmentation using temporal gradient pseudo-labels. In: IEEE ICASSP (2023)
Garofolo, J.S.: TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium (1993)
Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: HuBERT: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3451–3460 (2021)
Jankowski, C., Kalyanswamy, A., Basson, S., Spitz, J.: NTIMIT: a phonetically balanced, continuous speech, telephone bandwidth speech database. In: IEEE ICASSP, pp. 109–112 (1990)
Kamper, H., Jansen, A., Goldwater, S.: Unsupervised word segmentation and lexicon discovery using acoustic word embeddings. IEEE/ACM Trans. Audio Speech Lang. Process. 24(4), 669–679 (2016)
Kamper, H., Jansen, A., Goldwater, S.: A segmental framework for fully-unsupervised large-vocabulary speech recognition. Comput. Speech Lang. 46, 154–174 (2017)
Kamper, H., Livescu, K., Goldwater, S.: An embedded segmental k-means model for unsupervised segmentation and clustering of speech. In: IEEE ASRU (2017)
Kamper, H., van Niekerk, B.: Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks. In: Interspeech (2021). https://doi.org/10.21437/Interspeech.2021-50
Keshet, J., Shalev-Shwartz, S., Singer, Y., Chazan, D.: Phoneme alignment based on discriminative learning. In: Interspeech (2005). https://doi.org/10.21437/Interspeech.2005-129
Kreuk, F., Keshet, J., Adi, Y.: Self-supervised contrastive learning for unsupervised phoneme segmentation. In: Interspeech (2020). https://doi.org/10.21437/Interspeech.2020-2398
Kreuk, F., Sheena, Y., Keshet, J., Adi, Y.: Phoneme boundary detection using learnable segmental features. In: IEEE ICASSP (2020)
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., Sonderegger, M.: Montreal forced aligner: trainable text-speech alignment using Kaldi. In: Interspeech (2017). https://doi.org/10.21437/Interspeech.2017-1386
Michel, P., Rasanen, O., Thiollière, R., Dupoux, E.: Blind phoneme segmentation with temporal prediction errors. In: ACL Student Research Workshop, pp. 62–68 (2017)
Naganoor, V., Jagadish, A.K., Chemmangat, K.: Word boundary estimation for continuous speech using higher order statistical features. In: IEEE TENCON (2016)
Payne, B., Ng, S., Shantz, K., Federmeier, K.: Event-related brain potentials in multilingual language processing: the N's and P's. In: Psychology of Learning and Motivation, pp. 75–118. Academic Press (2020). https://doi.org/10.1016/bs.plm.2020.03.003
Petek, B., Andersen, O., Dalsgaard, P.: On the robust automatic segmentation of spontaneous speech. In: IEEE ICSLP. vol. 2 (1996)
Pitt, M.A., Johnson, K., Hume, E., Kiesling, S., Raymond, W.: The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability. Speech Commun. 45(1), 89–95 (2005)
Räsänen, O.J., Laine, U.K., Altosaar, T.: An improved speech segmentation quality measure: the R-value. In: Interspeech (2009)
Salamon, J., MacConnell, D., Cartwright, M., Li, P., Bello, J.P.: Scaper: a library for soundscape synthesis and augmentation. In: IEEE WASPAA (2017)
Shezi, N., Reddy, S.: Word boundary estimation of isiZulu continuous speech. In: IEEE PICC, pp. 1–6 (2020)
Strgar, L., Harwath, D.: Phoneme segmentation using self-supervised speech models. In: IEEE SLT (2023)
Venkatesh, S., et al.: Artificially synthesising data for audio classification and segmentation to improve speech and music detection in radio broadcast. In: IEEE ICASSP (2021)
Venkatesh, S., Moffat, D., Miranda, E.R.: Investigating the effects of training set synthesis for audio segmentation of radio broadcast. Electronics 10(7), 827 (2021)
Venkatesh, S., Moffat, D., Miranda, E.R.: You only hear once: a YOLO-like algorithm for audio segmentation and sound event detection. Appl. Sci. 12(7), 3293 (2022). https://doi.org/10.3390/app12073293
Wang, Y.H., Chung, C.T., Lee, H.Y.: Gate activation signal analysis for gated recurrent neural networks and its correlation with phoneme boundaries. In: Interspeech (2017). https://doi.org/10.21437/Interspeech.2017-877
Acknowledgements
Simone Carnemolla and Salvatore Calcagno acknowledge financial support from: PNRR MUR project PE0000013-FAIR.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Carnemolla, S., Calcagno, S., Palazzo, S., Giordano, D. (2025). Back to Supervision: Boosting Word Boundary Detection Through Frame Classification. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15333. Springer, Cham. https://doi.org/10.1007/978-3-031-80136-5_9
DOI: https://doi.org/10.1007/978-3-031-80136-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-80135-8
Online ISBN: 978-3-031-80136-5