
Back to Supervision: Boosting Word Boundary Detection Through Frame Classification

  • Conference paper
Pattern Recognition (ICPR 2024)

Abstract

Speech segmentation at both the word and phoneme levels is crucial for various speech processing tasks. It helps extract meaningful units from an utterance, enabling the generation of discrete elements. In this work, we propose a model-agnostic framework that performs word boundary detection in a supervised manner, combined with a label augmentation technique and an output-frame selection strategy. We trained and tested on the Buckeye dataset and tested only (without training) on the TIMIT dataset, using state-of-the-art encoder models, including pre-trained solutions (Wav2Vec 2.0 and HuBERT) as well as convolutional and convolutional recurrent networks. Our method, with the HuBERT encoder, surpasses the performance of other state-of-the-art architectures trained on the same datasets, whether in supervised or self-supervised settings. Specifically, we achieved F-values of 0.8427 on the Buckeye dataset and 0.7436 on the TIMIT dataset, along with R-values of 0.8489 and 0.7807, respectively. These results establish a new state of the art for both datasets. Beyond the immediate task, our approach offers a robust and efficient preprocessing method for future research in audio tokenization.
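
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how an F-value and an R-value can be computed from reference and predicted boundary times. It is not taken from the paper: the 20 ms tolerance, the greedy one-to-one matching, and the function names `count_hits` and `segmentation_scores` are assumptions made for illustration, and the authors' evaluation protocol may differ. The R-value follows the commonly used definition based on hit rate and over-segmentation, expressed here with fractions rather than percentages.

```python
import math


def count_hits(ref, hyp, tol=0.02):
    """Count predicted boundaries that match a reference boundary.

    Greedy one-to-one matching (an assumption of this sketch): each
    reference boundary can be claimed by at most one prediction that
    lies within `tol` seconds of it.
    """
    ref = sorted(ref)
    used = [False] * len(ref)
    hits = 0
    for h in sorted(hyp):
        best = None
        for i, r in enumerate(ref):
            if used[i] or abs(h - r) > tol:
                continue
            if best is None or abs(h - r) < abs(h - ref[best]):
                best = i
        if best is not None:
            used[best] = True
            hits += 1
    return hits


def segmentation_scores(ref, hyp, tol=0.02):
    """Return precision, recall, F-value and R-value as fractions in [0, 1]."""
    if not ref or not hyp:
        raise ValueError("both boundary lists must be non-empty")
    hits = count_hits(ref, hyp, tol)
    precision = hits / len(hyp)
    recall = hits / len(ref)  # also called the hit rate
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom > 0 else 0.0

    # Over-segmentation: relative excess of predicted boundaries.
    over_seg = len(hyp) / len(ref) - 1
    # r1: distance from the ideal point (hit rate 1, over-segmentation 0).
    r1 = math.sqrt((1 - recall) ** 2 + over_seg ** 2)
    # r2: signed distance term that penalises over-segmentation.
    r2 = (-over_seg + recall - 1) / math.sqrt(2)
    r_value = 1 - (abs(r1) + abs(r2)) / 2

    return {
        "precision": precision,
        "recall": recall,
        "f_value": f_value,
        "r_value": r_value,
    }


if __name__ == "__main__":
    reference = [0.50, 1.00, 1.50, 2.00]   # hypothetical boundary times (s)
    predicted = [0.51, 1.03, 1.49, 2.40]
    print(segmentation_scores(reference, predicted))
```

Unlike the F-value, the R-value cannot be inflated by predicting many extra boundaries, which is why segmentation papers usually report both.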


Notes

  1. https://github.com/simonecarnemolla/Word-Segmenter


Acknowledgements

Simone Carnemolla and Salvatore Calcagno acknowledge financial support from: PNRR MUR project PE0000013-FAIR.


Corresponding author

Correspondence to Simone Carnemolla.


1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 169 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Carnemolla, S., Calcagno, S., Palazzo, S., Giordano, D. (2025). Back to Supervision: Boosting Word Boundary Detection Through Frame Classification. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15333. Springer, Cham. https://doi.org/10.1007/978-3-031-80136-5_9


  • DOI: https://doi.org/10.1007/978-3-031-80136-5_9


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-80135-8

  • Online ISBN: 978-3-031-80136-5

  • eBook Packages: Computer Science, Computer Science (R0)
