Acoustic segment modeling with spectral clustering methods

Published: 01 February 2015

Abstract

This paper presents a study of spectral clustering-based approaches to acoustic segment modeling (ASM). ASM aims to find the underlying phoneme-like speech units and build the corresponding acoustic models in an unsupervised setting, where neither prior linguistic knowledge nor manual transcriptions are available. A typical ASM process involves three stages: initial segmentation, segment labeling, and iterative modeling. This work focuses on improving segment labeling. Specifically, we use posterior features as the segment representations and apply spectral clustering algorithms to these representations. We propose a Gaussian component clustering (GCC) approach and a segment clustering (SC) approach. GCC applies spectral clustering to a set of Gaussian components, and SC applies spectral clustering to a large number of speech segments. Moreover, to exploit the complementary information in different posterior representations, we propose a multiview segment clustering (MSC) approach, which uses multiple posterior representations simultaneously to cluster speech segments. To address the computational cost of spectral clustering on large numbers of speech segments, we use an inner-product similarity graph and reformulate the problem to avoid explicitly computing the affinity matrix and the Laplacian matrix. We carried out two sets of experiments for evaluation. First, we evaluated ASM accuracy on the OGI-MTS dataset, where our approach yielded an 18.7% relative improvement in purity and a 15.1% relative improvement in normalized mutual information (NMI) over the baseline approach. Second, we examined the performance of our approaches in a real application, zero-resource query-by-example spoken term detection on the SWS2012 dataset, where they provided consistent improvements across four testing scenarios and three evaluation metrics.
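
To make the inner-product idea concrete, the following is a minimal sketch of the general technique, not the paper's exact formulation. It assumes each segment is a non-negative feature vector (for example, a posterior vector), so that the affinity matrix is W = X Xᵀ. The normalized affinity then factors as Z Zᵀ with Z = D^(-1/2) X, and its leading eigenvectors can be obtained from the small d × d matrix Zᵀ Z without ever forming the n × n matrix. The function names, the row normalization step, and the simple k-means routine are illustrative assumptions.

```python
import numpy as np


def spectral_embedding_inner_product(X, k):
    """Top-k spectral embedding for the affinity W = X @ X.T, without forming W.

    X is an (n, d) array of non-negative segment features (assumed here to be
    posterior-like vectors); k must not exceed the rank of X.
    """
    X = np.asarray(X, dtype=float)
    # Degrees of W = X X^T: d_i = x_i . (sum_j x_j), computed in O(n d).
    degrees = X @ X.sum(axis=0)
    # Z Z^T equals the normalized affinity D^{-1/2} W D^{-1/2}.
    Z = X / np.sqrt(degrees)[:, None]
    # Eigen-decompose the small (d, d) matrix Z^T Z instead of the (n, n) one.
    evals, V = np.linalg.eigh(Z.T @ Z)
    top = np.argsort(evals)[::-1][:k]
    # Map eigenvectors of Z^T Z to eigenvectors of Z Z^T: u = Z v / sqrt(lambda).
    U = (Z @ V[:, top]) / np.sqrt(evals[top])
    # Row-normalize the embedding before clustering.
    return U / np.linalg.norm(U, axis=1, keepdims=True)


def kmeans(points, k, n_iter=50):
    """Plain Lloyd k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq_dist = ((points[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = np.argmin(sq_dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def cluster_segments(X, k):
    """Assign each row of X to one of k clusters via the implicit embedding."""
    return kmeans(spectral_embedding_inner_product(X, k), k)
```

In this sketch the cost is dominated by the d × d eigen-decomposition and a few O(n d) products, so memory grows linearly with the number of segments n rather than quadratically.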

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing  Volume 23, Issue 2
February 2015
193 pages
ISSN:2329-9290
EISSN:2329-9304
Editor: Haizhou Li

Publisher

IEEE Press

Publication History

Published: 01 February 2015
Accepted: 11 December 2014
Revised: 29 October 2014
Received: 30 June 2014
Published in TASLP Volume 23, Issue 2

Author Tags

  1. acoustic segment modeling
  2. multiview segment clustering
  3. sub-word unit discovery
  4. unsupervised training
  5. zero-resource query-by-example spoken term detection

Qualifiers

  • Article
