DOI: 10.1109/ICASSP.2016.7472622
research-article

A deep scattering spectrum — Deep Siamese network pipeline for unsupervised acoustic modeling

Published: 01 March 2016

Abstract

Recent work has explored deep architectures for learning acoustic features in an unsupervised or weakly supervised way for phone recognition. Here we investigate the role of the input features, and in particular we test whether standard mel-scaled filterbanks could be replaced by inherently richer representations, such as those derived from an analytic scattering spectrum. We use a Siamese network with lexical side information, similar to a well-performing architecture used in the Zero Resource Speech Challenge (2015), and show a substantial improvement when the filterbanks are replaced by scattering features, even though the two feature types yield similar performance when tested without training. This shows that unsupervised and weakly supervised architectures can benefit from richer features than the traditional ones.
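As a rough illustration of the Siamese setup the abstract describes, the sketch below passes two input frames through one shared embedding and scores the pair with a cosine-based loss. The abstract does not specify the network or the loss, so everything here is an assumption: the single tanh layer, the 40-dimensional input, the names `embed` and `siamese_pair_loss`, and the particular loss (low when "same" pairs align, low when "different" pairs are orthogonal). The random vectors stand in for filterbank or scattering frames.

```python
import numpy as np

def embed(x, W, b):
    """Shared branch: the same weights W, b are applied to both inputs."""
    return np.tanh(x @ W + b)

def cosine(u, v, eps=1e-8):
    """Cosine similarity, with a small eps to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def siamese_pair_loss(xa, xb, same, W, b):
    """Hypothetical pair loss (an assumption, not taken from the abstract).

    same pair      -> (1 - cos) / 2, minimized when embeddings align
    different pair -> cos ** 2,      minimized when embeddings are orthogonal
    """
    c = cosine(embed(xa, W, b), embed(xb, W, b))
    return (1.0 - c) / 2.0 if same else c ** 2

# Toy data: random vectors standing in for acoustic feature frames.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(40, 16))  # 40-dim input, 16-dim embedding
b = np.zeros(16)
frame_a = rng.normal(size=40)
frame_b = rng.normal(size=40)

loss_same = siamese_pair_loss(frame_a, frame_a, True, W, b)
loss_diff = siamese_pair_loss(frame_a, frame_b, False, W, b)
print(loss_same, loss_diff)
```

In a weakly supervised setting of this kind, the "same" and "different" labels would come from the lexical side information (for example, frames aligned across two instances of the same word versus frames from different words), and the shared weights would be trained by gradient descent on this loss. Swapping the input features then only changes what `frame_a` and `frame_b` contain.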


Cited By

  • (2020) Kymatio. The Journal of Machine Learning Research 21:1 (2256-2261). DOI: 10.5555/3455716.3455776. Online publication date: 1-Jan-2020


Published In

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
6592 pages

Publisher

IEEE Press
