Abstract
This paper presents a speaker identification system based on dynamic features of both the audio and visual modalities. Speakers are modeled using a text-dependent HMM methodology. Both early and late audio-visual integration are investigated. Experiments are carried out on 252 speakers from the XM2VTS database. The experimental results show that adding dynamic visual information improves speaker identification accuracy over the audio-only case under both clean and noisy audio conditions. The best audio, visual, and audio-visual identification accuracies achieved were 86.91%, 57.14%, and 94.05%, respectively.
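The difference between early and late integration can be illustrated with a minimal sketch. The abstract does not state the fusion rule or the feature dimensions, so everything below is an assumption made for illustration: early integration is shown as per-frame concatenation of already time-aligned audio and visual feature vectors, and late integration as a weighted sum of per-speaker log-likelihoods produced by separate audio and visual models. The function names, the `audio_weight` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def early_integration(
    audio_frames: Sequence[Sequence[float]],
    visual_frames: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate audio and visual feature vectors frame by frame.

    Assumption: both streams were already resampled to the same frame rate,
    so frame i of one stream corresponds to frame i of the other.
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("audio and visual streams must have the same number of frames")
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]


def late_integration(
    audio_scores: Dict[str, float],
    visual_scores: Dict[str, float],
    audio_weight: float,
) -> str:
    """Fuse per-speaker log-likelihoods and return the best-scoring speaker.

    Assumption: a simple weighted sum, with audio_weight in [0, 1] and the
    visual stream weighted by (1 - audio_weight).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if audio_scores.keys() != visual_scores.keys():
        raise ValueError("both score sets must cover the same speakers")
    fused = {
        speaker: audio_weight * audio_scores[speaker]
        + (1.0 - audio_weight) * visual_scores[speaker]
        for speaker in audio_scores
    }
    return max(fused, key=fused.get)


# Hypothetical log-likelihoods for two enrolled speakers.
audio_scores = {"speaker_a": -10.0, "speaker_b": -12.0}
visual_scores = {"speaker_a": -30.0, "speaker_b": -20.0}

joint_features = early_integration([[1.0, 2.0]], [[3.0]])
audio_only_choice = late_integration(audio_scores, visual_scores, audio_weight=1.0)
equal_weight_choice = late_integration(audio_scores, visual_scores, audio_weight=0.5)
```

In this sketch, lowering `audio_weight` shifts the decision toward the visual stream, which is one common way to make a late-integration system less sensitive to noisy audio.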
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fox, N., Reilly, R.B. (2003). Audio-Visual Speaker Identification Based on the Use of Dynamic Audio and Visual Features. In: Kittler, J., Nixon, M.S. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2003. Lecture Notes in Computer Science, vol 2688. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44887-X_86
DOI: https://doi.org/10.1007/3-540-44887-X_86
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40302-9
Online ISBN: 978-3-540-44887-7
eBook Packages: Springer Book Archive