Multimodal Deep Learning

J Ngiam, A Khosla, M Kim, J Nam, H Lee… - Proceedings of the 28th International Conference on Machine Learning, 2011 - ai.stanford.edu
Abstract
Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images, or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross-modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature-learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audio-only data but tested with video-only data, and vice versa. Our models are validated on the CUAVE and AVLetters datasets for audio-visual speech classification, achieving the best published visual speech classification results on AVLetters and effective shared representation learning.
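To make the shared-representation idea concrete, the sketch below is a minimal, hypothetical illustration, not the authors' implementation: the paper builds its models from restricted Boltzmann machines and deep autoencoders, whereas this example uses a plain PyTorch bimodal autoencoder. The class name `BimodalAutoencoder`, the layer sizes, and the training helper are all assumptions for illustration; what the sketch preserves is the core mechanism of mapping both modalities into one shared layer and requiring reconstruction of both modalities even when only one is presented.

```python
# Hypothetical sketch of shared-representation learning across two modalities.
# Not the authors' code: layer sizes, names, and the use of PyTorch instead of
# RBM pre-training are assumptions for illustration.
import torch
import torch.nn as nn


class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=100, video_dim=300, shared_dim=128):
        super().__init__()
        # Modality-specific encoders.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, 256), nn.ReLU())
        # Shared representation over the concatenated modality codes.
        self.shared = nn.Sequential(nn.Linear(512, shared_dim), nn.ReLU())
        # Decoders reconstruct *both* modalities from the shared code.
        self.audio_dec = nn.Linear(shared_dim, audio_dim)
        self.video_dec = nn.Linear(shared_dim, video_dim)

    def forward(self, audio, video):
        h = self.shared(torch.cat([self.audio_enc(audio),
                                   self.video_enc(video)], dim=1))
        return self.audio_dec(h), self.video_dec(h)


def training_step(model, audio, video, optimizer):
    """One denoising-style step: the model sees both modalities, or a single
    modality with the other zeroed out, but must always reconstruct both,
    which pushes the shared layer toward a modality-invariant code."""
    loss_fn = nn.MSELoss()
    optimizer.zero_grad()
    loss = 0.0
    for a_in, v_in in [(audio, video),                      # both modalities
                       (audio, torch.zeros_like(video)),    # audio only
                       (torch.zeros_like(audio), video)]:   # video only
        a_rec, v_rec = model(a_in, v_in)
        loss = loss + loss_fn(a_rec, audio) + loss_fn(v_rec, video)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this reading, the cross-modality evaluation described in the abstract amounts to training a classifier on shared codes computed from one modality (say, audio with the video input zeroed) and testing it on shared codes computed from the other, which only works if the shared layer has learned a representation common to both.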