Chapter

Challenges and applications in multimodal machine learning

Published: 01 October 2018

    Published In

    The Handbook of Multimodal-Multisensor Interfaces: Signal Processing, Architectures, and Detection of Emotion and Cognition - Volume 2
    October 2018
    2034 pages
    ISBN: 9781970001716
    DOI: 10.1145/3107990

    Publisher

    Association for Computing Machinery and Morgan & Claypool

    Qualifiers

    • Chapter

    Appears in

    ACM Books

    Cited By

    • (2024) Research on the Multimodal Teaching Model of Spoken English in Colleges and Universities by Counting and ANN Modeling. Applied Mathematics and Nonlinear Sciences, 9(1). DOI: 10.2478/amns-2024-3198. Online publication date: 11-Nov-2024.
    • (2023) Electronic skins with multimodal sensing and perception. Soft Science, 3(3). DOI: 10.20517/ss.2023.15. Online publication date: 11-Jul-2023.
    • (2023) On Popularity Bias of Multimodal-aware Recommender Systems: A Modalities-driven Analysis. Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval, pp. 59-68. DOI: 10.1145/3606040.3617441. Online publication date: 2-Nov-2023.
    • (2023) Deep Multimodal Sequence Fusion by Regularized Expressive Representation Distillation. IEEE Transactions on Multimedia, 25: 2085-2096. DOI: 10.1109/TMM.2022.3142448. Online publication date: 2023.
    • (2023) Federated Learning on Multimodal Data: A Comprehensive Survey. Machine Intelligence Research, 20(4): 539-553. DOI: 10.1007/s11633-022-1398-0. Online publication date: 1-Jun-2023.
    • (2023) Visuo-haptic object perception for robots: an overview. Autonomous Robots. DOI: 10.1007/s10514-023-10091-y. Online publication date: 14-Mar-2023.
    • (2020) Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots. Synthesis Lectures on Human Language Technologies, 13(3): 1-251. DOI: 10.2200/S01060ED1V01Y202010HLT048. Online publication date: 30-Oct-2020.
    • (2020) An Overview of End-to-End Entity Resolution for Big Data. ACM Computing Surveys, 53(6): 1-42. DOI: 10.1145/3418896. Online publication date: 6-Dec-2020.
