Multi-modal Deep Analysis for Multimedia

Zhu, Wenwu; Wang, Xin; Li, Hongzhi

doi:10.1109/TCSVT.2019.2940647

arXiv:1910.04964 (cs)

[Submitted on 11 Oct 2019 (v1), last revised 4 Jan 2020 (this version, v2)]

Title: Multi-modal Deep Analysis for Multimedia

Authors: Wenwu Zhu, Xin Wang, Hongzhi Li

Abstract: With the rapid development of the Internet and multimedia services in the past decade, a huge amount of user-generated and service provider-generated multimedia data has become available. These data are heterogeneous and multi-modal in nature, which poses great challenges for processing and analysis. Multi-modal data consist of a mixture of data types from different modalities, such as text, images, video, and audio. In this article, we present a deep and comprehensive overview of multi-modal analysis in multimedia. We introduce two scientific research problems: data-driven correlational representation and knowledge-guided fusion for multimedia analysis. To address these two problems, we investigate them from the following aspects: 1) multi-modal correlational representation: multi-modal fusion of data across different modalities, and 2) multi-modal data and knowledge fusion: multi-modal fusion of data with domain knowledge. More specifically, on data-driven correlational representation, we highlight three important categories of methods: multi-modal deep representation, multi-modal transfer learning, and multi-modal hashing. On knowledge-guided fusion, we discuss approaches for fusing knowledge with data and four exemplar applications that require various kinds of domain knowledge: multi-modal visual question answering, multi-modal video summarization, multi-modal visual pattern mining, and multi-modal recommendation. Finally, we present our insights and future research directions.

Comments:	25 pages, 39 figures, IEEE Transactions on Circuits and Systems for Video Technology
Subjects:	Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:1910.04964 [cs.MM]
	(or arXiv:1910.04964v2 [cs.MM] for this version)
	https://doi.org/10.48550/arXiv.1910.04964
Related DOI:	https://doi.org/10.1109/TCSVT.2019.2940647

Submission history

From: Xin Wang
[v1] Fri, 11 Oct 2019 04:21:36 UTC (6,827 KB)
[v2] Sat, 4 Jan 2020 08:42:13 UTC (14,282 KB)
