research-article

Multimodal tag localization based on deep learning

Authors:

JinTao Li

ICIMCS '15: Proceedings of the 7th International Conference on Internet Multimedia Computing and Service

Article No.: 50, Pages 1 - 4

https://doi.org/10.1145/2808492.2808542

Published: 19 August 2015

Abstract

Tag localization, which locates the video clips relevant to an associated semantic tag, has become an important research topic in video retrieval and recommendation. Most existing approaches depend heavily on carefully selected features that are manually designed by experts, and they do not take multimodality into consideration. To exploit the complementarity of different modalities and take advantage of learned features, we propose a multimodal tag localization framework that uses deep learning to learn both visual and textual features of videos, followed by multimodal fusion of the visual and textual results. Extensive experiments on a public dataset show that the proposed approach achieves promising results. Tag localization based on visual deep learning greatly improves precision, and fusing the visual and textual modalities improves precision further, despite the low performance of the textual modality alone.
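The abstract does not say how the visual and textual results are fused, so the following is only a minimal sketch of one common option: weighted late fusion of per-shot relevance scores. The function names, the min-max normalization, the `visual_weight` parameter, and the example scores are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(
    visual: Sequence[float],
    textual: Sequence[float],
    visual_weight: float = 0.7,  # assumed value, not from the paper
) -> List[float]:
    """Weighted late fusion of per-shot scores from two modalities."""
    if len(visual) != len(textual):
        raise ValueError("both modalities must score the same shots")
    v = min_max_normalize(visual)
    t = min_max_normalize(textual)
    return [visual_weight * a + (1.0 - visual_weight) * b for a, b in zip(v, t)]


def localize_tag(
    visual: Sequence[float],
    textual: Sequence[float],
    visual_weight: float = 0.7,
    top_k: int = 1,
) -> List[int]:
    """Return the indices of the top_k shots ranked by fused score."""
    fused = fuse_scores(visual, textual, visual_weight)
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return ranked[:top_k]


# Hypothetical relevance scores of three shots for one tag.
visual_scores = [0.1, 0.9, 0.4]
textual_scores = [0.8, 0.2, 0.3]
best_shots = localize_tag(visual_scores, textual_scores, top_k=2)
```

Giving the visual modality the larger weight mirrors the abstract's observation that the textual modality alone performs poorly; in practice the weight would be tuned on held-out data.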


Cited By

R. Zhang, S. Tang, W. Liu, Y. Zhang, and J. Li. Multi-modal tag localization for mobile video search. Multimedia Systems, 23(6):713-724, 2017. Online publication date: 1 November 2017.
https://dl.acm.org/doi/10.1007/s00530-016-0506-9

Index Terms

Multimodal tag localization based on deep learning
1. Information systems
  1. Information retrieval



Published In


ICIMCS '15: Proceedings of the 7th International Conference on Internet Multimedia Computing and Service

August 2015

397 pages

ISBN: 9781450335287

DOI: 10.1145/2808492

General Chairs:
Ramesh Jain, University of California, Irvine
Shuqiang Jiang, Institute of Computing Technology, Chinese Academy of Sciences, China

Program Chairs:
John Smith, IBM Thomas J. Watson Research Center
Jitao Sang, Institute of Automation, Chinese Academy of Sciences, China
Guohui Li, National University of Defense Technology, China

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States




Funding Sources

the Cosponsored Project of Beijing Committee of Education
Beijing Natural Science Foundation
National Natural Science Foundation of China
the Funds for Creative Research Groups of China under Grant
863 Project

Conference

ICIMCS '15

ICIMCS '15: International Conference on Internet Multimedia Computing and Service

August 19 - 21, 2015

Zhangjiajie, Hunan, China

Acceptance Rates

ICIMCS '15 paper acceptance rate: 20 of 128 submissions (16%)

Overall acceptance rate: 163 of 456 submissions (36%)


