DOI: 10.1145/2808492.2808542
research-article

Multimodal tag localization based on deep learning

Published: 19 August 2015

    Abstract

    Tag localization, which localizes the video clips relevant to an associated semantic tag, has become an important research topic in video retrieval and recommendation. Most existing approaches depend heavily on carefully selected features that are manually designed by experts, and they do not take multimodality into consideration. To exploit the complementarity of different modalities and take advantage of learned features, we propose a multimodal tag localization framework that uses deep learning to learn both visual and textual features of videos, followed by multimodal fusion of the visual and textual results. Extensive experiments on a public dataset show that the proposed approach achieves promising results. Tag localization based on visual deep learning greatly improves precision, and fusing the visual and textual modalities improves precision further, despite the low performance of the textual modality on its own.
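
The abstract does not say how the visual and textual results are fused. As a minimal sketch, assuming a weighted late fusion of per-shot relevance scores, the idea might look like the Python below. The function names, the 0.7 visual weight, and the 0.5 threshold are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so the two modalities are comparable."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All shots scored the same: no shot stands out in this modality.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(
    visual: Sequence[float],
    textual: Sequence[float],
    visual_weight: float = 0.7,  # assumed weight, favouring the stronger modality
) -> List[float]:
    """Weighted late fusion of per-shot scores from two modalities."""
    if len(visual) != len(textual):
        raise ValueError("both modalities must score the same shots")
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must lie in [0, 1]")
    v = min_max_normalize(visual)
    t = min_max_normalize(textual)
    return [visual_weight * a + (1.0 - visual_weight) * b for a, b in zip(v, t)]


def localize_tag(
    visual: Sequence[float],
    textual: Sequence[float],
    visual_weight: float = 0.7,
    threshold: float = 0.5,  # assumed cut-off on the fused score
) -> List[int]:
    """Return indices of shots whose fused score reaches the threshold."""
    fused = fuse_scores(visual, textual, visual_weight)
    return [i for i, s in enumerate(fused) if s >= threshold]


if __name__ == "__main__":
    # Hypothetical per-shot scores for one tag over three shots.
    visual_scores = [0.9, 0.2, 0.6]
    textual_scores = [0.1, 0.1, 0.4]
    print(localize_tag(visual_scores, textual_scores))
```

In this sketch a weak textual score can still lift a shot over the threshold when the visual score is borderline, which is one simple way a low-performing modality could contribute complementary evidence.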



      Published In

      ICIMCS '15: Proceedings of the 7th International Conference on Internet Multimedia Computing and Service
      August 2015
      397 pages
      ISBN:9781450335287
      DOI:10.1145/2808492
      General Chairs: Ramesh Jain, Shuqiang Jiang
      Program Chairs: John Smith, Jitao Sang, Guohui Li
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. deep learning
      2. multi-modal fusion
      3. semantic tag localization

      Qualifiers

      • Research-article

      Funding Sources

      • the Cosponsored Project of Beijing Committee of Education
      • Beijing Natural Science Foundation
      • National Nature Science Foundation of China
      • the Funds for Creative Research Groups of China under Grant
      • 863 Project

      Conference

      ICIMCS '15

      Acceptance Rates

      ICIMCS '15 paper acceptance rate: 20 of 128 submissions (16%)
      Overall acceptance rate: 163 of 456 submissions (36%)
