DOI: 10.1145/2983563.2983567

Multimodal and Crossmodal Representation Learning from Textual and Visual Features with Bidirectional Deep Neural Networks for Video Hyperlinking

Published: 16 October 2016

Abstract

Video hyperlinking is a classical example of a multimodal problem. Common approaches to such problems are early fusion of the initial modalities and crossmodal translation from one modality to the other. Recently, deep neural networks, and deep autoencoders in particular, have proven promising both for crossmodal translation and for early fusion via multimodal embedding. One architecture, the bidirectional symmetrical deep neural network, has been shown to yield better multimodal embeddings than classical autoencoders while remaining able to perform crossmodal translation. In this work, we first evaluate good single-modal continuous representations for both textual and visual information: Word2Vec and paragraph vectors are evaluated for representing collections of words, such as parts of automatic transcripts and sets of multiple visual concepts, while different deep convolutional neural networks are evaluated for embedding visual information directly, avoiding the creation of visual concepts. Second, we evaluate methods for multimodal fusion and crossmodal translation, with different single-modal pairs, on the task of video hyperlinking. Bidirectional (symmetrical) deep neural networks have been shown to successfully address the downsides of multimodal autoencoders and to yield a superior multimodal representation; here we test them extensively in different settings, with different single-modal representations, in the context of video hyperlinking. Compared to classical deep autoencoders, our bidirectional symmetrical deep neural networks yield significantly (α = 0.0001) improved multimodal embeddings: an absolute improvement in precision at 10 of 14.1% when embedding visual concepts and automatic transcripts, and of 4.3% when embedding automatic transcripts with features obtained from very deep convolutional neural networks, reaching 80% precision at 10.
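To make the architecture concrete, below is a minimal sketch of a bidirectional symmetrical deep neural network of the kind described in the abstract, written in PyTorch. The layer sizes, tanh activations, reconstruction loss, and the tied-transpose weight-sharing scheme are illustrative assumptions for this sketch, not the exact configuration used in the paper.

```python
# A minimal, assumption-laden sketch of a bidirectional symmetrical DNN:
# two crossmodal translation paths (text -> visual and visual -> text)
# whose central layers share tied weights, projecting both modalities
# into a common joint space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDNN(nn.Module):
    def __init__(self, text_dim=100, visual_dim=4096, hidden_dim=1000, joint_dim=500):
        super().__init__()
        # Modality-specific input layers (sizes are illustrative).
        self.text_in = nn.Linear(text_dim, hidden_dim)
        self.vis_in = nn.Linear(visual_dim, hidden_dim)
        # Central weight matrix, tied across both translation directions:
        # applied as-is to enter the joint space, transposed to leave it.
        self.W = nn.Parameter(torch.empty(joint_dim, hidden_dim))
        nn.init.xavier_uniform_(self.W)
        # Modality-specific output layers for crossmodal reconstruction.
        self.text_out = nn.Linear(hidden_dim, text_dim)
        self.vis_out = nn.Linear(hidden_dim, visual_dim)

    def forward(self, text, visual):
        # Text -> joint space -> visual translation.
        h_t = torch.tanh(self.text_in(text))
        z_t = torch.tanh(F.linear(h_t, self.W))            # joint embedding of text
        vis_pred = self.vis_out(torch.tanh(F.linear(z_t, self.W.t())))
        # Visual -> joint space -> text translation (symmetric path, same W).
        h_v = torch.tanh(self.vis_in(visual))
        z_v = torch.tanh(F.linear(h_v, self.W))            # joint embedding of visual
        text_pred = self.text_out(torch.tanh(F.linear(z_v, self.W.t())))
        # Multimodal embedding: concatenation of the two central activations.
        embedding = torch.cat([z_t, z_v], dim=1)
        return vis_pred, text_pred, embedding

# Training would minimize the two crossmodal reconstruction errors, e.g.:
model = BiDNN()
text_batch, visual_batch = torch.randn(8, 100), torch.randn(8, 4096)
vis_pred, text_pred, emb = model(text_batch, visual_batch)
loss = F.mse_loss(vis_pred, visual_batch) + F.mse_loss(text_pred, text_batch)
```

Because both translation directions pass through the same tied central weights, text and visual inputs land in a single joint space, so the central activations can be concatenated and used directly as the multimodal embedding for hyperlinking.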





        Published In

        iV&L-MM '16: Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion
        October 2016
        70 pages
        ISBN:9781450345194
        DOI:10.1145/2983563

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 16 October 2016


        Author Tags

        1. CNN
        2. DNN
        3. autoencoder
        4. bidirectional learning
        5. convolutional neural networks
        6. crossmodal
        7. deep learning
        8. deep neural networks
        9. embedding
        10. multimodal
        11. multimodal fusion
        12. neural networks
        13. representation
        14. retrieval
        15. shared weights
        16. tied weights
        17. video and text
        18. video hyperlinking
        19. video retrieval

        Qualifiers

        • Research-article

        Conference

MM '16: ACM Multimedia Conference
October 16, 2016
Amsterdam, The Netherlands

        Acceptance Rates

iV&L-MM '16 Paper Acceptance Rate: 7 of 15 submissions, 47%
Overall Acceptance Rate: 7 of 15 submissions, 47%

Article Metrics

• Downloads (last 12 months): 3
• Downloads (last 6 weeks): 0

Reflects downloads up to 01 Feb 2025

        Cited By

• (2025) Edge-Cloud Collaborated Object Detection via Bandwidth Adaptive Difficult-Case Discriminator. IEEE Transactions on Mobile Computing 24(2):1181-1196, Feb 2025. DOI: 10.1109/TMC.2024.3474743
• (2023) Edge-Cloud Collaborated Object Detection via Difficult-Case Discriminator. 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), 259-270, Jul 2023. DOI: 10.1109/ICDCS57875.2023.00062
• (2022) A novel integrative computational framework for breast cancer radiogenomic biomarker discovery. Computational and Structural Biotechnology Journal 20:2484-2494, 2022. DOI: 10.1016/j.csbj.2022.05.031
• (2021) Unsupervised Learning of Cross-Modal Mappings in Multi-Omics data for Survival Stratification of Gastric Cancer. Future Oncology 18(2):215-230, 2 Dec 2021. DOI: 10.2217/fon-2021-1059
• (2021) Bidirectional deep neural networks to integrate RNA and DNA data for predicting outcome for patients with hepatocellular carcinoma. Future Oncology 17(33):4481-4495, Nov 2021. DOI: 10.2217/fon-2021-0659
• (2021) Imbalanced Source-free Domain Adaptation. Proceedings of the 29th ACM International Conference on Multimedia, 3330-3339, 17 Oct 2021. DOI: 10.1145/3474085.3475487
• (2021) InterBN: Channel Fusion for Adversarial Unsupervised Domain Adaptation. Proceedings of the 29th ACM International Conference on Multimedia, 3691-3700, 17 Oct 2021. DOI: 10.1145/3474085.3475481
• (2021) Towards Robust Cross-domain Image Understanding with Unsupervised Noise Removal. Proceedings of the 29th ACM International Conference on Multimedia, 3024-3033, 17 Oct 2021. DOI: 10.1145/3474085.3475175
• (2021) PFA: Privacy-preserving Federated Adaptation for Effective Model Personalization. Proceedings of the Web Conference 2021, 923-934, 19 Apr 2021. DOI: 10.1145/3442381.3449847
• (2021) Learning to Match Anchor-Target Video Pairs With Dual Attentional Holographic Networks. IEEE Transactions on Image Processing 30:8130-8143, 2021. DOI: 10.1109/TIP.2021.3113165
