research-article

Multimodal video classification with stacked contractive autoencoders

Authors:

Zhiguang Zhou

Signal Processing, Volume 120, Issue C

Pages 761 - 766

https://doi.org/10.1016/j.sigpro.2015.01.001

Published: 01 March 2016

Abstract

In this paper we propose a multimodal feature learning mechanism based on deep networks (i.e., stacked contractive autoencoders) for video classification. Considering the three modalities in video, i.e., image, audio and text, we first build one Stacked Contractive Autoencoder (SCAE) for each single modality; the outputs of these networks are then joined together and fed into a Multimodal Stacked Contractive Autoencoder (MSCAE). The first stage preserves intra-modality semantic relations and the second stage discovers inter-modality semantic correlations. Experiments on a real-world dataset demonstrate that the proposed approach achieves better performance than state-of-the-art methods.

Highlights

A two-stage framework for multimodal video classification is proposed.
The model is built on stacked contractive autoencoders.
The first stage is single-modal pre-training.
The second stage is multimodal fine-tuning.
The objective functions are optimized by stochastic gradient descent.
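The two-stage pipeline summarized in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no layer sizes, activation functions, penalty weights or fine-tuning details, so everything below (sigmoid encoder, linear decoder, untied weights, the `lam` weight, the layer sizes, the synthetic features, and the names `ContractiveAE` and `train_stack`) is an assumption chosen for illustration. The contractive term is the squared Frobenius norm of the encoder Jacobian, which for a sigmoid layer has a closed form. The final classifier is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ContractiveAE:
    """One contractive autoencoder layer (illustrative assumptions:
    sigmoid encoder, linear decoder, untied weights)."""

    def __init__(self, n_in, n_hidden, lam=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (n_hidden, n_in))  # encoder weights
        self.b = np.zeros(n_hidden)
        self.V = rng.normal(0.0, 0.1, (n_hidden, n_in))  # decoder weights
        self.c = np.zeros(n_in)
        self.lam = lam  # weight of the contractive penalty (hypothetical value)

    def encode(self, X):
        return sigmoid(X @ self.W.T + self.b)

    def loss(self, X):
        H = self.encode(X)
        R = H @ self.V + self.c
        rec = np.mean(np.sum((R - X) ** 2, axis=1))
        # ||dh/dx||_F^2 for a sigmoid layer: sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
        pen = np.mean(((H * (1 - H)) ** 2) @ np.sum(self.W ** 2, axis=1))
        return rec + self.lam * pen

    def sgd_step(self, X, lr):
        """One gradient step on a minibatch X (gradients derived by hand)."""
        n = X.shape[0]
        H = self.encode(X)
        G = H * (1 - H)                       # sigmoid derivative
        s = np.sum(self.W ** 2, axis=1)       # row norms of W
        dR = 2.0 * (H @ self.V + self.c - X)  # d(reconstruction error)/dR
        dH = dR @ self.V.T + self.lam * 2.0 * G * (1 - 2 * H) * s
        dA = dH * G                           # gradient at the pre-activation
        gW = dA.T @ X / n + self.lam * 2.0 * np.mean(G ** 2, axis=0)[:, None] * self.W
        gV = H.T @ dR / n
        self.W -= lr * gW
        self.b -= lr * dA.mean(axis=0)
        self.V -= lr * gV
        self.c -= lr * dR.mean(axis=0)

def train_stack(X, sizes, epochs=30, lr=0.05, batch=16, seed=0):
    """Greedy layer-wise training; returns the layers and the top-level codes."""
    rng = np.random.default_rng(seed)
    layers, H = [], X
    for k, n_hidden in enumerate(sizes):
        layer = ContractiveAE(H.shape[1], n_hidden, seed=seed + k)
        for _ in range(epochs):
            order = rng.permutation(len(H))
            for start in range(0, len(H), batch):
                layer.sgd_step(H[order[start:start + batch]], lr)
        layers.append(layer)
        H = layer.encode(H)
    return layers, H

rng = np.random.default_rng(42)
n = 64
# Synthetic stand-ins for per-video features; the dimensions are hypothetical.
modalities = {
    "image": rng.random((n, 12)),
    "audio": rng.random((n, 8)),
    "text": rng.random((n, 10)),
}

# Stage 1: one stacked contractive autoencoder per modality.
codes = [train_stack(X, sizes=[8, 6])[1] for X in modalities.values()]

# Stage 2: join the per-modality codes and train a multimodal stack on them.
joint_input = np.concatenate(codes, axis=1)
mscae_layers, fused = train_stack(joint_input, sizes=[8, 4])
print(fused.shape)  # (64, 4)
```

In this sketch the second stage simply trains a new stack on the concatenated codes; how the paper performs its multimodal fine-tuning is not stated in the abstract. The `fused` array is the kind of joint representation that would be passed to a downstream classifier.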

Multimodal video classification with stacked contractive autoencoders
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches

Published In

Signal Processing Volume 120, Issue C

March 2016

824 pages

ISSN:0165-1684

Copyright © Elsevier B.V.

Publisher

Elsevier North-Holland, Inc.

United States

Bibliometrics

Total Citations: 14
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 16 Feb 2025

Citations

Cited By

Kukushkin M, Bogdan M, Schmid T (2024). BiCAE – A Bimodal Convolutional Autoencoder for Seed Purity Testing. Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track, pp. 447-462. doi:10.1007/978-3-031-70381-2_28. Online publication date: 8-Sep-2024.
https://dl.acm.org/doi/10.1007/978-3-031-70381-2_28
Tu X, Cao Z, Xia C, Mostafavi S, Gao G, Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A (2022). Cross-linked unified embedding for cross-modality representation learning. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 15942-15955. doi:10.5555/3600270.3601430. Online publication date: 28-Nov-2022.
https://dl.acm.org/doi/10.5555/3600270.3601430
Baltescu P, Chen H, Pancha N, Zhai A, Leskovec J, Rosenberg C, Zhang A, Rangwala H (2022). ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2703-2711. doi:10.1145/3534678.3539170. Online publication date: 14-Aug-2022.
https://dl.acm.org/doi/10.1145/3534678.3539170
Wang T, Ng W, Li J, Wu Q, Zhang S, Nugent C, Shewell C (2022). A Deep Clustering via Automatic Feature Embedded Learning for Human Activity Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32:1, pp. 210-223. doi:10.1109/TCSVT.2021.3057469. Online publication date: 1-Jan-2022.
https://dl.acm.org/doi/10.1109/TCSVT.2021.3057469
Bayoudh K, Knani R, Hamdaoui F, Mtibaa A (2022). A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer: International Journal of Computer Graphics, 38:8, pp. 2939-2970. doi:10.1007/s00371-021-02166-7. Online publication date: 1-Aug-2022.
https://dl.acm.org/doi/10.1007/s00371-021-02166-7
Du D, Wang L, Li Z, Wu G (2021). Cross-Modal Pyramid Translation for RGB-D Scene Recognition. International Journal of Computer Vision, 129:8, pp. 2309-2327. doi:10.1007/s11263-021-01475-7. Online publication date: 1-Aug-2021.
https://dl.acm.org/doi/10.1007/s11263-021-01475-7
Khaksar Pour A, Chaw Seng W, Palaiahnakote S, Tahaei H, Anuar N (2021). A survey on video content rating: taxonomy, challenges and open issues. Multimedia Tools and Applications, 80:16, pp. 24121-24145. doi:10.1007/s11042-021-10838-8. Online publication date: 1-Jul-2021.
https://dl.acm.org/doi/10.1007/s11042-021-10838-8
Chaki J (2021). Pattern analysis based acoustic signal processing: a survey of the state-of-art. International Journal of Speech Technology, 24:4, pp. 913-955. doi:10.1007/s10772-020-09681-3. Online publication date: 1-Dec-2021.
https://dl.acm.org/doi/10.1007/s10772-020-09681-3
Yu Z, Li T, Yu N, Pan Y, Chen H, Liu B (2019). Reconstruction of Hidden Representation for Robust Feature Extraction. ACM Transactions on Intelligent Systems and Technology, 10:2, pp. 1-24. doi:10.1145/3284174. Online publication date: 12-Jan-2019.
https://dl.acm.org/doi/10.1145/3284174
Alshamaa D, Chehade F, Honeine P (2018). A hierarchical classification method using belief functions. Signal Processing, 148:C, pp. 68-77. doi:10.1016/j.sigpro.2018.02.021. Online publication date: 1-Jul-2018.
https://dl.acm.org/doi/10.1016/j.sigpro.2018.02.021
