research-article

Multimodal video classification with stacked contractive autoencoders

Published: 01 March 2016

Abstract

In this paper we propose a multimodal feature learning mechanism based on deep networks (i.e., stacked contractive autoencoders) for video classification. Considering the three modalities in video, i.e., image, audio and text, we first build one Stacked Contractive Autoencoder (SCAE) for each single modality, whose outputs are joined together and fed into another Multimodal Stacked Contractive Autoencoder (MSCAE). The first stage preserves intra-modality semantic relations and the second stage discovers inter-modality semantic correlations. Experiments on a real-world dataset demonstrate that the proposed approach achieves better performance than the state-of-the-art methods.

Highlights

  • A two-stage framework for multimodal video classification is proposed.
  • The model is built on stacked contractive autoencoders.
  • The first stage is single-modal pre-training.
  • The second stage is multimodal fine-tuning.
  • The objective functions are optimized by stochastic gradient descent.
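The two-stage structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the tied-weight sigmoid autoencoder, the squared-error reconstruction term, the penalty weight `lam`, the random stand-in features, and every function name are assumptions made for illustration. Only the objective and the forward pass are shown; the stochastic gradient descent training mentioned in the abstract is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_cae(n_in, n_hidden, rng):
    # One contractive autoencoder layer with tied weights (an assumption).
    return {"W": rng.normal(0.0, 0.1, (n_hidden, n_in)),
            "b": np.zeros(n_hidden),
            "c": np.zeros(n_in)}

def encode(p, x):
    return sigmoid(x @ p["W"].T + p["b"])

def cae_loss(p, x, lam=0.1):
    # Reconstruction error plus the contractive penalty: the squared
    # Frobenius norm of the encoder Jacobian dh/dx. For a sigmoid encoder
    # this equals sum_j (h_j * (1 - h_j))^2 * ||W_j||^2.
    h = encode(p, x)
    x_hat = sigmoid(h @ p["W"] + p["c"])
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    jac = np.mean(np.sum((h * (1.0 - h)) ** 2 * np.sum(p["W"] ** 2, axis=1), axis=1))
    return recon + lam * jac

def stack_encode(layers, x):
    # Pass the input through each layer of a stacked autoencoder in turn.
    for p in layers:
        x = encode(p, x)
    return x

rng = np.random.default_rng(0)

# Hypothetical per-modality features for a batch of 4 videos.
features = {"image": rng.random((4, 20)),
            "audio": rng.random((4, 12)),
            "text": rng.random((4, 30))}

# Stage 1: one SCAE per modality (layer sizes are illustrative only).
scae = {"image": [init_cae(20, 16, rng), init_cae(16, 8, rng)],
        "audio": [init_cae(12, 10, rng), init_cae(10, 8, rng)],
        "text": [init_cae(30, 16, rng), init_cae(16, 8, rng)]}
codes = [stack_encode(scae[m], features[m]) for m in ("image", "audio", "text")]

# Stage 2: join the single-modality codes and feed them to the MSCAE.
joint_in = np.concatenate(codes, axis=1)
mscae = [init_cae(joint_in.shape[1], 16, rng), init_cae(16, 8, rng)]
joint = stack_encode(mscae, joint_in)

first_layer_loss = cae_loss(scae["image"][0], features["image"])
```

In this sketch `joint` plays the role of the shared representation that a downstream classifier would consume; which classifier is used, and how the layers are trained, is left open here.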

    Published In

    Signal Processing  Volume 120, Issue C
    March 2016
    824 pages

    Publisher

    Elsevier North-Holland, Inc.

    United States

    Author Tags

    1. Deep learning
    2. Multimodal
    3. Stacked contractive autoencoder
    4. Video classification

    Qualifiers

    • Research-article

    Cited By

    • (2024) BiCAE – A Bimodal Convolutional Autoencoder for Seed Purity Testing. Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track, pp. 447-462. DOI: 10.1007/978-3-031-70381-2_28. Online publication date: 8-Sep-2024
    • (2022) Cross-linked unified embedding for cross-modality representation learning. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 15942-15955. DOI: 10.5555/3600270.3601430. Online publication date: 28-Nov-2022
    • (2022) ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2703-2711. DOI: 10.1145/3534678.3539170. Online publication date: 14-Aug-2022
    • (2022) A Deep Clustering via Automatic Feature Embedded Learning for Human Activity Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32:1, pp. 210-223. DOI: 10.1109/TCSVT.2021.3057469. Online publication date: 1-Jan-2022
    • (2022) A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer: International Journal of Computer Graphics, 38:8, pp. 2939-2970. DOI: 10.1007/s00371-021-02166-7. Online publication date: 1-Aug-2022
    • (2021) Cross-Modal Pyramid Translation for RGB-D Scene Recognition. International Journal of Computer Vision, 129:8, pp. 2309-2327. DOI: 10.1007/s11263-021-01475-7. Online publication date: 1-Aug-2021
    • (2021) A survey on video content rating: taxonomy, challenges and open issues. Multimedia Tools and Applications, 80:16, pp. 24121-24145. DOI: 10.1007/s11042-021-10838-8. Online publication date: 1-Jul-2021
    • (2021) Pattern analysis based acoustic signal processing: a survey of the state-of-art. International Journal of Speech Technology, 24:4, pp. 913-955. DOI: 10.1007/s10772-020-09681-3. Online publication date: 1-Dec-2021
    • (2019) Reconstruction of Hidden Representation for Robust Feature Extraction. ACM Transactions on Intelligent Systems and Technology, 10:2, pp. 1-24. DOI: 10.1145/3284174. Online publication date: 12-Jan-2019
    • (2018) A hierarchical classification method using belief functions. Signal Processing, 148:C, pp. 68-77. DOI: 10.1016/j.sigpro.2018.02.021. Online publication date: 1-Jul-2018
