research-article

Towards Micro-video Understanding by Joint Sequential-Sparse Modeling

Authors:

Baoquan Chen

MM '17: Proceedings of the 25th ACM international conference on Multimedia

Pages 970 - 978

https://doi.org/10.1145/3123266.3123341

Published: 19 October 2017

Abstract

Like traditional long videos, micro-videos unify textual, acoustic, and visual modalities, which sequentially tell a real-life event from distinct angles. Yet, unlike traditional long videos with rich content, micro-videos are very short, lasting 6-15 seconds, and hence usually convey only one or a few high-level concepts. In light of this, it is necessary to characterize and jointly model the sparseness and the multiple sequential structures for better micro-video understanding. To accomplish this, we present an end-to-end deep learning model that packs three parallel LSTMs to capture the sequential structures and a convolutional neural network to learn sparse concept-level representations of micro-videos. We apply our model to micro-video categorization. In addition, we constructed a real-world dataset for sequence modeling and released it to facilitate further research. Experimental results demonstrate that our model outperforms several state-of-the-art baselines.
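
The architecture outlined in the abstract can be illustrated with a rough, untrained sketch. It runs one LSTM per modality, concatenates the final hidden states, passes them through a small 1-D convolution with ReLU and global max pooling to obtain a sparse concept vector, and then applies a softmax classifier. This is only a sketch under stated assumptions, not the authors' implementation: the class names, layer sizes, fusion by concatenation, pooling choice, random weights, and the omission of bias terms are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class LSTM:
    """Single-layer LSTM (biases omitted) that returns its final hidden state."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, acting on the concatenation [x_t ; h_{t-1}].
        self.W = {g: rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
                  for g in "ifog"}

    def forward(self, sequence):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in sequence:
            z = list(x) + h
            i = [sigmoid(a) for a in matvec(self.W["i"], z)]    # input gate
            f = [sigmoid(a) for a in matvec(self.W["f"], z)]    # forget gate
            o = [sigmoid(a) for a in matvec(self.W["o"], z)]    # output gate
            g = [math.tanh(a) for a in matvec(self.W["g"], z)]  # candidate cell
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


def conv1d_relu(signal, kernel):
    """Valid 1-D convolution (cross-correlation) followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(kernel[j] * signal[t + j] for j in range(k)))
            for t in range(len(signal) - k + 1)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MicroVideoClassifier:
    """Hypothetical model: three parallel LSTMs, then a conv layer, then softmax."""

    def __init__(self, dims, hidden_dim, num_filters, kernel_size,
                 num_classes, seed=0):
        rng = random.Random(seed)
        self.lstms = {name: LSTM(d, hidden_dim, rng) for name, d in dims.items()}
        self.kernels = rand_matrix(num_filters, kernel_size, rng)
        self.W_out = rand_matrix(num_classes, num_filters, rng)

    def forward(self, sequences):
        # Fuse the modalities by concatenating each LSTM's final hidden state.
        fused = []
        for name, lstm in self.lstms.items():
            fused.extend(lstm.forward(sequences[name]))
        # ReLU plus global max pooling keeps one non-negative value per filter.
        concept = [max(conv1d_relu(fused, k)) for k in self.kernels]
        return softmax(matvec(self.W_out, concept))


# Toy input: 7 time steps per modality with made-up feature sizes.
dims = {"visual": 8, "acoustic": 6, "textual": 4}
model = MicroVideoClassifier(dims, hidden_dim=5, num_filters=3,
                             kernel_size=4, num_classes=4)
data_rng = random.Random(1)
sequences = {name: [[data_rng.random() for _ in range(d)] for _ in range(7)]
             for name, d in dims.items()}
probs = model.forward(sequences)
print(probs)
```

Because the weights are random, the printed category probabilities carry no meaning. The sketch only shows how sequential modeling per modality and a convolutional concept layer might be chained end to end.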

Index Terms

Towards Micro-video Understanding by Joint Sequential-Sparse Modeling
1. Information systems
  1. Information systems applications
    1. Multimedia information systems

Published In

MM '17: Proceedings of the 25th ACM international conference on Multimedia

October 2017

2028 pages

ISBN:9781450349062

DOI:10.1145/3123266

General Chairs: Qiong Liu (FXPAL, USA), Rainer Lienhart (Universität Augsburg, Germany), Haohong Wang (TCL America, USA)

Program Chairs: Sheng-Wei "Kuan-Ta" Chen (Academia Sinica, Taiwan), Susanne Boll (University of Oldenburg, Germany), Phoebe Chen (La Trobe University, Australia), Gerald Friedland (Lawrence Livermore National Lab, USA), Jia Li (Google, USA), Shuicheng Yan (Qihoo 360, China)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

One Thousand Talents Plan
National Basic Research grant (973)
Joint NSFC-ISF Research Program

Conference

MM '17

Sponsor:

SIGMM

MM '17: ACM Multimedia Conference

October 23 - 27, 2017

Mountain View, California, USA

Acceptance Rates

MM '17 Paper Acceptance Rate 189 of 684 submissions, 28%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
