Abstract
Automatic movie trailer genre classification, which can support multimedia search and personalized movie recommendation, is a challenging task because trailers contain diverse content and high-level sequential semantic concepts tied to the movie storyline. Traditional methods generally extract low-level features or consider only the local sequential dependencies among trailer frames, ignoring the global high-level sequential semantic concepts. In this paper, we propose a novel Attention-based Spatio-temporal Sequential Framework (ASTS) for movie trailer genre classification. The framework consists of two modules: a spatio-temporal descriptive module and an attention-based sequential module. The spatio-temporal descriptive module uses convolutional neural networks to extract spatio-temporal features from key trailer frames, capturing local spatio-temporal semantics. The attention-based sequential module processes the resulting sequence of spatio-temporal feature representations to capture the global high-level sequential semantic concepts within the movie storyline. We crawl 14,415 labeled movie trailers from YouTube and integrate them into the public dataset MovieLens. Experimental results show that the proposed framework outperforms state-of-the-art methods.
Notes
k is set to 30, which is larger than the number of clips in most movie trailers. For trailers with more than 30 clips, we randomly select 30 of them.
Many strategies and methods can be used to extract representative frames. In this paper, we use an interval sampling strategy and leave the exploration of other sampling methods for future work.
The crawled dataset is publicly available at https://github.com/Marinyyt/MovieTrailer-14k
We experimented with different settings of these parameters. The results show that varying Dh and Da has little effect on the performance of our method.
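The clip cap (k = 30) and the interval sampling strategy mentioned in the notes above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `select_clips` and `interval_sample` are hypothetical, keeping the selected clips in their original temporal order is an assumption, and trailers with fewer than k clips are simply returned unchanged because the notes do not say how they are handled.

```python
import random

K = 30  # clip cap stated in the notes


def select_clips(clips, k=K, seed=None):
    """Return at most k clips.

    If there are more than k clips, k of them are chosen at random.
    Assumption: the chosen clips keep their original temporal order.
    """
    if len(clips) <= k:
        return list(clips)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(clips)), k))
    return [clips[i] for i in keep]


def interval_sample(frames, n):
    """Pick n frames at evenly spaced positions (interval sampling).

    Assumption: sequences with n or fewer frames are returned as they are.
    """
    if n <= 0 or not frames:
        return []
    if len(frames) <= n:
        return list(frames)
    step = len(frames) / n
    return [frames[int(i * step)] for i in range(n)]
```

For example, `interval_sample(list(range(100)), 10)` picks every tenth frame index starting from 0, and `select_clips` leaves a 10-clip trailer untouched while reducing a 50-clip trailer to 30 clips.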
Cite this article
Yu, Y., Lu, Z., Li, Y. et al. ASTS: attention based spatio-temporal sequential framework for movie trailer genre classification. Multimed Tools Appl 80, 9749–9764 (2021). https://doi.org/10.1007/s11042-020-10125-y
DOI: https://doi.org/10.1007/s11042-020-10125-y