Exploration of Speech and Music Information for Movie Genre Classification

Published: 13 June 2024

Abstract

    Movie genre prediction from trailers is mostly attempted in a multi-modal manner. However, the characteristics of movie trailer audio suggest that this modality alone might be highly effective for genre prediction. Trailer audio consists predominantly of speech and music signals, either in isolation or overlapping. This work hypothesizes that the genre labels of movie trailers might relate to the composition of their audio component. Accordingly, speech-music confidence sequences computed over the trailer audio are used as a feature, and two other features previously proposed for discriminating speech from music are also adopted for the task. This work proposes a time and channel Attention Convolutional Neural Network (ACNN) classifier for genre classification. The convolutional layers in ACNN learn spatial relationships in the input features, while the time and channel attention layers learn to focus on crucial timesteps and CNN kernel outputs, respectively. Experiments are performed on the Moviescope dataset, and two audio-based baseline methods are employed to benchmark this work. The proposed feature set with the ACNN classifier improves genre classification performance over the baselines. Moreover, decent generalization performance is obtained for genre prediction on movies with different cultural influences (EmoGDB).
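    The time and channel attention described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact architecture: the feature-map shape, the learnable score vectors `w_t`/`w_c`, and the pooling choices are hypothetical stand-ins for what the ACNN would learn.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        """Numerically stable softmax along the given axis."""
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    T, C = 50, 16                       # timesteps x CNN kernel (channel) outputs
    X = rng.standard_normal((T, C))     # stand-in for a CNN feature map

    # Time attention: score each timestep, normalize, then pool over time,
    # so crucial timesteps contribute more to the pooled feature.
    w_t = rng.standard_normal(C)        # hypothetical learned score vector
    alpha_t = softmax(X @ w_t)          # (T,) attention weights over timesteps
    time_pooled = alpha_t @ X           # (C,) time-attended feature

    # Channel attention: score each channel from its temporal profile and
    # reweight channels, emphasizing informative kernel outputs.
    w_c = rng.standard_normal(T)        # hypothetical learned score vector
    alpha_c = softmax(X.T @ w_c)        # (C,) attention weights over channels
    channel_weighted = X * alpha_c      # (T, C) channel-reweighted feature map

    print(time_pooled.shape, channel_weighted.shape)  # (16,) (50, 16)
    ```

    In the actual classifier these attention weights would be produced by trained layers rather than fixed random vectors; the sketch only shows how attention over the two axes (time and channel) factors into separate softmax-normalized weightings.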


    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 8
    August 2024, 698 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3618074

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 June 2024
    Online AM: 07 May 2024
    Accepted: 30 April 2024
    Revised: 16 December 2023
    Received: 11 June 2023
    Published in TOMM Volume 20, Issue 8


    Author Tags

    1. Movie trailer genre classification
    2. speech-music classification
    3. spectral peak tracking
    4. attention

    Qualifiers

    • Research-article
