research-article

Knowledge-integrated Multi-modal Movie Turning Point Identification

Published: 22 January 2024

Abstract

The rapid development of artificial intelligence provides rich technologies and tools for the automated understanding of literary works. As comprehensive carriers of storylines, movies are natural multimodal data sources, and how to fully exploit that data remains an active research topic. The efficient representation of multi-source data also poses new challenges for information fusion. We therefore propose a knowledge-enhanced turning point identification (KTPi) method for multimodal scene recognition. First, a BiLSTM encodes the scene text and integrates contextual information into the scene representations, completing the text sequence modeling. Then, a graph structure models all scenes, which strengthens long-range semantic dependencies between scenes, and a graph convolutional network enhances the scene representations. Next, a self-supervised method is used to obtain the optimal number of neighboring nodes in the sparse graph. After that, actor and verb knowledge from the scene text is added to the multimodal data to enrich the diversity of scene feature expressions. Finally, a teacher-student network strategy is used to train the KTPi model. Experimental results show that KTPi outperforms baseline methods in scene role recognition tasks, and ablation experiments show that incorporating knowledge into the multimodal model improves its performance.
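The scene-graph step can be illustrated with a minimal sketch. It is not the KTPi implementation: the scene vectors are assumed to be given (in the paper they come from a BiLSTM encoder), the neighbor count `k` is fixed by hand rather than learned by the self-supervised method, and the helper names `knn_adjacency` and `gcn_layer` are hypothetical. The sketch links each scene to its `k` most similar scenes and applies one standard graph convolution step, ReLU(D^-1/2 A D^-1/2 X W).

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def knn_adjacency(scenes, k):
    """Symmetric 0/1 adjacency with self-loops: each scene is linked to its
    k most similar scenes. k is a fixed assumption here, not learned."""
    n = len(scenes)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(scenes[i], scenes[j]),
            reverse=True,
        )
        for j in others[:k]:
            adj[i][j] = adj[j][i] = 1.0
    return adj


def gcn_layer(adj, feats, weight):
    """One graph convolution step: ReLU(D^-1/2 A D^-1/2 X W)."""
    n = len(adj)
    deg = [sum(row) for row in adj]  # never zero because of the self-loops
    norm = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, x) for x in row] for row in out]


# Toy scene vectors (illustrative values only, not data from the paper).
scenes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned weight matrix

adj = knn_adjacency(scenes, k=1)
out = gcn_layer(adj, scenes, identity)
```

With `k=1` the toy graph splits into two pairs of mutually similar scenes, and each updated scene vector becomes the average of its pair. In a trained model the weight matrix would be learned and the neighbor count would be selected rather than fixed.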



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 20, Issue 5
May 2024
650 pages
EISSN: 1551-6865
DOI: 10.1145/3613634
  • Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 January 2024
Online AM: 23 December 2023
Accepted: 17 December 2023
Revised: 27 October 2023
Received: 16 May 2023
Published in TOMM Volume 20, Issue 5

Author Tags

  1. Knowledge enhance
  2. multimodal representation
  3. text tagging

Qualifiers

  • Research-article

Funding Sources

  • Guangdong Provincial Key Laboratory of Cyber-Physical Systems
  • National Natural Science Foundation of China
  • Shenzhen Foundational Research Funding
  • Major Key Project of PCL
