research-article

Knowledge-integrated Multi-modal Movie Turning Point Identification

Authors:

Lianglun Cheng,

Zhuowei Wang

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 5

Article No.: 138, Pages 1 - 19

https://doi.org/10.1145/3638557

Published: 22 January 2024

Abstract

The rapid development of artificial intelligence provides a rich set of technologies and tools for the automated understanding of literary works. As comprehensive carriers of storylines, movies are natural multimodal data sources that offer a rich data foundation, and how to fully exploit these data remains an active research topic. In addition, the efficient representation of multi-source data poses new challenges for information fusion. We therefore propose a knowledge-enhanced turning point identification (KTPi) method for multimodal scene recognition. First, a BiLSTM encodes the scene text and integrates contextual information into the scene representations, completing the text sequence modeling. Then, a graph structure is used to model all scenes, which strengthens long-range semantic dependencies between scenes, and the scene representations are enhanced with a graph convolutional network. Next, a self-supervised method is used to obtain the optimal number of neighboring nodes in the sparse graph. After that, actor and verb knowledge from the scene text is added to the multimodal data to enrich the diversity of scene feature representations. Finally, a teacher-student network strategy is used to train the KTPi model. Experimental results show that KTPi outperforms baseline methods on scene role recognition tasks, and ablation experiments show that incorporating knowledge into a multimodal model can improve its performance.
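
The graph step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scene representations are already available as plain vectors (the paper obtains them from a BiLSTM), it uses cosine similarity to pick neighbors, and it fixes the number of neighbors k by hand, whereas the paper obtains that number with a self-supervised method. The function names topk_adjacency and gcn_layer are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def topk_adjacency(scenes: Matrix, k: int) -> Matrix:
    """Build a sparse, symmetric scene graph with self-loops.

    Each scene is linked to its k most similar scenes.
    Assumption: k is a fixed hyperparameter here, not a learned value.
    """
    n = len(scenes)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = [(cosine(scenes[i], scenes[j]), j) for j in range(n) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))  # most similar first
        for _, j in sims[:k]:
            adj[i][j] = 1.0
            adj[j][i] = 1.0  # keep the graph undirected
    for i in range(n):
        adj[i][i] = 1.0  # self-loop so a scene keeps its own features
    return adj


def gcn_layer(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One graph-convolution step: ReLU(D^-1/2 A D^-1/2 X W)."""
    n = len(adj)
    dim_in = len(feats[0])
    dim_out = len(weight[0])
    deg = [sum(row) for row in adj]
    out = []
    for i in range(n):
        # Aggregate neighbor features with symmetric degree normalization.
        agg = [0.0] * dim_in
        for j in range(n):
            if adj[i][j] != 0.0:
                coef = adj[i][j] / math.sqrt(deg[i] * deg[j])
                for d in range(dim_in):
                    agg[d] += coef * feats[j][d]
        # Linear projection followed by ReLU.
        out.append([
            max(0.0, sum(agg[d] * weight[d][o] for d in range(dim_in)))
            for o in range(dim_out)
        ])
    return out


if __name__ == "__main__":
    # Toy scene vectors, invented for illustration only.
    toy_scenes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    graph = topk_adjacency(toy_scenes, k=1)
    identity = [[1.0, 0.0], [0.0, 1.0]]
    for row in gcn_layer(graph, toy_scenes, identity):
        print(row)
```

With a small k, each scene exchanges information only with its closest neighbors, which is the sense in which the graph is sparse; the normalized aggregation then smooths each scene representation toward those neighbors.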

Index Terms

Knowledge-integrated Multi-modal Movie Turning Point Identification
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
2. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 20, Issue 5

May 2024

650 pages

EISSN:1551-6865

DOI:10.1145/3613634

Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 January 2024

Online AM: 23 December 2023

Accepted: 17 December 2023

Revised: 27 October 2023

Received: 16 May 2023

Published in TOMM Volume 20, Issue 5

Funding Sources

Guangdong Provincial Key Laboratory of Cyber-Physical Systems
National Natural Science Foundation of China
Shenzhen Foundational Research Funding
Major Key Project of PCL

Bibliometrics

Total Citations: 4
Total Downloads: 158
Downloads (Last 12 months): 158
Downloads (Last 6 weeks): 15

Reflects downloads up to 02 Sep 2024

Cited By

Fu L, Liu Y, Zhang Y, Li M. (2024). Network Information Security Monitoring Under Artificial Intelligence Environment. International Journal of Information Security and Privacy 18:1 (1-25). DOI: 10.4018/IJISP.345038. Online publication date: 21-Jun-2024.
https://dl.acm.org/doi/10.4018/IJISP.345038
Gao H, Su Y, Wang F, Li H. (2024). Heterogeneous Fusion and Integrity Learning Network for RGB-D Salient Object Detection. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7 (1-24). DOI: 10.1145/3656476. Online publication date: 15-May-2024.
https://dl.acm.org/doi/10.1145/3656476
Wang S, Yan Y, Han F, Tian Y, Ding Y, Yang P, Li X, Okoshi T, Ko J, LiKamWa R. (2024). MultiRider: Enabling Multi-Tag Concurrent OFDM Backscatter by Taming In-band Interference. Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services (292-303). DOI: 10.1145/3643832.3661862. Online publication date: 3-Jun-2024.
https://dl.acm.org/doi/10.1145/3643832.3661862
Xue M, Xu Z, Qiao S, Zheng J, Li T, Wang Y, Peng D. (2024). Driver intention prediction based on multi-dimensional cross-modality information interaction. Multimedia Systems 30:2. DOI: 10.1007/s00530-024-01282-3. Online publication date: 15-Mar-2024.
https://dl.acm.org/doi/10.1007/s00530-024-01282-3
