Social Context-aware Person Search in Videos via Multi-modal Cues

Published: 22 November 2021

    Abstract

    Person search has long been treated as a crucial and challenging task that supports deeper insight into personalized summarization and personality discovery. Traditional methods, e.g., person re-identification and face recognition techniques, which profile video characters based on visual information, are often limited to relatively fixed poses or small viewpoint variations and struggle in more realistic scenes with high motion complexity (e.g., movies). At the same time, long videos such as movies often have logical story lines and are composed of continuously developing plots. In such videos, different persons usually meet on specific occasions, in which informative social cues are exhibited. We notice that these social cues can semantically profile their personalities and benefit the person search task in two ways. First, persons with certain relationships usually co-occur within short intervals; if one of them is easier to identify, the social relation cues extracted from their co-occurrences can help identify the harder ones. Second, social relations can reveal the association between certain scenes and characters (e.g., a classmate relationship may exist only among students), which narrows the candidates down to persons with a specific relationship. In this way, high-level social relation cues can improve the effectiveness of person search. Along this line, in this article, we propose a social context-aware framework that fuses visual and social contexts to profile persons from more semantic perspectives and to better handle the person search task in complex scenarios. Specifically, we first segment videos into several independent scene units and abstract the social contexts within these scene units. Then, we construct inner-personal links through a graph formulation operation for each scene unit, in which both visual cues and relation cues are considered. Finally, we perform relation-aware label propagation to identify characters’ occurrences, combining low-level semantic cues (i.e., visual cues) and high-level semantic cues (i.e., relation cues) to further enhance accuracy. Experiments on real-world datasets validate that our solution outperforms several competitive baselines.
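
The abstract gives no formulas, so the following is only a minimal sketch of the general idea of propagating identity labels over a graph whose edges mix visual and relation cues. It is not the authors' implementation: it uses generic clamped label propagation, and the function names (`fuse_affinity`, `propagate_labels`, `predict`), the mixing weight `alpha`, and the toy similarity matrices are all hypothetical choices made for illustration.

```python
def fuse_affinity(visual, relation, alpha=0.5):
    """Blend visual and relation similarities into one edge-weight matrix.

    `alpha` is a hypothetical mixing weight; self-loops are set to zero.
    """
    n = len(visual)
    return [
        [0.0 if i == j else alpha * visual[i][j] + (1.0 - alpha) * relation[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def propagate_labels(weights, seeds, num_classes, iterations=50):
    """Clamped label propagation; `seeds` maps node index -> known class id."""
    n = len(weights)
    # Seeds start one-hot, every other node starts with a uniform belief.
    scores = [[1.0 / num_classes] * num_classes for _ in range(n)]
    for node, cls in seeds.items():
        scores[node] = [1.0 if c == cls else 0.0 for c in range(num_classes)]

    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(weights[i])
            if i in seeds or total == 0.0:
                # Known identities stay clamped; isolated nodes keep their belief.
                updated.append(scores[i])
                continue
            # Weighted average of the neighbours' current class scores.
            updated.append([
                sum(weights[i][j] * scores[j][c] for j in range(n)) / total
                for c in range(num_classes)
            ])
        scores = updated
    return scores


def predict(scores):
    """Pick the highest-scoring class for every node."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy scene unit with four person instances (made-up numbers, not from the paper).
# Node 0 is a confirmed instance of character 0, node 3 of character 1.
VISUAL = [
    [1.0, 0.9, 0.5, 0.1],
    [0.9, 1.0, 0.4, 0.1],
    [0.5, 0.4, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
]
RELATION = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]
SEEDS = {0: 0, 3: 1}

fused_scores = propagate_labels(fuse_affinity(VISUAL, RELATION, alpha=0.5), SEEDS, 2)
visual_scores = propagate_labels(fuse_affinity(VISUAL, RELATION, alpha=1.0), SEEDS, 2)
fused_labels = predict(fused_scores)
visual_labels = predict(visual_scores)
print("visual + relation:", fused_labels)
print("visual only:      ", visual_labels)
```

In this toy graph, node 2 is visually ambiguous between the two seeds. With `alpha=1.0` (visual cues only) it is assigned to character 0, whereas mixing in the hypothetical relation scores moves it to character 1, which mirrors the abstract's argument that relation cues help with instances that are hard to identify visually.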

        Published In

        ACM Transactions on Information Systems, Volume 40, Issue 3
        July 2022
        650 pages
        ISSN:1046-8188
        EISSN:1558-2868
        DOI:10.1145/3498357

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 22 November 2021
        Accepted: 01 August 2021
        Revised: 01 June 2021
        Received: 01 November 2020
        Published in TOIS Volume 40, Issue 3

        Author Tags

        1. Person search
        2. graph modeling
        3. user profile
        4. label propagation
        5. social relation
        6. neural network

        Qualifiers

        • Research-article
        • Refereed

        Funding Sources

        • National Key Research and Development Program of China
        • National Natural Science Foundation of China

        Article Metrics

        • Downloads (Last 12 months): 117
        • Downloads (Last 6 weeks): 3

        Cited By

        • (2024) InteractNet: Social Interaction Recognition for Semantic-rich Videos. ACM Transactions on Multimedia Computing, Communications, and Applications 20:8 (1-21). DOI: 10.1145/3663668. Online publication date: 12-Jun-2024
        • (2024) Document-level Relation Extraction with Progressive Self-distillation. ACM Transactions on Information Systems 42:6 (1-34). DOI: 10.1145/3656168. Online publication date: 25-Jun-2024
        • (2024) Person search over security video surveillance systems using deep learning methods. Image and Vision Computing 143:C. DOI: 10.1016/j.imavis.2024.104930. Online publication date: 2-Jul-2024
        • (2023) Semantic Collaborative Learning for Cross-Modal Moment Localization. ACM Transactions on Information Systems 42:2 (1-26). DOI: 10.1145/3620669. Online publication date: 7-Nov-2023
        • (2023) Learning to Relate to Previous Turns in Conversational Search. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (1722-1732). DOI: 10.1145/3580305.3599411. Online publication date: 6-Aug-2023
        • (2023) An Efficient and Robust Semantic Hashing Framework for Similar Text Search. ACM Transactions on Information Systems 41:4 (1-31). DOI: 10.1145/3570725. Online publication date: 22-Mar-2023
        • (2023) Interaction-aware Drug Package Recommendation via Policy Gradient. ACM Transactions on Information Systems 41:1 (1-32). DOI: 10.1145/3511020. Online publication date: 10-Jan-2023
        • (2023) Social Context-aware GCN for Video Character Search via Scene-prior Enhancement. 2023 IEEE International Conference on Multimedia and Expo (ICME) (2609-2614). DOI: 10.1109/ICME55011.2023.00444. Online publication date: Jul-2023
        • (2022) Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (176-186). DOI: 10.1145/3477495.3531961. Online publication date: 6-Jul-2022
        • (2022) Residual objectness for imbalance reduction. Pattern Recognition 130:C. DOI: 10.1016/j.patcog.2022.108781. Online publication date: 1-Oct-2022
