Discriminative Action Snippet Propagation Network for Weakly Supervised Temporal Action Localization

Published: 08 March 2024
Abstract

Weakly supervised temporal action localization (WTAL) aims to classify and localize actions in untrimmed videos with only video-level labels. Recent studies have attempted to obtain more accurate temporal boundaries by exploiting latent action instances in ambiguous snippets or by propagating representative action features. However, empirically handcrafted ambiguous-snippet extraction and the imprecise alignment of representative snippet propagation make it difficult for these methods to model action completeness. In this article, we propose a Discriminative Action Snippet Propagation Network (DASP-Net) that accurately discovers ambiguous snippets in videos and propagates discriminative instance-level features throughout the video to improve action completeness. Specifically, we introduce a novel discriminative feature propagation module that captures global contextual attention and propagates the action concept across the whole video by perceiving discriminative action snippets that carry instance information from the same video. Simultaneously, we incorporate denoised pseudo-labels as supervision, correcting controversial predictions based on the feature-space distribution during training and thereby alleviating false detections caused by noisy background features. Furthermore, we design an ambiguous feature mining module that maximizes the feature affinity of action and background within ambiguous snippets to generate more accurate latent action and background snippets, and that learns more precise action instance boundaries through contrastive learning between action and background snippets. Extensive experiments show that DASP-Net achieves state-of-the-art results on the THUMOS14 and ActivityNet1.2 datasets.
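
    The abstract describes three mechanisms: attention-based propagation of discriminative snippet features, denoised pseudo-label supervision, and contrastive learning between action and background snippets. Below is a minimal PyTorch sketch of the first and third ideas only; it is not the authors' implementation, and the function names, top-k snippet selection, residual fusion weights, and prototype-based contrastive loss are illustrative assumptions.

    ```python
    # Minimal sketch (not the authors' code) of two ideas named in the abstract,
    # assuming per-video snippet features of shape (T, D):
    #   (a) propagating features of a few high-confidence "discriminative" snippets
    #       across the whole video with attention, and
    #   (b) a snippet-level contrastive loss separating action from background.
    # All names and hyperparameters below are illustrative assumptions.
    import torch
    import torch.nn.functional as F


    def propagate_discriminative_snippets(feats, scores, top_k=8, temperature=0.1):
        """Blend each snippet with an attention-weighted mix of the top-k
        highest-scoring (most discriminative) snippets from the same video.

        feats:  (T, D) snippet features
        scores: (T,)  per-snippet actionness / confidence
        """
        k = min(top_k, feats.size(0))
        top_idx = scores.topk(k).indices           # indices of discriminative snippets
        keys = feats[top_idx]                      # (k, D)
        attn = (feats @ keys.t()) / temperature    # (T, k) similarity to key snippets
        attn = attn.softmax(dim=-1)
        propagated = attn @ keys                   # (T, D) propagated action concept
        return 0.5 * feats + 0.5 * propagated      # residual-style fusion (assumed weights)


    def action_background_contrast(feats, action_mask, temperature=0.07):
        """InfoNCE-style loss: each action snippet should be closer to the mean
        action feature than to the mean background feature.

        action_mask: (T,) boolean, True for (pseudo-labeled) action snippets
        """
        feats = F.normalize(feats, dim=-1)
        act, bkg = feats[action_mask], feats[~action_mask]
        if act.numel() == 0 or bkg.numel() == 0:   # degenerate split: skip the loss
            return feats.new_zeros(())
        act_proto = F.normalize(act.mean(0, keepdim=True), dim=-1)   # (1, D)
        bkg_proto = F.normalize(bkg.mean(0, keepdim=True), dim=-1)   # (1, D)
        pos = (act @ act_proto.t()) / temperature  # (Na, 1) positive logits
        neg = (act @ bkg_proto.t()) / temperature  # (Na, 1) negative logits
        logits = torch.cat([pos, neg], dim=1)      # positives in column 0
        labels = torch.zeros(act.size(0), dtype=torch.long, device=feats.device)
        return F.cross_entropy(logits, labels)


    if __name__ == "__main__":
        T, D = 100, 2048                           # e.g., I3D-style snippet features
        feats = torch.randn(T, D)
        scores = torch.rand(T)
        fused = propagate_discriminative_snippets(feats, scores)
        loss = action_background_contrast(fused, action_mask=scores > 0.5)
        print(fused.shape, loss.item())
    ```

    The actual DASP-Net modules are defined in the paper itself; this sketch only illustrates the general pattern of propagating high-confidence snippet features and contrasting action against background at the snippet level.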


      Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 6
June 2024
715 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3613638
Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 08 March 2024
      Online AM: 31 January 2024
      Accepted: 27 January 2024
      Revised: 27 January 2024
      Received: 28 July 2023
      Published in TOMM Volume 20, Issue 6


      Author Tags

1. Temporal action localization
2. Weakly supervised
3. Contrastive learning
4. Cross attention
5. Pseudo labels
6. Feature propagation

      Qualifiers

      • Research-article

      Funding Sources

      • Natural Science Foundation of China
      • Zhejiang Provincial Natural Science Foundation of China
      • Ten Thousand Talent Program of Zhejiang Province
