DOI: 10.1145/3664647.3681197
research-article

Revisiting Unsupervised Temporal Action Localization: The Primacy of High-Quality Actionness and Pseudolabels

Published: 28 October 2024

Abstract

Recently, temporal action localization (TAL) methods, especially weakly-supervised and unsupervised ones, have become a hot research topic. Existing unsupervised methods follow an iterative "clustering and training" strategy with diverse model designs during the training stage, but they often overlook maintaining consistency between these stages, which is crucial: more accurate clustering results can reduce the noise in pseudolabels and thus enhance model training, while more robust training can in turn enrich the feature representation used for clustering. We identify two critical challenges in unsupervised scenarios: (1) What features should the model generate for clustering? (2) Which pseudolabeled instances from clustering should be chosen for model training? After extensive exploration, we propose a novel yet simple framework called Consistency-Oriented Progressive high actionness Learning to address these issues. For feature generation, our framework adopts a High Actionness snippet Selection (HAS) module to generate more discriminative global video features for clustering from the enhanced actionness features obtained from a designed Inner-Outer Consistency Network (IOCNet). For pseudolabel selection, we introduce a Progressive Learning With Representative Instances (PLRI) strategy to identify the most reliable and informative instances within each cluster for model training. These three modules, HAS, IOCNet, and PLRI, synergistically improve consistency in model training and clustering performance. Extensive experiments on the THUMOS'14 and ActivityNet v1.2 datasets under both unsupervised and weakly-supervised settings demonstrate that our framework achieves state-of-the-art results.
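
The abstract gives no implementation details, so the following is only a rough sketch of the general "clustering and training" loop it describes, not the authors' method. Everything specific here is an assumption made for illustration: the function names, the `top_ratio` and `keep_ratio` parameters, the use of plain k-means as the clustering step, and the use of distance to the cluster center as a stand-in for "representative" instances. IOCNet and the actual training step are not modeled at all.

```python
import numpy as np


def select_high_actionness(features, actionness, top_ratio=0.25):
    """Average the highest-actionness snippets into one video-level feature.

    features: (T, D) snippet features; actionness: (T,) scores.
    top_ratio is a hypothetical hyperparameter, not taken from the paper.
    """
    k = max(1, int(round(top_ratio * len(actionness))))
    top_idx = np.argsort(actionness)[-k:]  # indices of the k highest scores
    return features[top_idx].mean(axis=0)


def kmeans(x, n_clusters, n_iters=20, seed=0):
    """Plain k-means, used only as a stand-in clustering step (assumption)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):  # leave an empty cluster's center unchanged
                centers[c] = x[labels == c].mean(axis=0)
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return labels, dists[np.arange(len(x)), labels]


def select_representative(labels, dist_to_center, keep_ratio):
    """Keep the fraction of each cluster that lies closest to its center."""
    keep = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        k = max(1, int(np.ceil(keep_ratio * len(members))))
        keep[members[np.argsort(dist_to_center[members])[:k]]] = True
    return keep


def run_rounds(video_feats, keep_schedule=(0.3, 0.6, 1.0), n_clusters=3):
    """Cluster, then admit a growing share of pseudolabeled videos each round."""
    kept_counts = []
    for keep_ratio in keep_schedule:
        labels, dist = kmeans(video_feats, n_clusters)
        keep = select_representative(labels, dist, keep_ratio)
        # A real pipeline would train the localization model on the kept
        # videos here and recompute actionness and features before the
        # next clustering round; that step is omitted in this sketch.
        kept_counts.append(int(keep.sum()))
    return kept_counts
```

With synthetic inputs, one could build video-level features via `np.stack([select_high_actionness(f, a) for f, a in videos])` and pass the result to `run_rounds`. Because the training step is omitted, the clustering does not change between rounds in this sketch; only the share of admitted instances grows.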


    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. consistency constraint
    2. multimodal understanding
    3. progressive learning
    4. unsupervised temporal action localization

    Qualifiers

    • Research-article

    Funding Sources

    • the NSF of Shandong Province
    • the Alibaba Group through Alibaba Innovative Research Program
    • the Key R&D Program of Shandong Province, China (Major Scientific and Technological Innovation Projects)
    • the Key Laboratory of Computing Power Network and Information Security, Ministry of Education under Grant
    • the National Natural Science Foundation (NSF) of China

    Conference

    MM '24
    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
