Transform-Equivariant Consistency Learning for Temporal Sentence Grounding

Published: 11 January 2024
    Abstract

    This paper addresses temporal sentence grounding (TSG). Although existing methods have achieved decent results on this task, they not only rely heavily on abundant video-query paired data for training, but also easily fall into the dataset distribution bias. To alleviate these limitations, we introduce a novel Equivariant Consistency Regulation Learning (ECRL) framework that learns more discriminative query-related frame-wise representations for each video in a self-supervised manner. Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently under various video-level transformations. Concretely, we first design a series of spatio-temporal augmentations on both foreground and background video segments to generate a set of synthetic video samples. In particular, we devise a self-refine module to enhance the completeness and smoothness of the augmented video. Then, we present a novel self-supervised consistency loss (SSCL), applied to the original and augmented videos, that captures their invariant query-related semantics by minimizing the KL-divergence between the sequence similarity of the two videos and a prior Gaussian distribution of timestamp distance. Finally, a shared grounding head is introduced to predict the transform-equivariant query-guided segment boundaries for both the original and augmented videos. Extensive experiments on three challenging datasets (ActivityNet, TACoS, and Charades-STA) demonstrate both the effectiveness and efficiency of our proposed ECRL framework.
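
    As a rough illustration of the SSCL objective described above, the snippet below is a minimal PyTorch sketch, not the authors' implementation: it assumes frame-wise features of equal length for the original and augmented videos, uses cosine similarity as the sequence similarity, and the function name `sscl_loss` and the bandwidth `sigma` of the Gaussian prior over timestamp distance are illustrative assumptions.

```python
# Hypothetical sketch of the self-supervised consistency loss (SSCL): align the
# per-frame similarity distribution between the original and augmented videos
# with a Gaussian prior over timestamp distance. Not the authors' released code.
import torch
import torch.nn.functional as F

def sscl_loss(orig_feats: torch.Tensor, aug_feats: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """orig_feats, aug_feats: (T, D) frame-wise features of the two videos (assumed equal length T)."""
    T = orig_feats.size(0)

    # Sequence similarity: cosine similarity between every original/augmented frame
    # pair, normalized into a distribution over augmented timestamps per original frame.
    sim = F.normalize(orig_feats, dim=-1) @ F.normalize(aug_feats, dim=-1).t()  # (T, T)
    log_p = F.log_softmax(sim, dim=-1)

    # Prior Gaussian distribution of timestamp distance: temporally close frames
    # are expected to be semantically similar.
    idx = torch.arange(T, dtype=torch.float32)
    dist_sq = (idx.unsqueeze(1) - idx.unsqueeze(0)) ** 2                         # (T, T)
    prior = F.softmax(-dist_sq / (2 * sigma ** 2), dim=-1)

    # KL-divergence between the prior and the predicted similarity, averaged over frames.
    return F.kl_div(log_p, prior, reduction="batchmean")

# Usage sketch with random features standing in for encoder outputs.
if __name__ == "__main__":
    v, v_aug = torch.randn(64, 256), torch.randn(64, 256)
    print(sscl_loss(v, v_aug).item())
```

    In the full framework, this consistency term would be optimized jointly with the boundary losses of the shared grounding head, so that the predicted segment boundaries remain equivariant under the video-level transformations.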

    Cited By

    • (2024) Parameterized multi-perspective graph learning network for temporal sentence grounding in videos. Applied Intelligence. DOI: 10.1007/s10489-024-05618-4. Online publication date: 24 June 2024.


    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 4
    April 2024
    676 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3613617
    • Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 January 2024
    Online AM: 27 November 2023
    Accepted: 23 November 2023
    Revised: 05 October 2023
    Received: 06 May 2023
    Published in TOMM Volume 20, Issue 4


    Author Tags

    1. Temporal sentence grounding
    2. transformation
    3. equivariant
    4. consistency learning

    Qualifiers

    • Research-article

    Article Metrics

    • Downloads (last 12 months): 178
    • Downloads (last 6 weeks): 16
    Reflects downloads up to 27 Jul 2024
