Progressive Localization Networks for Language-Based Moment Localization

Published: 06 February 2023
Abstract

    This article targets the task of language-based video moment localization. The language-based setting allows for an open set of target activities, resulting in large variation in the temporal lengths of video moments. Most existing methods first sample a sufficient number of candidate moments of various temporal lengths and then match them against the given query to determine the target moment. However, candidate moments generated at a fixed temporal granularity may be suboptimal for handling this large variation in moment lengths. To this end, we propose a novel multi-stage Progressive Localization Network (PLN) that progressively localizes the target moment in a coarse-to-fine manner. Specifically, each stage of PLN has a localization branch and focuses on candidate moments generated at a specific temporal granularity, and the granularity differs across stages. Moreover, we devise a conditional feature manipulation module and an upsampling connection to bridge the multiple localization branches, so that later stages can absorb previously learned information and thus produce more fine-grained localization. Extensive experiments on three public datasets demonstrate the effectiveness of the proposed PLN for language-based moment localization, especially for localizing short moments in long videos.
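
    To make the multi-stage, coarse-to-fine design concrete, the following is a minimal PyTorch sketch of the idea described above. It is an illustrative assumption, not the authors' implementation: the stage count, feature dimensions, the FiLM-style form assumed for the conditional feature manipulation module, and the simple per-position score head are all hypothetical, and details such as how candidate moments are enumerated and matched against the query are omitted.

```python
# Minimal sketch of a progressive, coarse-to-fine localizer (NOT the
# authors' PLN implementation; all module forms and sizes are assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage(nn.Module):
    """One localization branch operating at a fixed temporal granularity."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # Assumed FiLM-style conditional feature manipulation: the query
        # embedding predicts a per-channel scale (gamma) and shift (beta).
        self.film = nn.Linear(dim, 2 * dim)
        self.score = nn.Conv1d(dim, 1, kernel_size=1)  # per-position score

    def forward(self, video_feat, query_emb):
        # video_feat: (B, D, T) clip features at this stage's granularity
        # query_emb:  (B, D)    sentence-level query embedding
        h = F.relu(self.conv(video_feat))
        gamma, beta = self.film(query_emb).chunk(2, dim=-1)
        h = gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)  # condition on query
        return h, self.score(h).squeeze(1)

class ProgressiveLocalizer(nn.Module):
    def __init__(self, dim=256, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(Stage(dim) for _ in range(num_stages))

    def forward(self, video_feat, query_emb):
        # Stage i sees features pooled 2^(num_stages-1-i) times coarser, so
        # the temporal granularity refines from coarse to fine across stages.
        prev, scores = None, []
        for i, stage in enumerate(self.stages):
            scale = 2 ** (len(self.stages) - 1 - i)
            x = F.avg_pool1d(video_feat, scale) if scale > 1 else video_feat
            if prev is not None:
                # Upsampling connection: the coarser stage's features are
                # interpolated up and fed into the finer-granularity branch.
                x = x + F.interpolate(prev, size=x.shape[-1], mode="linear",
                                      align_corners=False)
            prev, s = stage(x, query_emb)
            scores.append(s)
        return scores  # one score map per stage, coarse to fine

# Usage sketch with made-up shapes: 2 videos, 64 clips, 256-d features.
model = ProgressiveLocalizer()
video = torch.randn(2, 256, 64)
query = torch.randn(2, 256)   # e.g., a pooled sentence embedding
stage_scores = model(video, query)  # final stage yields the finest scores
```

    The points the sketch tries to capture are that each branch works at its own temporal granularity and that later, finer branches reuse what earlier, coarser branches have learned, which is what would let the final stage localize short moments in long videos more precisely.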

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2 (March 2023), 540 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3572860
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 06 February 2023
      Online AM: 11 June 2022
      Accepted: 31 May 2022
      Revised: 30 April 2022
      Received: 26 December 2021
      Published in TOMM Volume 19, Issue 2

      Author Tags

      1. Moment localization
      2. progressive learning
      3. coarse-to-fine manner
      4. multi-stage model

      Qualifiers

      • Research-article
      • Refereed

      Funding Sources

      • National Key R&D Program of China
      • NSFC
      • Public Welfare Technology Research Project of Zhejiang Province
      • Fundamental Research Funds for the Provincial Universities of Zhejiang
      • Open Projects Program of the National Laboratory of Pattern Recognition

      Cited By

      • Learning Compressed Artifact for JPEG Manipulation Localization Using Wide-Receptive-Field Network. ACM Transactions on Multimedia Computing, Communications, and Applications (2024). DOI: 10.1145/3678883
      • Transform-Equivariant Consistency Learning for Temporal Sentence Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 4 (2024), 1–19. DOI: 10.1145/3634749
      • Relational Network via Cascade CRF for Video Language Grounding. IEEE Transactions on Multimedia 26 (2024), 8297–8311. DOI: 10.1109/TMM.2023.3303712
      • Emotional Video Captioning With Vision-Based Emotion Interpretation Network. IEEE Transactions on Image Processing 33 (2024), 1122–1135. DOI: 10.1109/TIP.2024.3359045
      • Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding. In Proceedings of the 31st ACM International Conference on Multimedia (2023), 2973–2984. DOI: 10.1145/3581783.3612449
      • Weakly-supervised Video Scene Graph Generation via Unbiased Cross-modal Learning. In Proceedings of the 31st ACM International Conference on Multimedia (2023), 4574–4583. DOI: 10.1145/3581783.3612019
      • From Region to Patch: Attribute-Aware Foreground-Background Contrastive Learning for Fine-Grained Fashion Retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023), 1273–1282. DOI: 10.1145/3539618.3591690
      • Temporal Sentence Grounding in Videos: A Survey and Future Directions. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 8 (2023), 10443–10465. DOI: 10.1109/TPAMI.2023.3258628
      • Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2023), 13721–13731. DOI: 10.1109/ICCV51070.2023.01266
      • Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2023), 11268–11278. DOI: 10.1109/ICCV51070.2023.01038
