Video Retrieval with Similarity-Preserving Deep Temporal Hashing

Authors:

Meng Wang

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 15, Issue 4

Article No.: 109, Pages 1 - 16

https://doi.org/10.1145/3356316

Published: 16 December 2019

Abstract

Despite remarkable progress in recent years, Content-based Video Retrieval (CBVR) remains an appealing research topic because of growing search demands in the Internet era of big data. This article explores an efficient CBVR system that discriminatively hashes videos into short binary codes. Existing video hashing methods usually have two weaknesses: (1) Most works adopt either a separate-stage approach or a frame-pooling-based end-to-end architecture, so the spatial-temporal properties of videos cannot be fully explored or well preserved in the subsequent hashing step. (2) Discriminative learning based on pairwise or triplet constraints often suffers from slow convergence and poor local optima, mainly because each update uses only a limited number of samples. To alleviate these problems, we propose an end-to-end video retrieval framework called the Similarity-Preserving Deep Temporal Hashing (SPDTH) network. Specifically, we equip the model with stacked Gated Recurrent Units (GRUs) to capture the spatial-temporal properties of videos and to generate binary codes. This unifies video temporal modeling and learning to hash into one step, allowing maximum retention of information. We also introduce a deep metric learning objective called ℓ₂All_loss for network training, which preserves intra-class similarity and inter-class separability, while a quantization loss between the real-valued outputs and the binary codes is minimized. Extensive experiments on several challenging datasets demonstrate that SPDTH can consistently outperform state-of-the-art methods.
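
To make the hashing and retrieval step concrete, the following is a minimal sketch and not the authors' implementation. It assumes the network emits one real-valued vector per video, that binary codes are obtained with the sign function, and that the quantization loss is the mean squared gap between outputs and codes; the abstract does not give the exact formula, so that form is an assumption. All function names and sample values are hypothetical.

```python
from typing import List, Sequence


def binarize(outputs: Sequence[float]) -> List[int]:
    """Map real-valued network outputs to a {-1, +1} code with the sign function."""
    return [1 if value >= 0 else -1 for value in outputs]


def quantization_loss(outputs: Sequence[float]) -> float:
    """Mean squared gap between real-valued outputs and their binary codes.

    Assumed form for illustration; the abstract does not state the exact formula.
    """
    codes = binarize(outputs)
    return sum((o - c) ** 2 for o, c in zip(outputs, codes)) / len(outputs)


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Count the positions where two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def rank_by_hamming(query: Sequence[int], database: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(database)), key=lambda i: hamming_distance(query, database[i]))


# Hypothetical real-valued outputs for three database videos and one query.
database_outputs = [
    [0.9, -0.8, 0.7, -0.95],
    [-0.6, 0.4, 0.8, -0.9],
    [-0.9, 0.9, -0.7, 0.85],
]
query_outputs = [0.8, -0.9, 0.6, -0.7]

database_codes = [binarize(o) for o in database_outputs]
query_code = binarize(query_outputs)
ranking = rank_by_hamming(query_code, database_codes)
```

Here `ranking` lists the database videos from nearest to farthest in Hamming space. Outputs that already sit close to -1 or +1 incur a small quantization loss, which is why minimizing it during training reduces the information lost when the outputs are binarized.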

Index Terms

Video Retrieval with Similarity-Preserving Deep Temporal Hashing
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Video search

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 15, Issue 4

November 2019

322 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3376119

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2019

Accepted: 01 August 2019

Revised: 01 August 2019

Received: 01 September 2018

Published in TOMM Volume 15, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Natural Science Foundation of China
