Video Retrieval with Similarity-Preserving Deep Temporal Hashing

Published: 16 December 2019

Abstract

Despite remarkable progress in recent years, Content-based Video Retrieval (CBVR) remains an appealing research topic because of increasing search demands in the Internet era of big data. This article explores an efficient CBVR system that discriminatively hashes videos into short binary codes. Existing video hashing methods usually suffer from two weaknesses: (1) Most works adopt either separated stages or a frame-pooling-based end-to-end architecture, so the spatial-temporal properties of videos cannot be fully explored or preserved in the follow-up hashing step. (2) Discriminative learning based on pairwise or triplet constraints often suffers from slow convergence and poor local optima, mainly because each update uses a limited number of samples. To alleviate these problems, we propose an end-to-end video retrieval framework called the Similarity-Preserving Deep Temporal Hashing (SPDTH) network. Specifically, we equip the model with the ability to capture spatial-temporal properties of videos and to generate binary codes using stacked Gated Recurrent Units (GRUs). The model unifies video temporal modeling and learning to hash into one step to retain as much information as possible. We also introduce a deep metric learning objective called ℓ2All_loss for network training, which preserves intra-class similarity and inter-class separability, and we minimize a quantization loss between the real-valued outputs and the binary codes. Extensive experiments on several challenging datasets demonstrate that SPDTH consistently outperforms state-of-the-art methods.
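As a rough illustration of the hashing step the abstract describes, the sketch below binarizes real-valued network outputs, measures a quantization gap, and ranks database items by Hamming distance. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the sign-based binarization, the mean-squared form of the quantization loss, and the function names are hypothetical and are not taken from the SPDTH paper.

```python
from typing import List, Sequence


def binarize(outputs: Sequence[float]) -> List[int]:
    """Map real-valued outputs to a {-1, +1} code by sign (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in outputs]


def quantization_loss(outputs: Sequence[float]) -> float:
    """Mean squared gap between outputs and their binary code (assumed form)."""
    code = binarize(outputs)
    return sum((v - b) ** 2 for v, b in zip(outputs, code)) / len(outputs)


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions where two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def rank_by_hamming(query: Sequence[int],
                    database: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices ordered by Hamming distance to the query."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


# Toy example: 4-bit codes for one query video and three database videos.
query = binarize([0.9, -0.8, 0.7, -0.6])
database = [binarize(v) for v in (
    [-0.9, 0.8, -0.7, 0.6],
    [0.8, -0.9, 0.6, 0.5],
    [0.7, -0.7, 0.9, -0.9],
)]
ranking = rank_by_hamming(query, database)
```

In this toy setup the third database code matches the query exactly and the first differs in every bit, so the ranking places index 2 first and index 0 last. Outputs that already sit close to -1 or +1 give a small quantization gap, which is the motivation for minimizing such a term during training.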

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 4
November 2019
322 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3376119

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2019
Accepted: 01 August 2019
Revised: 01 August 2019
Received: 01 September 2018
Published in TOMM Volume 15, Issue 4

Author Tags

  1. Content-based video retrieval
  2. convolutional neural network
  3. recurrent neural network
  4. video hashing

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
