DOI: 10.1145/3503161.3547836
research-article

Unsupervised Video Hashing with Multi-granularity Contextualization and Multi-structure Preservation

Published: 10 October 2022

Abstract

Unsupervised video hashing aims to learn a compact binary vector that represents complex video content without manual annotations. Existing unsupervised hashing methods generally explore only part of the dependencies (e.g., long-range and short-range) and data structures present in visual content, which results in less discriminative hash codes. In this paper, we propose a Multi-granularity Contextualized and Multi-Structure preserved Hashing (MCMSH) method, which simultaneously explores multiple axial contexts for discriminative video representation generation and various structural information for unsupervised learning. Specifically, we design three self-gating modules that separately model three granularities of dependencies (i.e., long-, middle-, and short-range dependencies) and densely integrate them into MLP-Mixer for feature contextualization, leading to a novel model, MC-MLP. To facilitate unsupervised learning, we investigate three kinds of data structures, namely clusters, local neighborhood similarity structure, and inter/intra-class variations, and design a multi-objective task to train MC-MLP. These data structures are highly complementary in hash code learning. Extensive experiments on three video retrieval benchmark datasets demonstrate that MCMSH not only significantly boosts the performance of the backbone MLP-Mixer but also notably outperforms competing methods. Code is available at: https://github.com/haoyanbin918/MCMSH.
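
For readers new to hashing-based retrieval, the following minimal sketch shows how compact binary codes are typically used once they have been learned: real-valued features are thresholded into bits, and database items are ranked by Hamming distance to the query code. It is a generic illustration, not the MCMSH implementation. The feature values, the zero threshold, and the function names `binarize`, `hamming`, and `rank_by_hamming` are all assumptions made for this example.

```python
def binarize(features):
    """Map a real-valued vector to a binary code by thresholding at zero (assumed rule)."""
    return [1 if x > 0 else 0 for x in features]


def hamming(a, b):
    """Count the positions where two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def rank_by_hamming(query, database):
    """Return database indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))


# Hypothetical real-valued encoder outputs for three videos (made-up numbers).
video_features = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.7, -0.6, 0.2, 0.1],
]
codes = [binarize(f) for f in video_features]

# Hypothetical query feature, binarized with the same rule.
query_code = binarize([0.8, -0.1, 0.5, -0.3])
ranking = rank_by_hamming(query_code, codes)
print(ranking)
```

Because comparison reduces to counting differing bits, retrieval over large collections stays cheap in both memory and time, which is the usual motivation for learning more discriminative codes.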

Supplementary Material

MP4 File (mmfp0387.mp4)
This is the presentation video of the paper "Unsupervised Video Hashing with Multi-granularity Contextualization and Multi-structure Preservation", which briefly introduces the paper from three aspects: background, methods and experiments. Our method simultaneously explores multiple axial contexts for discriminative video representation generation and various structural information for unsupervised learning. You can learn more about the paper through the video.

    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. feature contextualization
    2. hashing
    3. unsupervised learning
    4. video retrieval

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
