DOI: 10.1145/3652583.3658067
research-article
Open access

Anchor-aware Deep Metric Learning for Audio-visual Retrieval

Published: 07 June 2024

Abstract

Metric learning reduces the distance between similar (positive) pairs of data points and increases the distance between dissimilar (negative) pairs, with the aim of capturing the underlying data structure and improving performance on tasks such as audio-visual cross-modal retrieval (AV-CMR). Recent works use sampling methods to select impactful data points from the embedding space during training. However, because training data points are scarce, training cannot fully explore the space, which yields an incomplete representation of the overall positive and negative distributions. In this paper, we propose Anchor-aware Deep Metric Learning (AADML), a method that addresses this challenge by uncovering the underlying correlations among existing data points, thereby improving the quality of the shared embedding space. Specifically, our method builds a correlation graph-based manifold structure by considering the dependencies between each sample, taken as the anchor, and its semantically similar samples. An attention-driven mechanism dynamically weights the correlations within this manifold structure to produce an Anchor Awareness (AA) score for each anchor. These AA scores serve as data proxies for computing relative distances in metric learning approaches. Extensive experiments on two audio-visual benchmark datasets demonstrate the effectiveness of the proposed AADML method, which significantly surpasses state-of-the-art models. We also investigate integrating AA proxies with various metric learning methods, which further highlights the efficacy of our approach.
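
To make the proxy idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the dot-product attention, the `temperature` parameter, the use of class labels to decide which samples are "semantically similar", and the toy embeddings are all assumptions made for illustration. The sketch computes an attention-weighted proxy for each anchor from its same-label samples, then uses that proxy in place of the raw anchor in a standard triplet loss.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def anchor_aware_proxy(anchor_idx, embeddings, labels, temperature=1.0):
    """Hypothetical anchor-aware proxy (an assumption, not the paper's formula).

    Attention weights come from the similarity between the anchor and every
    sample that shares its label (the anchor itself included); the proxy is
    the weighted average of those samples.
    """
    anchor = embeddings[anchor_idx]
    neighbours = [e for e, y in zip(embeddings, labels)
                  if y == labels[anchor_idx]]
    weights = softmax([dot(anchor, n) / temperature for n in neighbours])
    return [sum(w * n[d] for w, n in zip(weights, neighbours))
            for d in range(len(anchor))]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss with a hinge at zero."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# Toy data (made up): three audio embeddings with labels, two visual embeddings.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_labels = [0, 0, 1]
visual_pos = [0.8, 0.2]   # visual sample with the anchor's label
visual_neg = [1.0, 1.0]   # visual sample with a different label

proxy = anchor_aware_proxy(0, audio, audio_labels)
print("proxy:", proxy)
print("loss with proxy:", triplet_loss(proxy, visual_pos, visual_neg))
```

Because the attention weights sum to one, the proxy stays inside the convex hull of the anchor's same-label samples. A label with a single sample therefore yields the sample itself as its proxy.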

    Published In

    ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval
    May 2024
    1379 pages
    ISBN:9798400706196
    DOI:10.1145/3652583
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. anchor-aware
    2. audio-visual retrieval
    3. deep metric learning
    4. triplet loss

    Qualifiers

    • Research-article

    Conference

    ICMR '24
    Acceptance Rates

    Overall Acceptance Rate 254 of 830 submissions, 31%
