research-article

Open access

Anchor-aware Deep Metric Learning for Audio-visual Retrieval

Authors:

Yi Yu

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval

Pages 211 - 219

https://doi.org/10.1145/3652583.3658067

Published: 07 June 2024

Abstract

Metric learning minimizes the gap between similar (positive) pairs of data points and increases the separation of dissimilar (negative) pairs, aiming at capturing the underlying data structure and enhancing the performance of tasks like audio-visual cross-modal retrieval (AV-CMR). Recent works employ sampling methods to select impactful data points from the embedding space during training. However, the model training fails to fully explore the space due to the scarcity of training data points, resulting in an incomplete representation of the overall positive and negative distributions. In this paper, we propose an innovative Anchor-aware Deep Metric Learning (AADML) method to address this challenge by uncovering the underlying correlations among existing data points, which enhances the quality of the shared embedding space. Specifically, our method establishes a correlation graph-based manifold structure by considering the dependencies between each sample as the anchor and its semantically similar samples. Through dynamic weighting of the correlations within this underlying manifold structure using an attention-driven mechanism, Anchor Awareness (AA) scores are obtained for each anchor. These AA scores serve as data proxies to compute relative distances in metric learning approaches. Extensive experiments conducted on two audio-visual benchmark datasets demonstrate the effectiveness of our proposed AADML method, significantly surpassing state-of-the-art models. Furthermore, we investigate the integration of AA proxies with various metric learning methods, further highlighting the efficacy of our approach.
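
The abstract describes the pipeline only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "semantically similar samples" means samples sharing the anchor's class label, that the attention-driven weighting is a scaled dot-product softmax over those neighbours, and that the resulting AA proxy replaces the raw embedding in a batch-hard triplet loss. The function names (`anchor_aware_proxies`, `proxy_triplet_loss`) and the `temperature` and `margin` parameters are hypothetical, and a single embedding matrix stands in for the separate audio and visual branches.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def anchor_aware_proxies(emb, labels, temperature=1.0):
    """Return one attention-weighted proxy per anchor (assumed formulation).

    emb:    (n, d) array of embeddings in the shared space.
    labels: (n,) array of class labels used to build the correlation graph.
    """
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    # Correlation graph: an edge links two samples iff they share a label.
    same = labels[:, None] == labels[None, :]
    # Scaled dot-product scores between each anchor and every sample.
    scores = emb @ emb.T / (np.sqrt(emb.shape[1]) * temperature)
    # Mask out samples that are not neighbours of the anchor in the graph.
    scores = np.where(same, scores, -np.inf)
    attn = softmax(scores, axis=1)
    # Each proxy is the attention-weighted mean of the anchor's neighbours.
    return attn @ emb, attn


def proxy_triplet_loss(proxies, labels, margin=0.2):
    """Batch-hard triplet loss computed on proxies instead of raw embeddings."""
    labels = np.asarray(labels)
    dist = np.linalg.norm(proxies[:, None, :] - proxies[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()


# Toy usage: two well-separated classes with two samples each.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
proxies, attn = anchor_aware_proxies(emb, labels)
loss = proxy_triplet_loss(proxies, labels)
```

In this sketch each proxy pools information from every same-class sample in the batch, which is one plausible reading of how the AA scores could compensate for sparse training points; the paper's actual graph construction and weighting may differ.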

Index Terms

Anchor-aware Deep Metric Learning for Audio-visual Retrieval
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Similarity measures

Published In

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval

May 2024

1379 pages

ISBN: 9798400706196

DOI: 10.1145/3652583

General Chairs: Cathal Gurrin (Dublin City University, Ireland), Rachada Kongkachandra (Thammasat University, Thailand), Klaus Schoeffmann (Klagenfurt University, Austria)

Program Chairs: Duc-Tien Dang-Nguyen (University of Bergen, Norway), Luca Rossetto (University of Zurich, Switzerland), Shin'ichi Satoh (National Institute of Informatics, Japan), Liting Zhou (Dublin City University, Ireland)

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICMR '24: International Conference on Multimedia Retrieval

June 10 - 14, 2024

Phuket, Thailand

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%
