DOI: 10.1145/3652583.3657627

FedPAM: Federated Personalized Augmentation Model for Text-to-Image Retrieval

Published: 07 June 2024

Abstract

CLIP-based models have made significant advances in text-to-image retrieval. However, these retrieval models are typically trained on public datasets by optimizing all parameters, which limits their ability to generalize and adapt quickly to personalized private datasets. In this paper, we introduce a lightweight personalized federated learning solution, the Federated Personalized Augmentation Model (FedPAM), to achieve personalized text-to-image retrieval over multiple private databases. Specifically, for each query text, we fetch the top-k most similar text-image pairs from the private database and use an attention-based module to generate personalized representations for different clients. The updated representation carries client-specific information for text-to-image matching, mitigating data heterogeneity. Additionally, we keep communication efficient and secure by fine-tuning only a small portion of the network parameters. Our experiments demonstrate the effectiveness of the proposed framework, with significant performance improvements over recently proposed methods: +5.36 on IAPR TC-12, +2.86 on CC3M, and +1.72 on Flickr30k.
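The abstract describes a retrieve-then-attend pipeline: fetch the top-k most similar cached text-image pairs for a query, then fuse them into a client-specific query representation with an attention module. The minimal sketch below illustrates that idea only; it is not the authors' implementation, and all names (`topk_retrieve`, `personalize`), the cosine-similarity retrieval, and the single residual attention step are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_retrieve(query, text_bank, k=3):
    # indices of the k cached text embeddings most cosine-similar to the query
    sims = text_bank @ query / (
        np.linalg.norm(text_bank, axis=1) * np.linalg.norm(query) + 1e-8
    )
    return np.argsort(-sims)[:k]

def personalize(query, text_bank, image_bank, k=3):
    # Retrieve the top-k cached text-image pairs from the client's private bank,
    # attend over the retrieved texts (keys), aggregate their paired image
    # embeddings (values), and add the result back as a residual update.
    idx = topk_retrieve(query, text_bank, k)
    keys, values = text_bank[idx], image_bank[idx]
    attn = softmax(keys @ query / np.sqrt(query.shape[0]))  # (k,) weights
    return query + attn @ values  # personalized query representation

# toy client-side data: 20 cached pairs of 8-dim embeddings
rng = np.random.default_rng(0)
text_bank = rng.normal(size=(20, 8))
image_bank = rng.normal(size=(20, 8))
query = rng.normal(size=8)
out = personalize(query, text_bank, image_bank)
print(out.shape)  # (8,)
```

In a federated setting, only such a small augmentation module would be trained and exchanged, while each client's retrieval bank stays local; this matches the paper's stated goal of fine-tuning a small portion of parameters for efficient, private communication.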


Cited By

  • Low-rank Prompt Interaction for Continual Vision-Language Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia (28 Oct 2024), 8257--8266. DOI: 10.1145/3664647.3681264

    Published In

    ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval
    May 2024
    1379 pages
    ISBN:9798400706196
    DOI:10.1145/3652583


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. contrastive pre-training
    2. federated learning
    3. information retrieval

    Qualifiers

    • Short-paper

    Conference

    ICMR '24

    Acceptance Rates

    Overall Acceptance Rate 254 of 830 submissions, 31%

    Article Metrics

    • Downloads (Last 12 months)125
    • Downloads (Last 6 weeks)18
    Reflects downloads up to 10 Nov 2024

