DOI: 10.1145/3524273.3528181

CrispSearch: low-latency on-device language-based image retrieval

Published: 05 August 2022

Abstract

Advances in deep learning have enabled accurate language-based search and retrieval, e.g., over user photos, in the cloud. However, many users prefer to keep their photos at home due to privacy concerns, creating a need for models that can perform cross-modal search on resource-limited devices. State-of-the-art cross-modal retrieval models achieve high accuracy by learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of prohibitively high retrieval latency. Alternatively, a newer class of methods offers good accuracy at low latency, but requires far more computational resources and an order of magnitude more training data (i.e., large web-scraped datasets of millions of image-caption pairs), making it infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods is suitable for building commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a lightweight, runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model filters out most irrelevant image candidates; only a handful of strong candidates are then sent to a fine model for re-ranking. Extensive experiments with two SOTA models for the fine re-ranking stage, on two standard benchmark datasets (MSCOCO and Flickr30k), show that CrispSearch achieves a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution for on-device search and retrieval.
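The cascade described above can be sketched in a few lines. The following Python is a minimal illustration under stated assumptions, not the paper's implementation: `encode_query`, `fine_score`, and the `top_k` value are hypothetical stand-ins for the coarse text encoder, the fine cross-modal scorer, and the candidate budget.

```python
import numpy as np

def coarse_scores(query_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Coarse stage: cosine similarity between the query embedding and
    precomputed image embeddings -- a single matrix-vector product."""
    q = query_emb / np.linalg.norm(query_emb)
    g = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return g @ q

def cascade_search(query, images, encode_query, image_embs, fine_score, top_k=20):
    """Cascaded retrieval sketch: filter with the cheap coarse model, then
    re-rank only the surviving candidates with the expensive fine model.
    `encode_query` and `fine_score` are assumed interfaces, not the paper's API."""
    scores = coarse_scores(encode_query(query), image_embs)
    # Keep the top_k coarse candidates; the fine model never sees the rest.
    candidates = np.argsort(-scores)[:top_k]
    # Re-rank the shortlist with the fine (e.g., cross-attention) scorer.
    return sorted(candidates,
                  key=lambda i: fine_score(query, images[i]),
                  reverse=True)
```

The latency saving comes from invoking the fine model, whose per-pair cost dominates, only `top_k` times rather than once per image in the gallery, while ranking accuracy rests almost entirely on the fine model over that small candidate set.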


Cited By

  • (2024) Realizing Efficient On-Device Language-based Image Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications. https://doi.org/10.1145/3649896. Online publication date: 15-Mar-2024.
  • (2023) AdaCLIP: Towards Pragmatic Multimodal Video Retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, 5623-5633. https://doi.org/10.1145/3581783.3612161. Online publication date: 26-Oct-2023.
  • (2023) Bi-Encoder Cascades for Efficient Image Search. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 1350-1355. https://doi.org/10.1109/ICCVW60793.2023.00146. Online publication date: 2-Oct-2023.


Published In

MMSys '22: Proceedings of the 13th ACM Multimedia Systems Conference
June 2022
432 pages
ISBN: 9781450392839
DOI: 10.1145/3524273


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. image retrieval
  2. multimodal learning

Qualifiers

  • Research-article

Conference

MMSys '22: 13th ACM Multimedia Systems Conference
June 14 - 17, 2022
Athlone, Ireland

Acceptance Rates

Overall acceptance rate: 176 of 530 submissions (33%)


Article Metrics

  • Downloads (last 12 months): 19
  • Downloads (last 6 weeks): 0
Reflects downloads up to 16 Jan 2025.

