DOI: 10.1145/3539618.3591930
research-article
Open access

Multimodal Neural Databases

Published: 18 July 2023

Abstract

The rise in loosely-structured data available through text, images, and other modalities has called for new ways of querying them. Multimedia Information Retrieval has filled this gap and has witnessed exciting progress in recent years. Tasks such as search and retrieval of extensive multimedia archives have undergone massive performance improvements, driven to a large extent by recent developments in multimodal deep learning. However, methods in this field remain limited in the kinds of queries they support; in particular, they are unable to answer database-like queries. For this reason, inspired by recent work on neural databases, we propose a new framework, which we name Multimodal Neural Databases (MMNDBs). MMNDBs can answer complex database-like queries that involve reasoning over different input modalities, such as text and images, at scale. In this paper, we present the first architecture able to fulfill this set of requirements and test it with several baselines, showing the limitations of currently available models. The results show the potential of these new techniques to process unstructured data coming from different modalities, paving the way for future research in the area.
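To make the target query type concrete, the sketch below shows how one database-like aggregate, e.g. "how many photos show a red car?", could be approximated with an off-the-shelf CLIP dual encoder: score every image against the textual predicate and count the matches. This is an illustrative baseline only, not the architecture proposed in the paper; the checkpoint name, the similarity threshold, and the count_matching_images helper are hypothetical choices made for this example.

    # Illustrative sketch, not the paper's MMNDB architecture: approximate a
    # count-style, database-like query over an image collection with a CLIP
    # dual encoder. Checkpoint, prompt, and threshold are assumptions.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def count_matching_images(image_paths, predicate, threshold=0.25):
        """Approximate SELECT COUNT(*) over images satisfying a text predicate."""
        images = [Image.open(p).convert("RGB") for p in image_paths]
        inputs = processor(text=[predicate], images=images,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        # Cosine similarity between each image embedding and the predicate text.
        sims = torch.nn.functional.cosine_similarity(
            out.image_embeds, out.text_embeds, dim=-1)
        # Images whose similarity clears the hand-picked threshold count as matches.
        return int((sims > threshold).sum().item())

    # Usage: count_matching_images(["a.jpg", "b.jpg"], "a photo of a red car")

A full MMNDB targets richer queries than this single-predicate count, combining retrieval with reasoning over multiple modalities at scale, which is the gap the paper identifies.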

Supplemental Material

MP4 File
PRESENTATION VIDEO - An overview of Multimodal Neural Databases (MMNDBs): the motivation for expressive, database-like queries over growing multimedia collections; how MMNDBs combine large multimodal models, multimedia information retrieval, and database query processing; and a discussion of their feasibility, the proposed architecture, and directions for future research, illustrated with real-world examples.


Information

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN:9781450394086
DOI:10.1145/3539618
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 July 2023

Author Tags

  1. databases
  2. multimedia information retrieval
  3. neural networks

Qualifiers

  • Research-article

Funding Sources

  • Progetti di Rilevante Interesse Nazionale
  • MUR National Recovery and Resilience Plan
  • EC H2020 RIA

Conference

SIGIR '23

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 448
  • Downloads (last 6 weeks): 49
Reflects downloads up to 13 Jan 2025


Cited By

  • (2024) IR-RAG @ SIGIR24: Information Retrieval's Role in RAG Systems. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3036-3039. DOI: 10.1145/3626772.3657984. Online publication date: 10-Jul-2024.
  • (2024) The Power of Noise: Redefining Retrieval for RAG Systems. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 719-729. DOI: 10.1145/3626772.3657834. Online publication date: 10-Jul-2024.
  • (2024) Sparse Vicious Attacks on Graph Neural Networks. IEEE Transactions on Artificial Intelligence, 5(5), 2293-2303. DOI: 10.1109/TAI.2023.3319306. Online publication date: May-2024.
  • (2024) The Impact of Source-Target Node Distance on Vicious Adversarial Attacks in Social Network Recommendation Systems. Advances on Graph-Based Approaches in Information Retrieval, 73-87. DOI: 10.1007/978-3-031-71382-8_6. Online publication date: 10-Oct-2024.
  • (2024) TEXT2TASTE: A Versatile Egocentric Vision System for Intelligent Reading Assistance Using Large Language Model. Computers Helping People with Special Needs, 285-291. DOI: 10.1007/978-3-031-62849-8_35. Online publication date: 8-Jul-2024.
  • (2023) MaaSDB: Spatial Databases in the Era of Large Language Models (Vision Paper). Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems, 1-4. DOI: 10.1145/3589132.3625597. Online publication date: 13-Nov-2023.
  • (2023) Prompt-to-OS (P2OS): Revolutionizing Operating Systems and Human-Computer Interaction with Integrated AI Generative Models. 2023 IEEE 5th International Conference on Cognitive Machine Intelligence (CogMI), 128-134. DOI: 10.1109/CogMI58952.2023.00027. Online publication date: 1-Nov-2023.
