nsDB: Architecting the Next Generation Database by Integrating Neural and Symbolic Systems

Published: 01 July 2024

Abstract

In this paper, we propose nsDB, a novel neuro-symbolic database system that natively integrates neural and symbolic system architectures to address the weaknesses of each, providing a strong database capable of data management, model learning, and complex analytical query processing over multi-modal data. We use a real-world NBA analytical query as an example to illustrate the functionality of each component in nsDB and to highlight the research challenges in building it. We then present the key design principles and our preliminary attempts to address these challenges.
In a nutshell, we envision that the next-generation database system nsDB integrates the complex neural system with the simple symbolic system. Undoubtedly, nsDB will serve as a bridge between databases and AI models, abstracting away the AI complexities while allowing end users to benefit from the strong capabilities of these models. We are in the early stages of building nsDB, and many open challenges remain, e.g., in-database model training, multi-objective query optimization, and database agent development. We hope that researchers from different communities (e.g., systems, architecture, databases, artificial intelligence) can tackle them together.
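
As a rough illustration of what combining a neural component with a symbolic one in a single analytical query might look like, the following minimal Python sketch may help. It is not nsDB's design or API: the `Clip` record, the `neural_detect_player` stub (a keyword matcher standing in for a real model), and the sample data are all hypothetical and chosen only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: int
    game: str      # structured attribute, usable by symbolic operators
    caption: str   # stand-in for unstructured content such as video frames


def neural_detect_player(clip: Clip) -> str:
    """Hypothetical stand-in for a neural model (e.g., a player recognizer).

    A real system would run model inference here; this stub only matches
    keywords so that the example stays self-contained and deterministic.
    """
    for name in ("player_a", "player_b"):
        if name in clip.caption.lower():
            return name
    return "unknown"


def count_clips_per_player(clips, game):
    """Symbolic selection and aggregation wrapped around a neural operator."""
    counts = defaultdict(int)
    for clip in clips:
        if clip.game != game:                 # cheap symbolic filter first
            continue
        player = neural_detect_player(clip)   # neural operator on survivors
        counts[player] += 1                   # symbolic group-by / count
    return dict(counts)


clips = [
    Clip(1, "game_1", "player_a hits a three"),
    Clip(2, "game_1", "player_b dunks"),
    Clip(3, "game_2", "player_a drives"),
    Clip(4, "game_1", "Player_A assist"),
]
result = count_clips_per_player(clips, "game_1")
print(result)  # {'player_a': 2, 'player_b': 1}
```

Applying the relational predicate before the model call is one simple ordering choice; deciding such orderings across cost and accuracy is the kind of question the multi-objective query optimization challenge above refers to.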

Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 11
July 2024
1039 pages

Publisher

VLDB Endowment
