Databases
- [1] arXiv:2407.20256 [pdf, other]
Title: Making LLMs Work for Enterprise Data Tasks
Comments: Poster at North East Database Day 2024
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Large language models (LLMs) know little about enterprise database tables in the private data ecosystem, which substantially differ from web text in structure and content. As LLMs' performance is tied to their training data, a crucial question is how useful they can be in improving enterprise database management and analysis tasks. To address this, we contribute experimental results on LLMs' performance for text-to-SQL and semantic column-type detection tasks on enterprise datasets. The performance of LLMs on enterprise data is significantly lower than on commonly used benchmark datasets. Informed by our findings and feedback from industry practitioners, we identify three fundamental challenges (latency, cost, and quality) and propose potential solutions to use LLMs in enterprise data workflows effectively.
- [2] arXiv:2407.20431 [pdf, other]
Title: Limitations of Validity Intervals in Data Freshness Management
Subjects: Databases (cs.DB)
In data-intensive real-time applications, such as smart transportation and manufacturing, ensuring data freshness is essential, as using obsolete data can lead to negative outcomes. Validity intervals serve as the standard means to specify freshness requirements in real-time databases. In this paper, we bring attention to significant drawbacks of validity intervals that have largely been unnoticed and introduce a new definition of data freshness, while discussing future research directions to address these limitations.
- [3] arXiv:2407.20782 [pdf, other]
Title: Boundedness for Unions of Conjunctive Regular Path Queries over Simple Regular Expressions
Subjects: Databases (cs.DB)
The problem of checking whether a recursive query can be rewritten as a query without recursion is a fundamental reasoning task, known as the boundedness problem. Here we study the boundedness problem for Unions of Conjunctive Regular Path Queries (UCRPQs), a navigational query language extensively used in ontology and graph database querying. The boundedness problem for UCRPQs is ExpSpace-complete. Here we focus our analysis on UCRPQs using simple regular expressions, which are of high practical relevance and enjoy a lower reasoning complexity. We show that the boundedness problem for this UCRPQ fragment is $\Pi^P_2$-complete, and that an equivalent bounded query can be produced in polynomial time whenever possible. When the query turns out to be unbounded, we also study the task of finding an equivalent maximally bounded query, which we show to be feasible in $\Pi^P_2$. As a side result of independent interest stemming from our developments, we study a notion of succinct finite automata and prove that its membership problem is in NP.
- [4] arXiv:2407.20932 [pdf, other]
Title: Complete Approximations of Incomplete Queries
Comments: accepted at RuleML+RR 2024
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
This paper studies the completeness of conjunctive queries over a partially complete database and the approximation of incomplete queries. Given a query and a set of completeness rules (a special kind of tuple generating dependencies) that specify which parts of the database are complete, we investigate whether the query can be fully answered, as if all data were available. If not, we explore reformulating the query into either Maximal Complete Specializations (MCSs) or the (unique up to equivalence) Minimal Complete Generalization (MCG) that can be fully answered, that is, the best complete approximations of the query from below or above in the sense of query containment. We show that the MCG can be characterized as the least fixed-point of a monotonic operator in a preorder. Then, we show that an MCS can be computed by recursive backward application of completeness rules. We study the complexity of both problems and discuss implementation techniques that rely on ASP and Prolog engines, respectively.
New submissions for Wednesday, 31 July 2024 (showing 4 of 4 entries)
- [5] arXiv:2407.20446 (cross-list from cs.CV) [pdf, other]
Title: MEVDT: Multi-Modal Event-Based Vehicle Detection and Tracking Dataset
Subjects: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
In this data article, we introduce the Multi-Modal Event-based Vehicle Detection and Tracking (MEVDT) dataset. This dataset provides a synchronized stream of event data and grayscale images of traffic scenes, captured using the Dynamic and Active-Pixel Vision Sensor (DAVIS) 240c hybrid event-based camera. MEVDT comprises 63 multi-modal sequences with approximately 13k images, 5M events, 10k object labels, and 85 unique object tracking trajectories. Additionally, MEVDT includes manually annotated ground truth labels (consisting of object classifications, pixel-precise bounding boxes, and unique object IDs), which are provided at a labeling frequency of 24 Hz. Designed to advance the research in the domain of event-based vision, MEVDT aims to address the critical need for high-quality, real-world annotated datasets that enable the development and evaluation of object detection and tracking algorithms in automotive environments.
- [6] arXiv:2407.20754 (cross-list from cs.LO) [pdf, other]
Title: Cost-Based Semantics for Querying Inconsistent Weighted Knowledge Bases
Comments: This is an extended version of a paper appearing at the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR 2024). 20 pages
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Databases (cs.DB)
In this paper, we explore a quantitative approach to querying inconsistent description logic knowledge bases. We consider weighted knowledge bases in which both axioms and assertions have (possibly infinite) weights, which are used to assign a cost to each interpretation based upon the axioms and assertions it violates. Two notions of certain and possible answer are defined by either considering interpretations whose cost does not exceed a given bound or restricting attention to optimal-cost interpretations. Our main contribution is a comprehensive analysis of the combined and data complexity of bounded cost satisfiability and certain and possible answer recognition, for description logics between $\mathcal{EL}_\bot$ and $\mathcal{ALCO}$.
Cross submissions for Wednesday, 31 July 2024 (showing 2 of 2 entries)
- [7] arXiv:2405.01510 (replaced) [pdf, other]
Title: Reverse Influential Community Search Over Social Networks (Technical Report)
Subjects: Social and Information Networks (cs.SI); Databases (cs.DB)
Influential community search, which retrieves a community with high structural cohesiveness and maximum influences on other users in social networks, is an important fundamental task in numerous real-world applications such as social network analysis and online advertising/marketing, and has been studied in several prior works. However, previous works usually considered the influences of the community on arbitrary users in social networks, rather than specific groups (e.g., customer groups, or senior communities). Inspired by this, we propose a novel Top-M Reverse Influential Community Search (TopM-RICS) problem, which obtains a seed community with the maximum influence on a user-specified target community, satisfying both structural and keyword constraints. To efficiently tackle the TopM-RICS problem, we design effective pruning strategies to filter out false alarms of candidate seed communities, and propose an effective index mechanism to facilitate the community retrieval. We also formulate and tackle a TopM-RICS variant, named Top-M Relaxed Reverse Influential Community Search (TopM-R2ICS), which returns top-M subgraphs with relaxed structural constraints and having the maximum influence on a user-specified target community. Comprehensive experiments have been conducted to verify the efficiency and effectiveness of our TopM-RICS and TopM-R2ICS approaches on both real-world and synthetic social networks under various parameter settings.
- [8] arXiv:2406.00019 (replaced) [pdf, other]
Title: EHR-SeqSQL : A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records
Comments: ACL 2024 (Findings)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
In this paper, we introduce EHR-SeqSQL, a novel sequential text-to-SQL dataset for Electronic Health Record (EHR) databases. EHR-SeqSQL is designed to address critical yet underexplored aspects in text-to-SQL parsing: interactivity, compositionality, and efficiency. To the best of our knowledge, EHR-SeqSQL is not only the largest but also the first medical text-to-SQL dataset benchmark to include sequential and contextual questions. We provide a data split and the new test set designed to assess compositional generalization ability. Our experiments demonstrate the superiority of a multi-turn approach over a single-turn approach in learning compositionality. Additionally, our dataset integrates specially crafted tokens into SQL queries to improve execution efficiency. With EHR-SeqSQL, we aim to bridge the gap between practical needs and academic research in the text-to-SQL domain. EHR-SeqSQL is available at this https URL.
- [9] arXiv:2406.13844 (replaced) [pdf, other]
Title: MAMA-MIA: A Large-Scale Multi-Center Breast Cancer DCE-MRI Benchmark Dataset with Expert Segmentations
Authors: Lidia Garrucho, Claire-Anne Reidel, Kaisar Kushibar, Smriti Joshi, Richard Osuala, Apostolia Tsirikoglou, Maciej Bobowicz, Javier del Riego, Alessandro Catanese, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Ritse Mann, Carlos Martín-Isla, Fred Prior, Kostas Marias, Martijn P.A. Starmans, Fredrik Strand, Oliver Díaz, Laura Igual, Karim Lekadir
Comments: 15 pages, 7 figures, 3 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
Current research in breast cancer Magnetic Resonance Imaging (MRI), especially with Artificial Intelligence (AI), faces challenges due to the lack of expert segmentations. To address this, we introduce the MAMA-MIA dataset, comprising 1506 multi-center dynamic contrast-enhanced MRI cases with expert segmentations of primary tumors and non-mass enhancement areas. These cases were sourced from four publicly available collections in The Cancer Imaging Archive (TCIA). Initially, we trained a deep learning model to automatically segment the cases, generating preliminary segmentations that significantly reduced expert segmentation time. Sixteen experts, averaging 9 years of experience in breast cancer, then corrected these segmentations, resulting in the final expert segmentations. Additionally, two radiologists conducted a visual inspection of the automatic segmentations to support future quality control studies. Alongside the expert segmentations, we provide 49 harmonized demographic and clinical variables and the pretrained weights of the well-known nnUNet architecture trained using the DCE-MRI full-images and expert segmentations. This dataset aims to accelerate the development and benchmarking of deep learning models and foster innovation in breast cancer diagnostics and treatment planning.