High-Dimensional Data Cubes
This paper introduces an approach to supporting high-dimensional data cubes at interactive query speeds and moderate storage cost. The approach is based on binary(-domain) data cubes that are judiciously partially materialized; the missing information ...
Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees
Mining time series motifs is a fundamental, yet expensive task in exploratory data analytics. In this paper, we therefore propose a fast method to find the top-k motifs with probabilistic guarantees. Our probabilistic approach is based on Locality ...
FEDEX: An Explainability Framework for Data Exploration Steps
When exploring a new dataset, data scientists often apply analysis queries, look for insights in the resulting dataframe, and repeat the process with further queries. In this paper, we propose a novel solution that assists data scientists in this laborious ...
Enabling Transparent Acceleration of Big Data Frameworks Using Heterogeneous Hardware
- Maria Xekalaki
- Juan Fumero
- Athanasios Stratikopoulos
- Katerina Doka
- Christos Katsakioris
- Constantinos Bitsakos
- Nectarios Koziris
- Christos Kotselidis
The ever-increasing demand for high-performance Big Data analytics and data processing has paved the way for heterogeneous hardware accelerators, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), to be integrated into ...
Discovering Polarization Niches via Dense Subgraphs with Attractors and Repulsers
Detecting niches of polarization in social media is a first step towards deploying mitigation strategies and avoiding radicalization. In this paper, we model polarization niches as close-knit dense communities of users, which are under the influence of ...
Sage: A System for Uncertain Network Analysis
We propose Sage, a system for uncertain network analysis. Algorithms for uncertain network analysis require large amounts of memory and computing resources as they sample a large number of network instances and run analysis on them. Sage makes uncertain ...
Mining Bursting Core in Large Temporal Graphs
Temporal graphs are ubiquitous. Mining communities that are bursting in a period of time is essential for seeking real emergency events in temporal graphs. Unfortunately, most previous studies on community mining in temporal networks ignore the bursting ...
Cost-Based or Learning-Based?: A Hybrid Query Optimizer for Query Plan Selection
Traditional cost-based optimizers are efficient and stable at generating optimal plans for simple SQL queries, but they may not generate high-quality plans for complicated queries. Thus, learning-based optimizers have recently been proposed that can learn ...
ONe Index for All Kernels (ONIAK): A Zero Re-Indexing LSH Solution to ANNS-ALT (After Linear Transformation)
In this work, we formulate and solve a new type of approximate nearest neighbor search (ANNS) problem called ANNS after linear transformation (ALT). In ANNS-ALT, we search for the vector (in a dataset) that, after being linearly transformed by a user-...
Learned Index Benefits: Machine Learning Based Index Performance Estimation
Index selection remains one of the most challenging problems in relational database management systems. To find an optimum index configuration for a workload, accurately and efficiently quantifying the benefits of each candidate index configuration is ...
Online Ridesharing with Meeting Points
Ridesharing has become a popular commuting mode. Dynamically arriving riders post their origins and destinations, and the platform then assigns drivers to serve them. In ridesharing, different groups of riders can be served by one driver if their ...
Exploiting the Power of Equality-Generating Dependencies in Ontological Reasoning
Equality-generating dependencies (EGDs) make it possible to fully exploit the power of existential quantification in ontological reasoning settings modeled via tuple-generating dependencies (TGDs), by enabling value-assignment or forcing the equivalence of fresh ...
No Repetition: Fast and Reliable Sampling with Highly Concentrated Hashing
- Anders Aamand
- Debarati Das
- Evangelos Kipouridis
- Jakob B. T. Knudsen
- Peter M. R. Rasmussen
- Mikkel Thorup
Stochastic sample-based estimators are among the most fundamental and universally applied tools in statistics. Such estimators are particularly important when processing huge amounts of data, where we need to be able to answer a wide range of ...
Witness Generation for JSON Schema
- Lyes Attouche
- Mohamed-Amine Baazizi
- Dario Colazzo
- Giorgio Ghelli
- Carlo Sartiani
- Stefanie Scherzinger
JSON Schema is a schema language for JSON documents, based on a complex combination of structural operators, Boolean operators (negation included), and recursive variables. The static analysis of JSON Schema documents comprises practically relevant ...
Towards Observability for Production Machine Learning Pipelines
Software organizations are increasingly incorporating machine learning (ML) into their product offerings, driving a need for new data management tools. Many of these tools facilitate the initial development of ML applications, but sustaining these ...
DINOMO: An Elastic, Scalable, High-Performance Key-Value Store for Disaggregated Persistent Memory
We present Dinomo, a novel key-value store for disaggregated persistent memory (DPM). Dinomo is the first key-value store for DPM that simultaneously achieves high common-case performance, scalability, and lightweight online reconfiguration. We observe ...
Bolt-on, Compact, and Rapid Program Slicing for Notebooks
Computational notebooks are commonly used for iterative workflows, such as in exploratory data analysis. This process lends itself to the accumulation of old code and hidden state, making it hard for users to reason about the lineage of, e.g., plots ...
Fairness Matters: A Tit-for-Tat Strategy Against Selfish Mining
Proof-of-work (PoW) based blockchains are more secure nowadays since profit-oriented miners contribute more computing power in exchange for fair revenues. This virtuous circle only works under an incentive-compatible consensus, which is found to be ...
SageDB: An Instance-Optimized Data Analytics System
- Jialin Ding
- Ryan Marcus
- Andreas Kipf
- Vikram Nathan
- Aniruddha Nrusimha
- Kapil Vaidya
- Alexander van Renen
- Tim Kraska
Modern data systems are typically both complex and general-purpose. They are complex because of the numerous internal knobs and parameters that users need to manually tune in order to achieve good performance; they are general-purpose because they are ...
Budget-Conscious Fine-Grained Configuration Optimization for Spatio-Temporal Applications
Because of the performance requirements of modern spatio-temporal data mining applications, in-memory database systems are often used to store and process the data. To efficiently utilize scarce DRAM capacity, modern database systems support various ...
Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming
Weak Supervision (WS) techniques allow users to efficiently create large training datasets by programmatically labeling data with heuristic sources of supervision. While the success of WS relies heavily on the provided labeling heuristics, the process ...