PACMMOD V2, N6 (SIGMOD), December 2024: Editorial
The Proceedings of the ACM on Management of Data (PACMMOD) is concerned with the principles, algorithms, techniques, systems, and applications of database management systems, data management technology, and science and engineering of data. It includes ...
A Universal Sketch for Estimating Heavy Hitters and Per-Element Frequency Moments in Data Streams with Bounded Deletions
In the field of data stream processing, there are two prevalent models, i.e., insertion-only, and turnstile models. Most previous works were proposed for the insertion-only model, which assumes new elements arrive continuously as a stream, and neglects ...
An Efficient and Exact Algorithm for Locally h-Clique Densest Subgraph Discovery
Detecting locally, non-overlapping, near-clique densest subgraphs is a crucial problem for community search in social networks. As a vertex may be involved in multiple overlapped local cliques, detecting locally densest sub-structures considering h-...
Buffered Persistence in B+ Trees
Non-volatile Memory (NVM) offers the opportunity to build large, durable B+ trees with markedly higher performance and faster post-crash recovery than is possible with traditional disk- or flash-based persistence. Unfortunately, cache flush and fence ...
Camel: Efficient Compression of Floating-Point Time Series
Time series compression encodes the information in a time-ordered sequence of data points into fewer bits, thereby reducing storage costs and possibly other costs. Compression methods are either general or XOR-based. General compression methods are time-...
Common Neighborhood Estimation over Bipartite Graphs under Local Differential Privacy
Bipartite graphs, formed by two vertex layers, arise as a natural fit for modeling the relationships between two groups of entities. In bipartite graphs, common neighborhood computation between two vertices on the same vertex layer is a basic operator, ...
Connectivity-Oriented Property Graph Partitioning for Distributed Graph Pattern Query Processing
Graph pattern query is a powerful tool for extracting crucial information from property graphs. With the exponential growth of sizes, property graphs are typically divided into multiple subgraphs (referred to as partitions) and stored across various ...
Constant-time Connectivity Querying in Dynamic Graphs
Connectivity query processing is a fundamental problem in graph processing. Given an undirected graph and two query vertices, the problem aims to identify whether they are connected via a path. Given frequent edge updates in real graph applications, in ...
CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning
Machine learning models are only as good as their training data. Simple models trained on well-chosen features extracted from the raw data often outperform complex models trained directly on the raw data. Data preparation pipelines, which clean and ...
Directional Queries: Making Top-k Queries More Effective in Discovering Relevant Results
Top-k queries, in particular those based on a linear scoring function, are a common way to extract relevant results from large datasets. Their major advantage over alternative approaches, such as skyline queries (which return all the undominated objects ...
Disclosure-Compliant Query Answering
In today's data-driven world, organizations face increasing pressure to comply with data disclosure policies, which require data masking measures and robust access control mechanisms. This paper presents Mascara, a middleware for specifying and enforcing ...
DPconv: Super-Polynomially Faster Join Ordering
We revisit the join ordering problem in query optimization. The standard exact algorithm, DPccp, has a worst-case running time of O(3^n). This is prohibitively expensive for large queries, which are not that uncommon anymore. We develop a new algorithmic ...
Finding Logic Bugs in Spatial Database Engines via Affine Equivalent Inputs
Spatial Database Management Systems (SDBMSs) aim to store, manipulate, and retrieve spatial data. SDBMSs are employed in various modern applications, such as geographic information systems, computer-aided design tools, and location-based services. ...
GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models
Data quality is critical across many applications. The utility of data is undermined by various errors, making rigorous data cleaning a necessity. Traditional data cleaning systems depend heavily on predefined rules and constraints, which necessitate ...
GOLAP: A GPU-in-Data-Path Architecture for High-Speed OLAP
In this paper, we suggest a novel GPU-in-data-path architecture that leverages a GPU to accelerate the I/O path and thus can achieve almost in-memory bandwidth using SSDs. In this architecture, the main idea is to stream data in heavy-weight compressed ...
High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance
This paper aims to bridge the gap between fast in-memory query engines and slow but robust engines that can utilize external storage. We find that current systems have to choose between fast in-memory operators and slower out-of-memory operators. We ...
iRangeGraph: Improvising Range-dedicated Graphs for Range-filtering Nearest Neighbor Search
Range-filtering approximate nearest neighbor (RFANN) search is attracting increasing attention in academia and industry. Given a set of data objects, each being a pair of a high-dimensional vector and a numeric value, an RFANN query with a vector and a ...
Live Patching for Distributed In-Memory Key-Value Stores
Providers of high-availability data stores need to roll out software updates without causing noticeable downtime. For distributed data stores like Redis Cluster, the state of the art is a rolling update, where the nodes are restarted in sequence. This ...
Transforming RDF Graphs to Property Graphs using Standardized Schemas
Knowledge Graphs can be encoded using different data models. They are especially abundant using RDF and recently also as property graphs. While knowledge graphs in RDF adhere to the subject-predicate-object structure, property graphs utilize multi-...
LSMGraph: A High-Performance Dynamic Graph Storage System with Multi-Level CSR
Song Yu, Shufeng Gong, Qian Tao, Sijie Shen, Yanfeng Zhang, Wenyuan Yu, Pengxi Liu, Zhixin Zhang, Hongfu Li, Xiaojian Luo, Ge Yu, Jingren Zhou
The growing volume of graph data may exhaust the main memory. It is crucial to design a disk-based graph storage system to ingest updates and analyze graphs efficiently. However, existing dynamic graph storage systems suffer from read or write ...
Memento Filter: A Fast, Dynamic, and Robust Range Filter
Range filters are probabilistic data structures that answer approximate range emptiness queries. They aid in avoiding processing empty range queries and have use cases in many application domains such as key-value stores and social web analytics. However,...
Multivariate Time Series Cleaning under Speed Constraints
Errors are common in time series due to unreliable sensor measurements. Existing methods focus on univariate data but do not utilize the correlation between dimensions. Cleaning each dimension separately may lead to a less accurate result, as some errors ...
Navigating Labels and Vectors: A Unified Approach to Filtered Approximate Nearest Neighbor Search
Given a query vector, approximate nearest neighbor search (ANNS) aims to retrieve similar vectors from a set of high-dimensional base vectors. However, many real-world applications jointly query both vector data and structured data, imposing label ...
Online Detection of Anomalies in Temporal Knowledge Graphs with Interpretability
Temporal knowledge graphs (TKGs) are valuable resources for capturing evolving relationships among entities, yet they are often plagued by noise, necessitating robust anomaly detection mechanisms. Existing dynamic graph anomaly detection approaches ...
Pasta: A Cost-Based Optimizer for Generating Pipelining Schedules for Dataflow DAGs
Data analytics tasks are often formulated as data workflows represented as directed acyclic graphs (DAGs) of operators. The recent trend of adopting machine learning (ML) techniques in workflows results in increasingly complicated DAGs with many ...
Personalized Truncation for Personalized Privacy
In the standard model of differential privacy (DP), every user's privacy is treated equally, which is captured by a single privacy parameter ε. However, in many real-world situations, users may have diverse privacy concerns and requirements, ...
Provenance-Enabled Explainable AI
Machine learning (ML) algorithms have advanced significantly in recent years, progressively evolving into artificial intelligence (AI) agents capable of solving complex, human-like intellectual challenges. Despite the advancements, the interpretability ...
SPID-Join: A Skew-resistant Processing-in-DIMM Join Algorithm Exploiting the Bank- and Rank-level Parallelisms of DIMMs
Suhyun Lee, Chaemin Lim, Jinwoo Choi, Heelim Choi, Chan Lee, Yongjun Park, Kwanghyun Park, Hanjun Kim, Youngsok Kim
Recent advances in Dual In-line Memory Modules (DIMMs) allow DIMMs to support Processing-In-DIMM (PID) by placing In-DIMM Processors (IDPs) near their memory banks. Prior studies have shown that in-memory joins can benefit from PID by offloading their ...
Towards a Converged Relational-Graph Optimization Framework
The recent ISO SQL:2023 standard adopts SQL/PGQ (Property Graph Queries), facilitating graph-like querying within relational databases. This advancement, however, underscores a significant gap in how to effectively optimize SQL/PGQ queries within ...
Understanding and Reusing Test Suites Across Database Systems
Database Management System (DBMS) developers have implemented extensive test suites to test their DBMSs. For example, the SQLite test suites contain over 92 million lines of code. Despite these extensive efforts, test suites are not systematically reused ...