PACMMOD Volume 2 Issue 1: Editorial
Welcome to Issue 1 of Volume 2 of the Proceedings of the ACM on Management of Data, which has papers from the third round of submissions to the SIGMOD research track.
Out of 230 submissions in this round, whose submission deadline was July 15, 2023, a ...
Optimizing Distributed Protocols with Query Rewrites
- David C.Y. Chu
- Rithvik Panchapakesan
- Shadaj Laddad
- Lucky E. Katahanas
- Chris Liu
- Kaushik Shivakumar
- Natacha Crooks
- Joseph M. Hellerstein
- Heidi Howard
Distributed protocols such as 2PC and Paxos lie at the core of many systems in the cloud, but standard implementations do not scale. New scalable distributed protocols are developed through careful analysis and rewrites, but this process is ad hoc and ...
Grafite: Taming Adversarial Queries with Optimal Range Filters
Range filters allow checking whether a query range intersects a given set of keys with a chance of returning a false positive answer, thus generalising the functionality of Bloom filters from point to range queries. Existing practical range filters have ...
High-performance Effective Scientific Error-bounded Lossy Compression with Auto-tuned Multi-component Interpolation
- Jinyang Liu
- Sheng Di
- Kai Zhao
- Xin Liang
- Sian Jin
- Zizhe Jian
- Jiajun Huang
- Shixun Wu
- Zizhong Chen
- Franck Cappello
Error-bounded lossy compression has been identified as a promising solution for significantly reducing scientific data volumes upon users' requirements on data distortion. For the existing scientific error-bounded lossy compressors, some of them (such as ...
MWP: Multi-Window Parallel Evaluation of Regular Path Queries on Streaming Graphs
A persistent Regular Path Query (RPQ) on a streaming graph is to continuously find every pair of vertices that are connected by a path in the graph within a sliding window, such that the edge label sequence of this path matches a given regular ...
Proximity Queries on Point Clouds using Rapid Construction Path Oracle
The prevalence of computer graphics technology boosts the developments of point clouds in recent years, which offer advantages over terrain surfaces (represented by Triangular Irregular Networks, i.e., TINs) in proximity queries, including the shortest ...
Efficient k-Clique Listing: An Edge-Oriented Branching Strategy
k-clique listing is a vital graph mining operator with diverse applications in various networks. The state-of-the-art algorithms all adopt a branch-and-bound (BB) framework with a vertex-oriented branching strategy (called VBBkC), which forms a sub-...
Relative Keys: Putting Feature Explanation into Context
Formal feature explanations strictly maintain perfect conformity but are intractable to compute, while heuristic methods are much faster but can lead to problematic explanations due to lack of conformity guarantees. We propose relative keys that have the ...
NOC-NOC: Towards Performance-optimal Distributed Transactions
Substantial research efforts have been devoted to studying the performance optimality problem for distributed database transactions. However, they focus just on optimizing transactional reads, and thus overlook crucial factors, such as the efficiency of ...
Robustness of Updatable Learning-based Index Advisors against Poisoning Attack
Despite the promising performance of recent learning-based Index Advisors (IAs), they exhibited the robustness issue when poisoning attacks polluted training data. This paper presents the first attempt to study the robustness of updatable learning-based ...
FedKNN: Secure Federated k-Nearest Neighbor Search
Nearest neighbor search is a fundamental task in various domains, such as federated learning, data mining, information retrieval, and biomedicine. With the increasing need to utilize data from different organizations while respecting privacy regulations, ...
FineMon: An Innovative Adaptive Network Telemetry Scheme for Fine-Grained, Multi-Metric Data Monitoring with Dynamic Frequency Adjustment and Enhanced Data Recovery
Network telemetry, characterized by its efficient push model and high-performance communication protocol (gRPC), offers a new avenue for collecting fine-grained real-time data. Despite its advantages, existing network telemetry systems lack a theoretical ...
PECJ: Stream Window Join on Disorder Data Streams with Proactive Error Compensation
Stream Window Join (SWJ), a vital operation in stream analytics, struggles with achieving a balance between accuracy and latency due to out-of-order data arrivals. Existing methods predominantly rely on adaptive buffering, but often fall short in ...
Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment
- Mengzhao Wang
- Weizhi Xu
- Xiaomeng Yi
- Songlin Wu
- Zhangyang Peng
- Xiangyu Ke
- Yunjun Gao
- Xiaoliang Xu
- Rentong Guo
- Charles Xie
High-dimensional vector similarity search (HVSS) is gaining prominence as a powerful tool for various data science and AI applications. As vector data scales up, in-memory indexes pose a significant challenge due to the substantial increase in main ...
One Seed, Two Birds: A Unified Learned Structure for Exact and Approximate Counting
The modern database has many precise and approximate counting requirements. Nevertheless, a solitary multidimensional index or cardinality estimator is insufficient to cater to the escalating demands across all counting scenarios. Such approaches are ...
Optimizing Nested Recursive Queries
Datalog is a declarative programming language that has gained popularity in various domains due to its simplicity, expressiveness, and efficiency. But "pure" Datalog is limited to monotone queries, and cannot be used in most practical applications. For ...
Sub-optimal Join Order Identification with L1-error
Q-error -- the standard metric for quantifying the error of individual cardinality estimates -- has been widely adopted as a surrogate for query plan optimality in recent work on learning-based cardinality estimation. However, the only result connecting ...
Efficient Algorithm for K-Multiple-Means
K-Multiple-Means is an extension of K-means for the clustering of multiple means used in many applications, such as image segmentation, load balancing, and blind-source separation. Since K-means uses only one mean to represent each cluster, it fails to ...
Predictive and Near-Optimal Sampling for View Materialization in Video Databases
Scalable video query optimization has re-emerged as an attractive research topic in recent years. The OTIF system, a video database with cutting-edge efficiency, has introduced a new paradigm of utilizing view materialization to facilitate online query ...
LIT: Lightning-fast In-memory Temporal Indexing
We study the problem of temporal database indexing, i.e., indexing versions of a database table in an evolving database. With the larger and cheaper memory chips nowadays, we can afford to keep track of all versions of an evolving table in memory. This ...
Optimizing Dataflow Systems for Scalable Interactive Visualization
Supporting the interactive exploration of large datasets is a popular and challenging use case for data management systems. Traditionally, the interface and the back-end system are built and optimized separately, and interface design and system ...
Efficient Distributed Hop-Constrained Path Enumeration on Large-Scale Graphs
The enumeration of hop-constrained simple paths is a building block in many graph-based areas. Due to the enormous search spaces in large-scale graphs, a single machine can hardly satisfy the requirements of both efficiency and memory, which causes an ...
Efficient High-Quality Clustering for Large Bipartite Graphs
A bipartite graph contains inter-set edges between two disjoint vertex sets, and is widely used to model real-world data, such as user-item purchase records, author-article publications, and biological interactions between drugs and proteins. k-Bipartite ...
DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models
Many organizations rely on data from government and third-party sources, and those sources rarely follow the same data formatting. This introduces challenges in integrating data from multiple sources or aligning external sources with internal databases. ...
Determining Exact Quantiles with Randomized Summaries
Quantiles are fundamental statistics in various data science tasks, but costly to compute, e.g., by loading the entire data in memory for ranking. With limited memory space, prevalent in end devices or databases with heavy loads, it needs to scan the ...
An LDP Compatible Sketch for Securely Approximating Set Intersection Cardinalities
Given two sets of elements held by two different parties separately, computing the cardinality (i.e., the number of distinct elements) of their intersection set is a fundamental task in applications such as network monitoring and database systems. To ...
Spruce: a Fast yet Space-saving Structure for Dynamic Graph Storage
Dynamic graphs have been gaining increasing popularity across various application domains. With the growing size of these graphs, the update performance as well as space occupancy is becoming a crucial aspect of dynamic graph storage. Although existing ...
Controllable Tabular Data Synthesis Using Diffusion Models
Controllable tabular data synthesis plays a crucial role in numerous applications by allowing users to generate synthetic data with specific conditions. These conditions can include synthesizing tuples with predefined attribute values or creating tuples ...
HERO: A Hierarchical Set Partitioning and Join Framework for Speeding up the Set Intersection Over Graphs
As one of the most primitive operators in graph algorithms, such as the triangle counting, maximal clique enumeration, and subgraph listing, a set intersection operator returns common vertices between any two given sets of vertices in data graphs. It is ...
Local Differentially Private Heavy Hitter Detection in Data Streams with Bounded Memory
Top-k frequent items detection is a fundamental task in data stream mining. Many promising solutions are proposed to improve memory efficiency while still maintaining high accuracy for detecting the Top-k items. Despite the memory efficiency concern, the ...