PACMMOD Volume 1 Issue 4: Editorial
Welcome to this issue of the Proceedings of the ACM on Management of Data (Volume 1, Issue 4 (SIGMOD)). While this issue has papers from the SIGMOD track, PACMMOD will soon also have issues with papers from the newly created PODS track. Out of 189 ...
GEqO: ML-Accelerated Semantic Equivalence Detection
Large scale analytics engines have become a core dependency for modern data-driven enterprises to derive business insights and drive actions. These engines support a large number of analytic jobs processing huge volumes of data on a daily basis, and ...
The Battleship Approach to the Low Resource Entity Matching Problem
Entity matching, a core data integration problem, is the task of deciding whether two data tuples refer to the same real-world entity. Recent advances in deep learning methods, using pre-trained language models, were proposed for resolving entity ...
Udon: Efficient Debugging of User-Defined Functions in Big Data Systems with Line-by-Line Control
Many big data systems are written in languages such as C, C++, Java, and Scala to process large amounts of data efficiently, while data analysts often use Python to conduct data wrangling, statistical analysis, and machine learning. User-defined ...
ChainKV: A Semantics-Aware Key-Value Store for Ethereum System
The Log-Structured Merge tree (LSM-tree) based key-value (KV) store has been widely adopted as the storage engine for blockchain systems, such as Ethereum, in which blockchain data are uniformly transformed into randomly distributed KV items for ...
Proving Query Equivalence Using Linear Integer Arithmetic
Proving the equivalence between SQL queries is a fundamental problem in database research. Existing solvers model queries using algebraic representations and convert such representations into first-order logic formulas so that query equivalence can be ...
A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations
What is a minimal set of tuples to delete from a database in order to eliminate all query answers? This problem is called "the resilience of a query" and is one of the key algorithmic problems underlying various forms of reverse data management, such as ...
ADGNN: Towards Scalable GNN Training with Aggregation-Difference Aware Sampling
Distributed computing is promising to enable large-scale graph neural network (GNN) model training. However, care is needed to avoid excessive computational and communication overheads. Sampling is promising in terms of enabling scalability, and sampling ...
ALP: Adaptive Lossless floating-Point Compression
IEEE 754 doubles do not exactly represent most real values, introducing rounding errors in computations and [de]serialization to text. These rounding errors inhibit the use of existing lightweight compression schemes such as Delta and Frame Of Reference (...
Anchor: A Library for Building Secure Persistent Memory Systems
Cloud infrastructure is experiencing a shift towards disaggregated setups, especially with the introduction of the Compute Express Link (CXL) technology, where byte-addressable persistent memory (PM) is becoming prominent. To fully utilize the potential ...
AS-Parser: Log Parsing Based on Adaptive Segmentation
System logs have long been recognized as valuable data for analyzing and diagnosing system failures. One fundamental task of log processing is to convert unstructured logs into structured logs through log parsing. All previous log parsing approaches ...
Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools
Analytical query workloads are prone to rapid fluctuations in resource demands. These rapid, hard to predict resource demand changes make provisioning a challenge. Users must either over provision at excessive cost or suffer poor query latency when ...
ChainedFilter: Combining Membership Filters by Chain Rule
Membership (membership query/membership testing) is a fundamental problem across databases, networks and security. However, previous research has primarily focused on either approximate solutions, such as Bloom Filters, or exact methods, like perfect ...
Correlation Joins over Time Series Data Streams Utilizing Complementary Dimension Reduction and Transformation
A common analysis task over a stream of time series is to find all pairs of windows whose correlation is above a given threshold. For a large number of streams, doing so naively, i.e., checking the Cartesian product, is too expensive. In essence, finding ...
Demystifying the QoS and QoE of Edge-hosted Video Streaming Applications in the Wild with SNESet
- Yanan Li,
- Guangqing Deng,
- Changming Bai,
- Jingyu Yang,
- Gang Wang,
- Hao Zhang,
- Jin Bai,
- Haitao Yuan,
- Mengwei Xu,
- Shangguang Wang
Video streaming applications (VSAs) are increasingly being deployed on large-scale edge platforms, which have the potential to significantly improve the quality of service (QoS) and end-user experience (QoE), ultimately maximizing business outcomes. ...
DGC: Training Dynamic Graphs with Spatio-Temporal Non-Uniformity using Graph Partitioning by Chunks
Dynamic Graph Neural Network (DGNN) has shown a strong capability of learning dynamic graphs by exploiting both spatial and temporal features. Although DGNN has recently received considerable attention from the AI community and various DGNN models have been ...
DP-starJ: A Differential Private Scheme towards Analytical Star-Join Queries
Star-join query is the fundamental task in data warehouse and has wide applications in On-line Analytical Processing (OLAP) scenarios. Due to the large number of foreign key constraints and the asymmetric effect in the neighboring instance between the ...
Efficient Approximation Framework for Attribute Recommendation
Trend analysis is a fundamental type of analytical query in online analytical processing (OLAP) systems. In trend analysis, a key step is to identify k valuable attributes whose distributions in two subsets under different predicates significantly differ ...
Equitable Top-k Results for Long Tail Data
For datasets exhibiting long tail phenomenon, we identify a fairness concern in existing top-k algorithms, that return a "fixed" set of k results for a given query. This causes a handful of popular records (products, items, etc) getting overexposed and ...
F3KM: Federated, Fair, and Fast k-means
This paper proposes a federated, fair, and fast k-means algorithm (F3KM) to solve the fair clustering problem efficiently in scenarios where data cannot be shared among different parties. The proposed algorithm decomposes the fair k-means problem into ...
FACET: Robust Counterfactual Explanation Analytics
Machine learning systems are deployed in domains such as hiring and healthcare, where undesired classifications can have serious ramifications for the user. Thus, there is a rising demand for explainable AI systems which provide actionable steps for lay ...
Generation of Training Examples for Tabular Natural Language Inference
Tabular data is becoming increasingly important in Natural Language Processing (NLP) tasks, such as Tabular Natural Language Inference (TNLI). Given a table and a hypothesis expressed in NL text, the goal is to assess if the former structured data ...
Hierarchical Cut Labelling - Scaling Up Distance Queries on Road Networks
Answering the shortest-path distance between two arbitrary locations is a fundamental problem in road networks. Labelling-based solutions are the current state of the art to render fast response time, which can generally be categorised into hub-based ...
High-Ratio Compression for Machine-Generated Data
- Jiujing Zhang,
- Zhitao Shen,
- Shiyu Yang,
- Lingkai Meng,
- Chuan Xiao,
- Wei Jia,
- Yue Li,
- Qinhui Sun,
- Wenjie Zhang,
- Xuemin Lin
Machine-generated data is rapidly growing and poses challenges for data-intensive systems, especially as the growth of data outpaces the growth of storage space. To cope with the storage issue, compression plays a critical role in storage engines, ...
HongTu: Scalable Full-Graph GNN Training on Multiple GPUs
Full-graph training on graph neural networks (GNN) has emerged as a promising training method for its effectiveness. Full-graph training requires extensive memory and computation resources. To accelerate this training process, researchers have proposed ...
Lemo: A Cache-Enhanced Learned Optimizer for Concurrent Queries
With the expansion of modern database services, multi-user access has become a crucial feature in various practical application scenarios, including enterprise applications and e-commerce platforms. However, if multiple users submit queries within a ...
Lightweight Materialization for Fast Dashboards Over Joins
Dashboards are vital in modern business intelligence tools, providing non-technical users with an interface to access comprehensive business data. With the rise of cloud technology, there is an increased number of data sources to provide enriched ...
MirrorKV: An Efficient Key-Value Store on Hybrid Cloud Storage with Balanced Performance of Compaction and Querying
LSM-based key-value stores have been leveraged in many state-of-the-art data-intensive applications as storage engines. As data volume scales up, a cost-efficient approach is to deploy these applications on hybrid cloud storage with hot/cold separation, ...
MOST: Model-Based Compression with Outlier Storage for Time Series Data
Time series data are used in a wide variety of applications. The explosive growth of the amount of time series data poses a significant challenge in efficient data storage and query processing. Unfortunately, existing compression techniques either show ...
Neural Attributed Community Search at Billion Scale
Community search has been extensively studied in the past decades. In recent years, there is a growing interest in attributed community search that aims to identify a community based on both the query nodes and query attributes. A set of techniques have ...