PACMMOD Volume 1, Issue 3: Editorial
We are excited to introduce this new issue of PACMMOD (Proceedings of the ACM on Management of Data). PACMMOD is a new journal, concerned with the principles, algorithms, techniques, systems, and applications of database management systems, data ...
AirIndex: Versatile Index Tuning Through Data and Storage
The end-to-end lookup latency of a hierarchical index (such as a B-tree or a learned index) is determined by its structure, such as the number of layers, the kinds of branching functions appearing in each layer, the amount of data we must fetch from ...
Closest Pairs Search Over Data Stream
k-closest pair (KCP for short) search is a fundamental problem in database research. Given a set of d-dimensional streaming data S, KCP search aims to retrieve k pairs with the shortest distances between them. While existing works have studied continuous ...
BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach
- Zhen Zheng,
- Zaifeng Pan,
- Dalin Wang,
- Kai Zhu,
- Wenyi Zhao,
- Tianyou Guo,
- Xiafei Qiu,
- Minmin Sun,
- Junjie Bai,
- Feng Zhang,
- Xiaoyong Du,
- Jidong Zhai,
- Wei Lin
Compiler optimization plays an increasingly important role in boosting the performance of machine learning models for data processing and management. With increasingly complex data, the dynamic tensor shape phenomenon emerges for ML models. However, ...
Efficient Algorithm for Budgeted Adaptive Influence Maximization: An Incremental RR-set Update Approach
Given a graph G, a cost associated with each node, and a budget B, the budgeted influence maximization (BIM) problem aims to find the optimal set S of seed nodes that maximizes the influence among all possible sets such that the total cost of nodes in S is no ...
Efficient Core Maintenance in Large Bipartite Graphs
As an important cohesive subgraph model in bipartite graphs, the (α, β)-core (a.k.a. bi-core) has found a wide spectrum of real-world applications, such as product recommendation, fraudster detection, and community search. In these applications, the ...
Efficient Maximum k-Defective Clique Computation with Improved Time Complexity
k-defective cliques relax cliques by allowing up to k edges to be missing from a complete graph. This relaxation enables us to find larger near-cliques and has applications in link prediction, cluster detection, social network analysis and transportation ...
Enriching Recommendation Models with Logic Conditions
This paper proposes RecLogic, a framework for improving the accuracy of machine learning (ML) models for recommendation. It aims to enhance existing ML models with logic conditions to reduce false positives and false negatives, without training a new ...
Fast Maximal Quasi-clique Enumeration: A Pruning and Branching Co-Design Approach
Mining cohesive subgraphs from a graph is a fundamental problem in graph data analysis. One notable cohesive structure is the γ-quasi-clique (QC), where each vertex connects to at least a fraction γ of the other vertices inside. Enumerating maximal γ-quasi-...
FedCSS: Joint Client-and-Sample Selection for Hard Sample-Aware Noise-Robust Federated Learning
Federated Learning (FL) enables a large number of data owners (a.k.a. FL clients) to jointly train a machine learning model without disclosing private local data. The importance of local data samples to the FL model varies widely. This is exacerbated by ...
Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads
LSM-trees are widely adopted as the storage backend of key-value stores. However, optimizing the system performance under dynamic workloads has not been sufficiently studied or evaluated in previous work. To fill the gap, we present RusKey, a key-value ...
Memory-Efficient and Flexible Detection of Heavy Hitters in High-Speed Networks
Heavy-hitter detection is a fundamental task in network traffic measurement and security. Existing work faces the dilemma of either suffering from dynamic and imbalanced traffic characteristics or lowering the detection efficiency and flexibility. In this paper, we ...
Modularity-based Hypergraph Clustering: Random Hypergraph Model, Hyperedge-cluster Relation, and Computation
A graph models the connections among objects. One important graph analytical task is clustering, which partitions a data graph into clusters with dense inner-cluster connections. A line of clustering methods maximizes a function called modularity. Modularity-based ...
OptiQL: Robust Optimistic Locking for Memory-Optimized Indexes
Modern memory-optimized indexes often use optimistic locks for concurrent accesses. Read operations can proceed optimistically without taking the lock, greatly improving performance on multicore CPUs. But this is at the cost of robustness against ...
Origin-Destination Travel Time Oracle for Map-based Services
Given an origin (O), a destination (D), and a departure time (T), an Origin-Destination (OD) travel time oracle (ODT-Oracle) returns an estimate of the time it takes to travel from O to D when departing at T. ODT-Oracles serve important purposes in map-...
SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications
In the exploratory data science lifecycle, data scientists often spend the majority of their time finding, integrating, validating and cleaning relevant datasets. Despite recent work on data validation, and numerous error detection and correction ...
Secure Sampling for Approximate Multi-party Query Processing
We study the problem of random sampling in the secure multi-party computation (MPC) model. In MPC, taking a sample securely must have a cost of Ω(n) irrespective of the sample size s. This is in stark contrast with the plaintext setting, where a sample can ...
SH2O: Efficient Data Access for Work-Sharing Databases
Interactive applications require processing tens to hundreds of concurrent analytical queries within tight time constraints. In such setups, where high concurrency causes contention, work-sharing databases are critical for improving scalability and for ...
TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs
We introduce TeraHAC, a (1+ε)-approximate hierarchical agglomerative clustering (HAC) algorithm which scales to trillion-edge graphs. Our algorithm is based on a new approach to computing (1+ε)-approximate HAC, which is a novel combination of the nearest-...