How to design robust algorithms using noisy comparison Oracle
Metric-based comparison operations such as finding maximum, nearest and farthest neighbor are fundamental to studying various clustering techniques such as k-center clustering and agglomerative hierarchical clustering. These techniques crucially rely on ...
SAND: streaming subsequence anomaly detection
With the increasing demand for real-time analytics and decision making, anomaly detection methods need to operate over streams of values and handle drifts in data distribution. Unfortunately, existing approaches have severe limitations: they either ...
Optimizing fitness-for-use of differentially private linear queries
In practice, differentially private data releases are designed to support a variety of applications. A data release is fit for use if it meets target accuracy requirements for each application. In this paper, we consider the problem of answering linear ...
Cryptanalysis of an encrypted database in SIGMOD '14
Encrypted database is an innovative technology proposed to solve the data confidentiality issue in cloud-based DB systems. It allows a data owner to encrypt its database before uploading it to the service provider; and it allows the service provider to ...
Unconstrained submodular maximization with modular costs: tight approximation and application to profit maximization
Given a set V, the problem of unconstrained submodular maximization with modular costs (USM-MC) asks for a subset S ⊆ V that maximizes f(S) - c(S), where f is a non-negative, monotone, and submodular function that gauges the utility of S, and c is a non-...
Distributed deep learning on data systems: a comparative analysis of approaches
- Yuhao Zhang
- Frank McQuillan
- Nandish Jayaram
- Nikhil Kak
- Ekta Khanna
- Orhan Kislal
- Domino Valdano
- Arun Kumar
Deep learning (DL) is growing in popularity for many data analytics applications, including among enterprises. Large business-critical datasets in such settings typically reside in RDBMSs or other data systems. The DB community has long aimed to bring ...
PR-sketch: monitoring per-key aggregation of streaming data with nearly full accuracy
Computing per-key aggregation is indispensable in streaming data analysis formulated as two phases, an update phase and a recovery phase. As the size and speed of data streams rise, accurate per-key information is useful in many applications like ...
Tensors: an abstraction for general data processing
- Dimitrios Koutsoukos
- Supun Nakandala
- Konstantinos Karanasos
- Karla Saur
- Gustavo Alonso
- Matteo Interlandi
Deep Learning (DL) has created a growing demand for simpler ways to develop complex models and efficient ways to execute them. Thus, a significant effort has gone into frameworks like PyTorch or TensorFlow to support a variety of DL models and run ...
Budget sharing for multi-analyst differential privacy
Large organizations that collect data about populations (like the US Census Bureau) release summary statistics that are used by multiple stakeholders for resource allocation and policy making problems. These organizations are also legally required to ...
In the land of data streams where synopses are missing, one framework to bring them all
In pursuit of real-time data analysis, approximate summarization structures, i.e., synopses, have gained importance over the years. However, existing stream processing systems, such as Flink, Spark, and Storm, do not support synopses as first class ...
Data acquisition for improving machine learning models
The vast advances in Machine Learning (ML) over the last ten years have been powered by the availability of suitably prepared data for training purposes. The future of ML-enabled enterprise hinges on data. As such, there is already a vibrant market ...
Efficiently answering reachability and path queries on temporal bipartite graphs
Bipartite graphs are naturally used to model relationships between two different types of entities, such as people-location, author-paper, and customer-product. When modeling real-world applications like disease outbreaks, edges are often enriched with ...
Preference queries over taxonomic domains
When composing multiple preferences characterizing the most suitable results for a user, several issues may arise. Indeed, preferences can be partially contradictory, suffer from a mismatch with the level of detail of the actual data, and even lack ...
Revisiting the design of LSM-tree Based OLTP storage engine with persistent memory
- Baoyue Yan
- Xuntao Cheng
- Bo Jiang
- Shibin Chen
- Canfang Shang
- Jianying Wang
- Gui Huang
- Xinjun Yang
- Wei Cao
- Feifei Li
The recent byte-addressable and large-capacity commercialized persistent memory (PM) is promising to drive database as a service (DBaaS) into uncharted territories. This paper investigates how to leverage PMs to revisit the conventional LSM-tree based ...
Kamino: constraint-aware differentially private data synthesis
Organizations are increasingly relying on data to support decisions. When data contains private and sensitive information, the data owner often desires to publish a synthetic database instance that is similarly useful as the true data, while ensuring ...
Towards cost-effective and elastic cloud database deployment via memory disaggregation
- Yingqiang Zhang
- Chaoyi Ruan
- Cheng Li
- Xinjun Yang
- Wei Cao
- Feifei Li
- Bo Wang
- Jing Fang
- Yuhui Wang
- Jingze Huo
- Chao Bi
It is challenging for cloud-native relational databases to meet the ever-increasing needs of scaling compute and memory resources independently and elastically. The recent emergence of memory disaggregation architecture, relying on high-speed RDMA ...
Dual-objective fine-tuning of BERT for entity matching
An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in the respective domain. This means for data integration that shared identifiers are often available for a ...