Tabular data synthesis with generative adversarial networks: design space and optimizations
The proliferation of big data has brought an urgent demand for privacy-preserving data publishing. Traditional solutions to this demand have limitations on effectively balancing the trade-off between privacy and utility of the released data. To ...
MinJoin++: a fast algorithm for string similarity joins under edit distance
We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity joins is a fundamental problem in databases, data mining and bioinformatics. It finds many applications in data cleaning and integration, ...
xDBTagger: explainable natural language interface to databases using keyword mappings and schema graph
Recently, numerous studies have been proposed to attack the natural language interfaces to databases (NLIDB) problem by researchers either as a conventional pipeline-based or an end-to-end deep-learning-based solution. Although each approach has ...
A quantitative evaluation of persistent memory hash indexes
Persistent memory (PMem) is increasingly being leveraged to build hash-based indexing structures featuring cheap persistence, high performance, and instant recovery. Especially with the release of Intel Optane DC Persistent Memory Modules, we have ...
Eris: efficiently measuring discord in multidimensional sources
Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching and data fusion. To solve the latter, it is mostly assumed that ground truth can be determined. However, in general, the data ...
A systematic evaluation of machine learning on serverless infrastructure
- Jiawei Jiang
- Shaoduo Gan
- Bo Du
- Gustavo Alonso
- Ana Klimovic
- Ankit Singla
- Wentao Wu
- Sheng Wang
- Ce Zhang
Recently, the serverless paradigm of computing has inspired research on its applicability to data-intensive tasks such as ETL, database query processing, and machine learning (ML) model training. Recent efforts have proposed multiple systems for ...
A survey on transactional stream processing
Transactional stream processing (TSP) strives to create a cohesive model that merges the advantages of both transactional and stream-oriented guarantees. Over the past decade, numerous endeavors have contributed to the evolution of TSP solutions, ...
Efficient detection of multivariate correlations with different correlation measures
Correlation analysis is an invaluable tool in many domains, for better understanding the data and extracting salient insights. Most works to date focus on detecting high pairwise correlations. A generalization of this problem with known ...
A survey on the evolution of stream processing systems
Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a ...
RCBench: an RDMA-enabled transaction framework for analyzing concurrency control algorithms
- Hongyao Zhao
- Jingyao Li
- Wei Lu
- Qian Zhang
- Wanqing Yang
- Jiajia Zhong
- Meihui Zhang
- Haixiang Li
- Xiaoyong Du
- Anqun Pan
Distributed transaction processing over the TCP/IP network suffers from the weak transaction scalability problem, i.e., its performance drops significantly when the number of involved data nodes per transaction increases. Although quite a few of ...