Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems
- Lijie Xu
- Shuang Qiu
- Binhang Yuan
- Jiawei Jiang
- Cedric Renggli
- Shaoduo Gan
- Kaan Kara
- Guoliang Li
- Ji Liu
- Wentao Wu
- Jieping Ye
- Ce Zhang
Data distribution tailoring revisited: cost-efficient integration of representative data
Data scientists often develop data sets for analysis by drawing upon available data sources. A major challenge is ensuring that the data set used for analysis adequately represents relevant demographic groups or other variables. Whether data is ...
Hyper-distance oracles in hypergraphs
We study point-to-point distance estimation in hypergraphs, where the query is parameterized by a positive integer s, which defines the required level of overlap for two hyperedges to be considered adjacent. To answer s-distance queries, we first ...
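To make the role of the parameter s concrete, here is a minimal Python sketch of an exact baseline, not the oracle the article proposes. It assumes that "overlap" means the number of shared vertices and that distance is measured in hops between hyperedges; the function names `s_line_graph` and `s_distance` are hypothetical.

```python
from collections import deque
from itertools import combinations


def s_line_graph(hyperedges, s):
    """Link two hyperedge indices when they share at least s vertices (assumed meaning of overlap)."""
    adj = {i: set() for i in range(len(hyperedges))}
    for i, j in combinations(range(len(hyperedges)), 2):
        if len(hyperedges[i] & hyperedges[j]) >= s:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def s_distance(hyperedges, s, src, dst):
    """Exact BFS hop count between hyperedges src and dst; None if no s-path exists."""
    adj = s_line_graph(hyperedges, s)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


# Toy hypergraph: each hyperedge is a set of vertex labels.
edges = [frozenset("abc"), frozenset("bcd"), frozenset("de"), frozenset("cdf")]
print(s_distance(edges, 1, 0, 2))  # s = 1: any shared vertex counts
print(s_distance(edges, 2, 0, 2))  # s = 2: stricter overlap can disconnect pairs
```

Raising s removes edges from the underlying adjacency structure, so distances can only grow or become undefined. The pairwise construction here is quadratic in the number of hyperedges, which is the kind of cost a distance oracle would aim to avoid.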
Similarity-driven and task-driven models for diversity of opinion in crowdsourcing markets
The recent boom in crowdsourcing has opened up a new avenue for utilizing human intelligence in the realm of data analysis. This innovative approach provides a powerful means for connecting online workers to tasks that cannot effectively be done ...
Efficient algorithms for reachability and path queries on temporal bipartite graphs
Bipartite graphs are naturally used to model relationships between two types of entities, such as people-location, user-post, and investor-stock. When modeling real-world applications like disease outbreaks, edges are often enriched with temporal ...
A survey on hybrid transactional and analytical processing
To provide applications with the ability to analyze fresh data and eliminate the time-consuming ETL workflow, hybrid transactional and analytical (HTAP) systems have been developed to serve online transaction processing and online analytical ...
Minimum motif-cut: a workload-aware RDF graph partitioning strategy
In designing a distributed RDF system, it is quite common to divide an RDF graph into subgraphs, called partitions, which are then distributed. Graph partitioning in general and RDF graph partitioning in particular are challenging problems. In ...
Flexible grouping of linear segments for highly accurate lossy compression of time series data
Approximating a series of timestamped data points through a sequence of line segments with a maximum error guarantee is a fundamental data compression problem, termed Piecewise Linear Approximation (PLA). As the demand for analyzing large ...
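The PLA problem statement can be illustrated with a simple greedy baseline in Python. This is a sketch under stated assumptions, not the flexible grouping method of the article: timestamps are assumed strictly increasing, each segment is the straight line joining the first and last point it covers, and the names `max_deviation` and `greedy_pla` are hypothetical.

```python
def max_deviation(ts, vs, i, j):
    """Largest absolute error of points i..j against the line through points i and j."""
    if i == j:
        return 0.0
    slope = (vs[j] - vs[i]) / (ts[j] - ts[i])
    return max(abs(vs[i] + slope * (ts[k] - ts[i]) - vs[k]) for k in range(i, j + 1))


def greedy_pla(ts, vs, eps):
    """Split the series into index ranges (i, j), each approximated within eps by one segment."""
    segments = []
    i, n = 0, len(ts)
    while i < n:
        j = i
        # Extend the segment one point at a time while the error bound still holds.
        while j + 1 < n and max_deviation(ts, vs, i, j + 1) <= eps:
            j += 1
        segments.append((i, j))
        i = j + 1
    return segments


ts = [0, 1, 2, 3, 4, 5]
vs = [0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
print(greedy_pla(ts, vs, eps=0.1))
```

Every emitted segment satisfies the error bound by construction, but the greedy choice does not minimize the number of segments; that gap is the usual motivation for more refined PLA algorithms.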
Survey of vector database management systems
There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. But embedding-based retrieval has been studied for over ten years, and similarity search a staggering half century and more. ...
FedST: secure federated shapelet transformation for time series classification
This paper explores how to build a shapelet-based time series classification (TSC) model in the federated learning (FL) scenario, that is, using more data from multiple owners without actually sharing the data. We propose FedST, a novel federated ...
Open benchmark for filtering techniques in entity resolution
- Franziska Neuhof
- Marco Fisichella
- George Papadakis
- Konstantinos Nikoletos
- Nikolaus Augsten
- Wolfgang Nejdl
- Manolis Koubarakis
Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the ...
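The contrast between brute-force comparison and filtering can be sketched with token blocking, one well-known filtering approach. It is chosen here only for illustration and is not claimed to be among the techniques the benchmark evaluates; the function name `token_blocking` and the sample profiles are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Return candidate pairs of profile ids that share at least one lowercase token."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for token in text.lower().split():
            blocks[token].add(pid)
    candidates = set()
    for ids in blocks.values():
        # Only profiles inside the same block are ever compared.
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))
    return candidates


profiles = {
    1: "Apple iPhone 13",
    2: "iphone 13 apple",
    3: "Samsung Galaxy S21",
    4: "Galaxy S21 by Samsung",
}
all_pairs = set(combinations(sorted(profiles), 2))  # brute force: every pair
candidates = token_blocking(profiles)
print(len(all_pairs), len(candidates))
```

On this toy input the filter keeps only the pairs that share a token, while the brute-force set grows quadratically with the number of profiles.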
AutoCTS++: zero-shot joint neural architecture and hyperparameter search for correlated time series forecasting
- Xinle Wu
- Xingjian Wu
- Bin Yang
- Lekui Zhou
- Chenjuan Guo
- Xiangfei Qiu
- Jilin Hu
- Zhenli Sheng
- Christian S. Jensen
Sensors in cyber-physical systems often capture interconnected processes and thus emit correlated time series (CTS), the forecasting of which enables important applications. Recent deep-learning-based forecasting methods show strong capabilities ...