Memory management techniques for large-scale persistent-main-memory systems
Storage Class Memory (SCM) is a novel class of memory technologies that promise to revolutionize database architectures. SCM is byte-addressable and exhibits latencies similar to those of DRAM, while being non-volatile. Hence, SCM could replace both ...
Trajectory similarity join in spatial networks
The matching of similar pairs of objects, called similarity join, is a fundamental functionality in data management. We consider the case of trajectory similarity join (TS-Join), where the objects are trajectories of vehicles moving in road networks. Thus,...
HoloClean: holistic data repairs with probabilistic inference
We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies qualitative data repairing, which relies on integrity constraints or external data sources, with quantitative data repairing methods, ...
Caribou: intelligent distributed storage
The ever increasing amount of data being handled in data centers causes an intrinsic inefficiency: moving data around is expensive in terms of bandwidth, latency, and power consumption, especially given the low computational complexity of many database ...
Towards linear algebra over normalized data
Providing machine learning (ML) over relational data is a mainstream requirement for data analytics systems. While almost all ML tools require the input data to be presented as a single table, many datasets are multi-table. This forces data scientists ...
Comparative evaluation of big-data systems on scientific image analytics workloads
- Parmita Mehta,
- Sven Dorkenwald,
- Dongfang Zhao,
- Tomer Kaftan,
- Alvin Cheung,
- Magdalena Balazinska,
- Ariel Rokem,
- Andrew Connolly,
- Jacob Vanderplas,
- Yusra AlSayyad
Scientific discoveries are increasingly driven by analyzing large volumes of image data. Many new libraries and specialized database management systems (DBMSs) have emerged to support such tasks. It is unclear how well these systems support real-world ...
Revenue maximization in incentivized social advertising
Incentivized social advertising, an emerging marketing model, provides monetization opportunities not only to the owners of the social networking platforms but also to their influential users by offering a "cut" on the advertising revenue. We consider a ...
SquirrelJoin: network-aware distributed join processing with lazy partitioning
To execute distributed joins in parallel on compute clusters, systems partition and exchange data records between workers. With large datasets, workers spend a considerable amount of time transferring data over the network. When compute clusters are ...
I've seen "enough": incrementally improving visualizations to support rapid decision making
- Sajjadur Rahman,
- Maryam Aliakbarpour,
- Ha Kyung Kong,
- Eric Blais,
- Karrie Karahalios,
- Aditya Parameswaran,
- Ronitt Rubinfeld
Data visualization is an effective mechanism for identifying trends, insights, and anomalies in data. On large datasets, however, generating visualizations can take a long time, delaying the extraction of insights, hampering decision making, and ...
Minimal on-road time route scheduling on time-dependent graphs
On time-dependent graphs, fastest path query is an important problem and has been well studied. It focuses on minimizing the total travel time (waiting time + on-road time) but does not allow waiting on any intermediate vertex if the FIFO property is ...
A holistic view of stream partitioning costs
Stream processing has become the dominant processing model for monitoring and real-time analytics. Modern Parallel Stream Processing Engines (pSPEs) have made it feasible to increase the performance in both monitoring and analytical queries by ...
Truss-based community search: a truss-equivalence based indexing approach
We consider the community search problem defined upon a large graph G: given a query vertex q in G, to find as output all the densely connected subgraphs of G, each of which contains the query vertex q. As an online, query-dependent variant of the well-known ...
Query optimization for dynamic imputation
Missing values are common in data analysis and present a usability challenge. Users are forced to pick between removing tuples with missing values or creating a cleaned version of their data by applying a relatively expensive imputation strategy. Our ...
In search of an entity resolution OASIS: optimal asymptotic sequential importance sampling
Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing platforms acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching ...
Flexible online task assignment in real-time spatial data
The popularity of Online To Offline (O2O) service platforms has spurred the need for online task assignment in real-time spatial data, where streams of spatially distributed tasks and workers are matched in real time such that the total number of ...
A forward scan based plane sweep algorithm for parallel interval joins
The interval join is a basic operation that finds application in temporal, spatial, and uncertain databases. Although a number of centralized and distributed algorithms have been proposed for the efficient evaluation of interval joins, classic plane ...
ASAP: prioritizing attention via time series smoothing
Time series visualization of streaming telemetry (i.e., charting of key metrics such as server load over time) is increasingly prevalent in modern data platforms and applications. However, many existing systems simply plot the raw data streams as they ...
Knowledge verification for long-tail verticals
Collecting structured knowledge for real-world entities has become a critical task for many applications. A big gap between the knowledge in existing knowledge repositories and the knowledge in the real world is the knowledge on tail verticals (i.e., ...
SkyGraph: retrieving regions of interest using skyline subgraph queries
Several services today are annotated with points of interest (PoIs) such as "coffee shop", "park", etc. A region of interest (RoI) is a neighborhood that contains PoIs relevant to the user. In this paper, we study the scenario where a user wants to ...
Reverse engineering aggregation queries
Query reverse engineering seeks to re-generate the SQL query that produced a given query output table from a given database. In this paper, we solve this problem for OLAP queries with group-by and aggregation. We develop a novel three-phase algorithm ...
LDA*: a robust and large-scale topic modeling system
We present LDA*, a system that has been deployed in one of the largest Internet companies to fulfil their requirements of "topic modeling as an internal service"---relying on thousands of machines, engineers in different sectors submit their data, some ...
Social hash partitioner: a scalable distributed hypergraph partitioner
We design and implement a distributed algorithm for balanced k-way hypergraph partitioning that minimizes fanout, a fundamental hypergraph quantity also known as the communication volume and (k - 1)-cut metric, by optimizing a novel objective called ...
On sampling from massive graph streams
We propose Graph Priority Sampling (GPS), a new paradigm for order-based reservoir sampling from massive graph streams. GPS provides a general way to weight edge sampling according to auxiliary and/or size variables so as to accomplish various ...
Pyramid sketch: a sketch framework for frequency estimation of data streams
Sketch is a probabilistic data structure, and is used to store and query the frequency of any item in a given multiset. Due to its high memory efficiency, it has been applied to various fields in computer science, such as stream database, network ...
Reconciling skyline and ranking queries
Traditionally, skyline and ranking queries have been treated separately as alternative ways of discovering interesting data in potentially large datasets. While ranking queries adopt a specific scoring function to rank tuples, skyline queries return the ...
CleanM: an optimizable query language for unified scale-out data cleaning
Data cleaning has become an indispensable part of data analysis due to the increasing amount of dirty data. Data scientists spend most of their time preparing dirty data before it can be used for data analysis. At the same time, the existing tools that ...
Distributed trajectory similarity search
Mobile and sensing devices have already become ubiquitous. They have made tracking moving objects an easy task. As a result, mobile applications like Uber and many IoT projects have generated massive amounts of trajectory data that can no longer be ...
Runtime optimization of join location in parallel data management systems
Applications running on parallel systems often need to join a streaming relation or a stored relation with data indexed in a parallel data storage system. Some applications also compute UDFs on the joined tuples. The join can be done at the data storage ...
Stitching web tables for improving matching quality
HTML tables on web pages ("web tables") cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be ...
DigitHist: a histogram-based data summary with tight error bounds
We propose DigitHist, a histogram summary for selectivity estimation on multi-dimensional data with tight error bounds. By combining multi-dimensional and one-dimensional histograms along regular grids of different resolutions, DigitHist provides an ...