PVLDB: Vol 10, No 11

Volume 10, Issue 11

August 2017

Editors:

Peter Boncz (CWI), Ken Salem (University of Waterloo)

Publisher:

VLDB Endowment

ISSN: 2150-8097

research-article

Memory management techniques for large-scale persistent-main-memory systems

Pages 1166–1177, https://doi.org/10.14778/3137628.3137629

Storage Class Memory (SCM) is a novel class of memory technologies that promise to revolutionize database architectures. SCM is byte-addressable and exhibits latencies similar to those of DRAM, while being non-volatile. Hence, SCM could replace both ...

research-article

Trajectory similarity join in spatial networks

Pages 1178–1189, https://doi.org/10.14778/3137628.3137630

The matching of similar pairs of objects, called similarity join, is fundamental functionality in data management. We consider the case of trajectory similarity join (TS-Join), where the objects are trajectories of vehicles moving in road networks. Thus,...

research-article

HoloClean: holistic data repairs with probabilistic inference

Pages 1190–1201, https://doi.org/10.14778/3137628.3137631

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies qualitative data repairing, which relies on integrity constraints or external data sources, with quantitative data repairing methods, ...

research-article

Caribou: intelligent distributed storage

Pages 1202–1213, https://doi.org/10.14778/3137628.3137632

The ever increasing amount of data being handled in data centers causes an intrinsic inefficiency: moving data around is expensive in terms of bandwidth, latency, and power consumption, especially given the low computational complexity of many database ...

research-article

Towards linear algebra over normalized data

Pages 1214–1225, https://doi.org/10.14778/3137628.3137633

Providing machine learning (ML) over relational data is a mainstream requirement for data analytics systems. While almost all ML tools require the input data to be presented as a single table, many datasets are multi-table. This forces data scientists ...

research-article

Comparative evaluation of big-data systems on scientific image analytics workloads

Pages 1226–1237, https://doi.org/10.14778/3137628.3137634

Scientific discoveries are increasingly driven by analyzing large volumes of image data. Many new libraries and specialized database management systems (DBMSs) have emerged to support such tasks. It is unclear how well these systems support real-world ...

research-article

Revenue maximization in incentivized social advertising

Pages 1238–1249, https://doi.org/10.14778/3137628.3137635

Incentivized social advertising, an emerging marketing model, provides monetization opportunities not only to the owners of the social networking platforms but also to their influential users by offering a "cut" on the advertising revenue. We consider a ...

research-article

SquirrelJoin: network-aware distributed join processing with lazy partitioning

Pages 1250–1261, https://doi.org/10.14778/3137628.3137636

To execute distributed joins in parallel on compute clusters, systems partition and exchange data records between workers. With large datasets, workers spend a considerable amount of time transferring data over the network. When compute clusters are ...

research-article

I've seen "enough": incrementally improving visualizations to support rapid decision making

Pages 1262–1273, https://doi.org/10.14778/3137628.3137637

Data visualization is an effective mechanism for identifying trends, insights, and anomalies in data. On large datasets, however, generating visualizations can take a long time, delaying the extraction of insights, hampering decision making, and ...

research-article

Minimal on-road time route scheduling on time-dependent graphs

Pages 1274–1285, https://doi.org/10.14778/3137628.3137638

On time-dependent graphs, fastest path query is an important problem and has been well studied. It focuses on minimizing the total travel time (waiting time + on-road time) but does not allow waiting on any intermediate vertex if the FIFO property is ...

research-article

A holistic view of stream partitioning costs

Pages 1286–1297, https://doi.org/10.14778/3137628.3137639

Stream processing has become the dominant processing model for monitoring and real-time analytics. Modern Parallel Stream Processing Engines (pSPEs) have made it feasible to increase the performance in both monitoring and analytical queries by ...

research-article

Truss-based community search: a truss-equivalence based indexing approach

Pages 1298–1309, https://doi.org/10.14778/3137628.3137640

We consider the community search problem defined upon a large graph G: given a query vertex q in G, to find as output all the densely connected subgraphs of G, each of which contains the query vertex q. As an online, query-dependent variant of the well-known ...

research-article

Query optimization for dynamic imputation

Pages 1310–1321, https://doi.org/10.14778/3137628.3137641

Missing values are common in data analysis and present a usability challenge. Users are forced to pick between removing tuples with missing values or creating a cleaned version of their data by applying a relatively expensive imputation strategy. Our ...

research-article

In search of an entity resolution OASIS: optimal asymptotic sequential importance sampling

Pages 1322–1333, https://doi.org/10.14778/3137628.3137642

Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing platforms acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching ...

research-article

Flexible online task assignment in real-time spatial data

Pages 1334–1345, https://doi.org/10.14778/3137628.3137643

The popularity of Online To Offline (O2O) service platforms has spurred the need for online task assignment in real-time spatial data, where streams of spatially distributed tasks and workers are matched in real time such that the total number of ...

research-article

A forward scan based plane sweep algorithm for parallel interval joins

Pages 1346–1357, https://doi.org/10.14778/3137628.3137644

The interval join is a basic operation that finds application in temporal, spatial, and uncertain databases. Although a number of centralized and distributed algorithms have been proposed for the efficient evaluation of interval joins, classic plane ...

research-article

ASAP: prioritizing attention via time series smoothing

Pages 1358–1369, https://doi.org/10.14778/3137628.3137645

Time series visualization of streaming telemetry (i.e., charting of key metrics such as server load over time) is increasingly prevalent in modern data platforms and applications. However, many existing systems simply plot the raw data streams as they ...

research-article

Knowledge verification for long-tail verticals

Pages 1370–1381, https://doi.org/10.14778/3137628.3137646

Collecting structured knowledge for real-world entities has become a critical task for many applications. A big gap between the knowledge in existing knowledge repositories and the knowledge in the real world is the knowledge on tail verticals (i.e., ...

research-article

SkyGraph: retrieving regions of interest using skyline subgraph queries

Pages 1382–1393, https://doi.org/10.14778/3137628.3137647

Several services today are annotated with points of interest (PoIs) such as "coffee shop", "park", etc. A region of interest (RoI) is a neighborhood that contains PoIs relevant to the user. In this paper, we study the scenario where a user wants to ...

research-article

Reverse engineering aggregation queries

Pages 1394–1405, https://doi.org/10.14778/3137628.3137648

Query reverse engineering seeks to re-generate the SQL query that produced a given query output table from a given database. In this paper, we solve this problem for OLAP queries with group-by and aggregation. We develop a novel three-phase algorithm ...

research-article

LDA*: a robust and large-scale topic modeling system

Pages 1406–1417, https://doi.org/10.14778/3137628.3137649

We present LDA*, a system that has been deployed in one of the largest Internet companies to fulfil their requirements of "topic modeling as an internal service". Relying on thousands of machines, engineers in different sectors submit their data, some ...

research-article

Social hash partitioner: a scalable distributed hypergraph partitioner

Pages 1418–1429, https://doi.org/10.14778/3137628.3137650

We design and implement a distributed algorithm for balanced k-way hypergraph partitioning that minimizes fanout, a fundamental hypergraph quantity also known as the communication volume and (k - 1)-cut metric, by optimizing a novel objective called ...

research-article

On sampling from massive graph streams

Pages 1430–1441, https://doi.org/10.14778/3137628.3137651

We propose Graph Priority Sampling (GPS), a new paradigm for order-based reservoir sampling from massive graph streams. GPS provides a general way to weight edge sampling according to auxiliary and/or size variables so as to accomplish various ...

research-article

Pyramid sketch: a sketch framework for frequency estimation of data streams

Pages 1442–1453, https://doi.org/10.14778/3137628.3137652

A sketch is a probabilistic data structure used to store and query the frequency of any item in a given multiset. Due to its high memory efficiency, it has been applied to various fields in computer science, such as stream databases, network ...

research-article

Reconciling skyline and ranking queries

Pages 1454–1465, https://doi.org/10.14778/3137628.3137653

Traditionally, skyline and ranking queries have been treated separately as alternative ways of discovering interesting data in potentially large datasets. While ranking queries adopt a specific scoring function to rank tuples, skyline queries return the ...

research-article

CleanM: an optimizable query language for unified scale-out data cleaning

Pages 1466–1477, https://doi.org/10.14778/3137628.3137654

Data cleaning has become an indispensable part of data analysis due to the increasing amount of dirty data. Data scientists spend most of their time preparing dirty data before it can be used for data analysis. At the same time, the existing tools that ...

research-article

Distributed trajectory similarity search

Pages 1478–1489, https://doi.org/10.14778/3137628.3137655

Mobile and sensing devices have already become ubiquitous. They have made tracking moving objects an easy task. As a result, mobile applications like Uber and many IoT projects have generated massive amounts of trajectory data that can no longer be ...

research-article

Runtime optimization of join location in parallel data management systems

Pages 1490–1501, https://doi.org/10.14778/3137628.3137656

Applications running on parallel systems often need to join a streaming relation or a stored relation with data indexed in a parallel data storage system. Some applications also compute UDFs on the joined tuples. The join can be done at the data storage ...

research-article

Stitching web tables for improving matching quality

Pages 1502–1513, https://doi.org/10.14778/3137628.3137657

HTML tables on web pages ("web tables") cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be ...

research-article

DigitHist: a histogram-based data summary with tight error bounds

Pages 1514–1525, https://doi.org/10.14778/3137628.3137658

We propose DigitHist, a histogram summary for selectivity estimation on multi-dimensional data with tight error bounds. By combining multi-dimensional and one-dimensional histograms along regular grids of different resolutions, DigitHist provides an ...

Proceedings of the VLDB Endowment
