short-paper

Public Access

Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Pages 285–295, https://doi.org/10.1145/3318464.3389785

Today, data analysts largely rely on intuition to determine whether missing or withheld rows of a dataset significantly affect their analyses. We propose a framework that can produce automatic contingency analysis, i.e., the range of values an aggregate ...

research-article

Debunking Four Long-Standing Misconceptions of Time-Series Distance Measures

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Pages 1887–1905, https://doi.org/10.1145/3318464.3389760

Distance measures are core building blocks in time-series analysis and the subject of active research for decades. Unfortunately, the most detailed experimental study in this area is outdated (over a decade old) and, naturally, does not reflect recent ...

research-article

Public Access

Thrifty Query Execution via Incrementability

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Pages 1241–1256, https://doi.org/10.1145/3318464.3389756

Many applications schedule queries before all data is ready. To return fast query results, database systems can eagerly process existing data and incrementally incorporate new data into prior intermediate results, which often relies on incremental view ...

research-article

Public Access

ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, Pages 2117–2120, https://doi.org/10.1145/2882903.2899409

Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, ...

research-article

PrivateClean: Data Cleaning and Differential Privacy

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, Pages 937–951, https://doi.org/10.1145/2882903.2915248

Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the ...

short-paper

Public Access

SparkR: Scaling R Programs with Spark

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, Pages 1099–1104, https://doi.org/10.1145/2882903.2903740

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data ...

research-article

Public Access

Spark SQL: Relational Data Processing in Spark

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Pages 1383–1394, https://doi.org/10.1145/2723372.2742797

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. ...

research-article

Open Access

Rethinking Data-Intensive Science Using Scalable Analytics Systems

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Pages 631–646, https://doi.org/10.1145/2723372.2742787

"Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the ...

research-article

Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Pages 1327–1342, https://doi.org/10.1145/2723372.2737784

The rise of data-intensive "Web 2.0" Internet services has led to a range of popular new programming frameworks that collectively embody the latest incarnation of the vision of Object-Relational Mapping (ORM) systems, albeit at unprecedented scale. In ...

research-article

Fine-grained partitioning for aggressive data skipping

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Pages 1115–1126, https://doi.org/10.1145/2588555.2610515

Modern query engines are increasingly being required to process enormous datasets in near real-time. While much can be done to speed up the data access, a promising technique is to reduce the need to access data through data skipping. By maintaining ...

research-article

A sample-and-clean framework for fast and accurate query processing on dirty data

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Pages 469–480, https://doi.org/10.1145/2588555.2610505

In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult due to the challenges of processing and cleaning large, dirty data sets. To increase the speed of query processing, there has been a resurgence of ...

panel

Should we all be teaching "intro to data science" instead of "intro to databases"?

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Pages 917–918, https://doi.org/10.1145/2588555.2600092

The Database Community has a unique perspective on the challenges and solutions of long-term management of data and the value of data as a resource. In current computer science curricula, however, these insights are typically locked up in the context of ...

research-article

PLANET: making progress with commit processing in unpredictable environments

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Pages 3–14, https://doi.org/10.1145/2588555.2588558

Latency unpredictability in a database system can come from many factors, such as load spikes in the workload, inter-query interactions from consolidation, or communication costs in cloud computing or geo-replication. High variance and high latency ...

research-article

Generalized scale independence through incremental precomputation

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 625–636, https://doi.org/10.1145/2463676.2465333

Developers of rapidly growing applications must be able to anticipate potential scalability problems before they cause performance issues in production environments. A new type of data independence, called scale independence, seeks to address this ...

research-article

RTP: robust tenant placement for elastic in-memory database clusters

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 773–784, https://doi.org/10.1145/2463676.2465302

In the cloud services industry, a key issue for cloud operators is to minimize operational costs. In this paper, we consider algorithms that elastically contract and expand a cluster of in-memory databases depending on tenants' behavior over time while ...

research-article

Shark: SQL and rich analytics at scale

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 13–24, https://doi.org/10.1145/2463676.2465288

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (...

research-article

Leveraging transitive relations for crowdsourced joins

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 229–240, https://doi.org/10.1145/2463676.2465280

The development of crowdsourced query processing systems has recently attracted significant attention in the database community. A variety of crowdsourced queries have been investigated. In this paper, we focus on the crowdsourced join query which ...

demonstration

PBS at work: advancing data management with consistency metrics

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 1113–1116, https://doi.org/10.1145/2463676.2465260

A large body of recent work has proposed analytical and empirical techniques for quantifying the data consistency properties of distributed data stores. In this demonstration, we begin to explore the wide range of new database functionality they enable, ...

demonstration

Shark: fast data analysis using coarse-grained distributed memory

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Pages 689–692, https://doi.org/10.1145/2213836.2213934

Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing ...

research-article

Hybrid in-database inference for declarative information extraction

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, Pages 517–528, https://doi.org/10.1145/1989323.1989378

In the database community, work on information extraction (IE) has centered on two themes: how to effectively manage IE tasks, and how to manage the uncertainties that arise in the IE process in a scalable manner. Recent work has proposed a ...
