- short-paper, May 2020
Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Pages 285–295, https://doi.org/10.1145/3318464.3389785
Today, data analysts largely rely on intuition to determine whether missing or withheld rows of a dataset significantly affect their analyses. We propose a framework that can produce automatic contingency analysis, i.e., the range of values an aggregate ...
- research-article, May 2020
Debunking Four Long-Standing Misconceptions of Time-Series Distance Measures
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Pages 1887–1905, https://doi.org/10.1145/3318464.3389760
Distance measures are core building blocks in time-series analysis and the subject of active research for decades. Unfortunately, the most detailed experimental study in this area is outdated (over a decade old) and, naturally, does not reflect recent ...
- research-article, May 2020
Thrifty Query Execution via Incrementability
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Pages 1241–1256, https://doi.org/10.1145/3318464.3389756
Many applications schedule queries before all data is ready. To return fast query results, database systems can eagerly process existing data and incrementally incorporate new data into prior intermediate results, which often relies on incremental view ...
- research-article, June 2016
ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, Pages 2117–2120, https://doi.org/10.1145/2882903.2899409
Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, ...
- research-article, June 2016
PrivateClean: Data Cleaning and Differential Privacy
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, Pages 937–951, https://doi.org/10.1145/2882903.2915248
Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the ...
- short-paper, June 2016
SparkR: Scaling R Programs with Spark
- Shivaram Venkataraman,
- Zongheng Yang,
- Davies Liu,
- Eric Liang,
- Hossein Falaki,
- Xiangrui Meng,
- Reynold Xin,
- Ali Ghodsi,
- Michael Franklin,
- Ion Stoica,
- Matei Zaharia
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, Pages 1099–1104, https://doi.org/10.1145/2882903.2903740
R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data ...
- research-article, May 2015
Spark SQL: Relational Data Processing in Spark
- Michael Armbrust,
- Reynold S. Xin,
- Cheng Lian,
- Yin Huai,
- Davies Liu,
- Joseph K. Bradley,
- Xiangrui Meng,
- Tomer Kaftan,
- Michael J. Franklin,
- Ali Ghodsi,
- Matei Zaharia
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Pages 1383–1394, https://doi.org/10.1145/2723372.2742797
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g. ...
- research-article, May 2015
Rethinking Data-Intensive Science Using Scalable Analytics Systems
- Frank Austin Nothaft,
- Matt Massie,
- Timothy Danford,
- Zhao Zhang,
- Uri Laserson,
- Carl Yeksigian,
- Jey Kottalam,
- Arun Ahuja,
- Jeff Hammerbacher,
- Michael Linderman,
- Michael J. Franklin,
- Anthony D. Joseph,
- David A. Patterson
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Pages 631–646, https://doi.org/10.1145/2723372.2742787
"Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the ...
- research-article, May 2015
Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Pages 1327–1342, https://doi.org/10.1145/2723372.2737784
The rise of data-intensive "Web 2.0" Internet services has led to a range of popular new programming frameworks that collectively embody the latest incarnation of the vision of Object-Relational Mapping (ORM) systems, albeit at unprecedented scale. In ...
- research-article, June 2014
Fine-grained partitioning for aggressive data skipping
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Pages 1115–1126, https://doi.org/10.1145/2588555.2610515
Modern query engines are increasingly being required to process enormous datasets in near real-time. While much can be done to speed up the data access, a promising technique is to reduce the need to access data through data skipping. By maintaining ...
- research-article, June 2014
A sample-and-clean framework for fast and accurate query processing on dirty data
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Pages 469–480, https://doi.org/10.1145/2588555.2610505
In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult due to the challenges of processing and cleaning large, dirty data sets. To increase the speed of query processing, there has been a resurgence of ...
- panel, June 2014
Should we all be teaching "intro to data science" instead of "intro to databases"?
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Pages 917–918, https://doi.org/10.1145/2588555.2600092
The Database Community has a unique perspective on the challenges and solutions of long-term management of data and the value of data as a resource. In current computer science curricula, however, these insights are typically locked up in the context of ...
- research-article, June 2014
PLANET: making progress with commit processing in unpredictable environments
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Pages 3–14, https://doi.org/10.1145/2588555.2588558
Latency unpredictability in a database system can come from many factors, such as load spikes in the workload, inter-query interactions from consolidation, or communication costs in cloud computing or geo-replication. High variance and high latency ...
- research-article, June 2013
Generalized scale independence through incremental precomputation
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 625–636, https://doi.org/10.1145/2463676.2465333
Developers of rapidly growing applications must be able to anticipate potential scalability problems before they cause performance issues in production environments. A new type of data independence, called scale independence, seeks to address this ...
- research-article, June 2013
RTP: robust tenant placement for elastic in-memory database clusters
- Jan Schaffner,
- Tim Januschowski,
- Megan Kercher,
- Tim Kraska,
- Hasso Plattner,
- Michael J. Franklin,
- Dean Jacobs
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 773–784, https://doi.org/10.1145/2463676.2465302
In the cloud services industry, a key issue for cloud operators is to minimize operational costs. In this paper, we consider algorithms that elastically contract and expand a cluster of in-memory databases depending on tenants' behavior over time while ...
- research-article, June 2013
Shark: SQL and rich analytics at scale
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 13–24, https://doi.org/10.1145/2463676.2465288
Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (...
- research-article, June 2013
Leveraging transitive relations for crowdsourced joins
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 229–240, https://doi.org/10.1145/2463676.2465280
The development of crowdsourced query processing systems has recently attracted significant attention in the database community. A variety of crowdsourced queries have been investigated. In this paper, we focus on the crowdsourced join query which ...
- demonstration, June 2013
PBS at work: advancing data management with consistency metrics
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, Pages 1113–1116, https://doi.org/10.1145/2463676.2465260
A large body of recent work has proposed analytical and empirical techniques for quantifying the data consistency properties of distributed data stores. In this demonstration, we begin to explore the wide range of new database functionality they enable, ...
- demonstration, May 2012
Shark: fast data analysis using coarse-grained distributed memory
SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Pages 689–692, https://doi.org/10.1145/2213836.2213934
Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing ...
- research-article, June 2011
Hybrid in-database inference for declarative information extraction
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, Pages 517–528, https://doi.org/10.1145/1989323.1989378
In the database community, work on information extraction (IE) has centered on two themes: how to effectively manage IE tasks, and how to manage the uncertainties that arise in the IE process in a scalable manner. Recent work has proposed a ...