DOI: 10.1145/3319647.3325854
Poster

Big data skipping in the cloud

Published: 22 May 2019

Abstract

According to today's best practices, cloud compute and storage services should be deployed and managed independently. However, this creates a problem for big data analytics in the cloud: potentially huge datasets need to be shipped from the storage service to the compute service in order to analyse the data. Minimizing the amount of data sent across the network is therefore critical to achieving good performance and low cost. Data skipping is a technique which achieves this for SQL-style analytics on structured data.
Data skipping stores summary metadata for each object (or file) in a dataset. For each column in the object, the summary might include minimum and maximum values, a list or Bloom filter of the appearing values, or other metadata which succinctly represents the data in that column. This metadata can then be indexed to support efficient retrieval, although since it can be orders of magnitude smaller than the data itself, this step may not be essential. During query evaluation, the metadata is used to skip over objects which have no relevant data. False positives for object relevance are acceptable, since the query execution engine will ultimately filter the data at the row level. However, false negatives must be avoided to ensure correctness of query results.
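The min/max variant of this scheme can be sketched in a few lines of Python. This is an illustration only, not the paper's implementation: the object layout, column name, and helper functions are all hypothetical. It shows why false positives are harmless (the engine re-filters rows) while the skipping check itself must be conservative to avoid false negatives.

```python
def build_metadata(obj_rows, column):
    """Summarize one object's column as min/max metadata."""
    values = [row[column] for row in obj_rows]
    return {"min": min(values), "max": max(values)}

def may_contain(meta, op, literal):
    """Conservative relevance check: returns True unless the metadata
    proves the object cannot contain matching rows. False positives
    are allowed; false negatives never occur."""
    if op == ">":
        return meta["max"] > literal
    if op == "<":
        return meta["min"] < literal
    if op == "==":
        return meta["min"] <= literal <= meta["max"]
    return True  # unknown predicate: never skip, to preserve correctness

# Three hypothetical objects of a dataset, each with a "temp" column.
objects = {
    "obj1": [{"temp": 3}, {"temp": 7}],
    "obj2": [{"temp": 12}, {"temp": 19}],
    "obj3": [{"temp": 21}, {"temp": 30}],
}
index = {name: build_metadata(rows, "temp") for name, rows in objects.items()}

# Query: SELECT ... WHERE temp > 20 -- only obj3 needs to be read.
candidates = [name for name, meta in index.items()
              if may_contain(meta, ">", 20)]
```

Here only `obj3` survives the check, so the other two objects are never fetched from storage; the query engine then filters `obj3` at the row level as usual.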
Unlike fully inverted database indexes, data skipping indexes are much smaller than the data itself. This property is critical in the cloud, since otherwise a full index scan could increase the amount of data sent across the network instead of reducing it. In the context of database systems, data skipping is used as an additional technique which complements classical indexes. It is referred to as synopsis in DB2 [6] and zone maps in Oracle [9], where in both cases it is limited to min/max metadata. Data skipping, and the associated topic of data layout, have been addressed in recent research papers [7, 8] and are also used in cloud analytics platforms [3, 4]. Data skipping can also be built into specific data formats [1].
We implemented data skipping support for Apache Spark SQL [2] without changing core Spark, in the form of an add-on Scala library which can be added to the classpath and used in Spark applications. Our work applies to storage systems which implement the Hadoop FileSystem API, including various object storage systems as well as HDFS. Metadata is stored in Elasticsearch (ES) [5], and additional metadata stores can be supported in the future via a pluggable API. Our approach prunes the list of candidate objects for any given Spark SQL query according to the associated data skipping metadata, stored and indexed in ES. Our technique applies to all natively supported Spark formats, e.g. JSON, CSV, Avro, Parquet, and ORC, and can benefit from the latest optimizations built into those formats in Spark. Unlike approaches which embed data skipping metadata inside the data format itself [1], and therefore require reading at least part of the object, our approach avoids touching irrelevant objects altogether.
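The pruning step described above can be illustrated with a hedged Python sketch: a row-level predicate is translated into an Elasticsearch-style range filter over indexed min/max metadata, so only metadata entries for possibly-relevant objects are retrieved. The query translation, the index layout, and the tiny in-memory evaluator standing in for a real Elasticsearch call are all assumptions for illustration; the abstract does not specify the library's actual API.

```python
def predicate_to_metadata_query(column, op, literal):
    """Translate a row-level predicate into a hypothetical
    Elasticsearch-style range filter over min/max metadata fields.
    An object can only satisfy `column > x` if its max exceeds x,
    and `column < x` only if its min is below x."""
    if op == ">":
        return {"range": {column + ".max": {"gt": literal}}}
    if op == "<":
        return {"range": {column + ".min": {"lt": literal}}}
    raise NotImplementedError(op)

def matches(doc, query):
    """Tiny in-memory evaluator for the range filter, standing in
    for the actual metadata-store query."""
    (field, cond), = query["range"].items()
    col, bound = field.rsplit(".", 1)
    value = doc[col][bound]
    if "gt" in cond:
        return value > cond["gt"]
    return value < cond["lt"]

# Metadata index: one document per object, min/max per column.
index = [
    {"path": "part-0001.parquet", "temp": {"min": 3, "max": 7}},
    {"path": "part-0002.parquet", "temp": {"min": 12, "max": 19}},
    {"path": "part-0003.parquet", "temp": {"min": 21, "max": 30}},
]

q = predicate_to_metadata_query("temp", ">", 20)
relevant = [doc["path"] for doc in index if matches(doc, q)]
# `relevant` is the pruned candidate list handed to the query engine;
# irrelevant objects are never touched at all.
```

Because the pruning happens against the metadata store rather than inside the data files, irrelevant objects are never opened, which is the key contrast with format-embedded skipping metadata.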

References

[1]
2019. Apache Parquet. https://parquet.apache.org/
[2]
2019. Apache Spark. https://spark.apache.org/
[3]
2019. Data Skipping for IBM Cloud SQL Query. https://www.ibm.com/blogs/bluemix/2019/03/data-skipping-for-ibm-cloud-sql-query/
[4]
2019. Databricks Delta Guide. https://docs.databricks.com/delta/optimizations.html#delta-data-skipping
[5]
2019. Elasticsearch. https://www.elastic.co/products/elasticsearch
[6]
Vijayshankar Raman et al. 2013. DB2 with BLU acceleration: So much more than just a column store. Proceedings of the VLDB Endowment 6, 11 (2013), 1080--1091.
[7]
Anil Shanbhag, Alekh Jindal, Samuel Madden, Jorge Quiane, and Aaron J Elmore. 2017. A robust partitioning scheme for ad-hoc query workloads. In Proceedings of the 2017 Symposium on Cloud Computing. ACM.
[8]
Liwen Sun, Michael J Franklin, Sanjay Krishnan, and Reynold S Xin. 2014. Fine-grained partitioning for aggressive data skipping. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM.
[9]
Mohamed Ziauddin, Andrew Witkowski, You Jung Kim, Dmitry Potapov, Janaki Lahorani, and Murali Krishna. 2017. Dimensions based data clustering and zone maps. Proceedings of the VLDB Endowment 10, 12 (2017), 1622--1633.

Published In

SYSTOR '19: Proceedings of the 12th ACM International Conference on Systems and Storage
May 2019
211 pages
ISBN:9781450367493
DOI:10.1145/3319647
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


In-Cooperation

  • USENIX Association

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate 108 of 323 submissions, 33%
