DOI: 10.1145/3319647.3325854
Poster

Big data skipping in the cloud

Published: 22 May 2019

Abstract

According to today's best practices, cloud compute and storage services should be deployed and managed independently. However, this creates a problem for big data analytics in the cloud: potentially huge datasets need to be shipped from the storage service to the compute service in order to analyse the data. Minimizing the amount of data sent across the network is therefore critical to achieving good performance and low cost. Data skipping is a technique which achieves this for SQL-style analytics on structured data.
Data skipping stores summary metadata for each object (or file) in a dataset. For each column in the object, the summary might include minimum and maximum values, a list or Bloom filter of the appearing values, or other metadata which succinctly represents the data in that column. This metadata can then be indexed to support efficient retrieval, although since it can be orders of magnitude smaller than the data itself, this step may not be essential. During query evaluation, the metadata is used to skip over objects which have no relevant data. False positives for object relevance are acceptable, since the query execution engine will ultimately filter the data at the row level. However, false negatives must be avoided to ensure correctness of query results.
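The min/max variant of this scheme can be sketched in a few lines of Python. This is an illustration only, not the paper's implementation: the object layout, column name, and helper functions are all hypothetical. It shows why false positives are harmless (the engine re-filters rows) while the skipping check itself must be conservative to avoid false negatives.

```python
def build_metadata(obj_rows, column):
    """Summarize one object's column as min/max metadata."""
    values = [row[column] for row in obj_rows]
    return {"min": min(values), "max": max(values)}

def may_contain(meta, op, literal):
    """Conservative relevance check: returns True unless the metadata
    proves the object cannot contain matching rows. False positives
    are allowed; false negatives never occur."""
    if op == ">":
        return meta["max"] > literal
    if op == "<":
        return meta["min"] < literal
    if op == "==":
        return meta["min"] <= literal <= meta["max"]
    return True  # unknown predicate: never skip, to preserve correctness

# Three hypothetical objects of a dataset, each with a "temp" column.
objects = {
    "obj1": [{"temp": 3}, {"temp": 7}],
    "obj2": [{"temp": 12}, {"temp": 19}],
    "obj3": [{"temp": 21}, {"temp": 30}],
}
index = {name: build_metadata(rows, "temp") for name, rows in objects.items()}

# Query: SELECT ... WHERE temp > 20 -- only obj3 needs to be read.
candidates = [name for name, meta in index.items()
              if may_contain(meta, ">", 20)]
```

Here only `obj3` survives the check, so the other two objects are never fetched from storage; the query engine then filters `obj3` at the row level as usual.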
Unlike fully inverted database indexes, data skipping indexes are much smaller than the data itself. This property is critical in the cloud, since otherwise a full index scan could increase the amount of data sent across the network instead of reducing it. In the context of database systems, data skipping is used as an additional technique which complements classical indexes. It is referred to as synopsis in DB2 [6] and zone maps in Oracle [9], where in both cases it is limited to min/max metadata. Data skipping, and the associated topic of data layout, have been addressed in recent research papers [7, 8] and are also used in cloud analytics platforms [3, 4]. Data skipping can also be built into specific data formats [1].
We implemented data skipping support for Apache Spark SQL [2] without changing core Spark, in the form of an add-on Scala library which can be added to the classpath and used in Spark applications. Our work applies to storage systems which implement the Hadoop FileSystem API, including various object storage systems as well as HDFS. Metadata is stored in Elasticsearch (ES) [5], and additional metadata stores can be supported in the future via a pluggable API. Our approach prunes the list of candidate objects for any given Spark SQL query according to the associated data skipping metadata, stored and indexed in ES. Our technique applies to all natively supported Spark formats, e.g. JSON, CSV, Avro, Parquet, and ORC, and can benefit from the latest optimizations built into those formats in Spark. Unlike approaches which embed data skipping metadata inside the data format itself [1], and therefore require reading at least part of the object, our approach avoids touching irrelevant objects altogether.
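The pruning step described above can be illustrated with a hedged Python sketch: a row-level predicate is translated into an Elasticsearch-style range filter over indexed min/max metadata, so only metadata entries for possibly-relevant objects are retrieved. The query translation, the index layout, and the tiny in-memory evaluator standing in for a real Elasticsearch call are all assumptions for illustration; the abstract does not specify the library's actual API.

```python
def predicate_to_metadata_query(column, op, literal):
    """Translate a row-level predicate into a hypothetical
    Elasticsearch-style range filter over min/max metadata fields.
    An object can only satisfy `column > x` if its max exceeds x,
    and `column < x` only if its min is below x."""
    if op == ">":
        return {"range": {column + ".max": {"gt": literal}}}
    if op == "<":
        return {"range": {column + ".min": {"lt": literal}}}
    raise NotImplementedError(op)

def matches(doc, query):
    """Tiny in-memory evaluator for the range filter, standing in
    for the actual metadata-store query."""
    (field, cond), = query["range"].items()
    col, bound = field.rsplit(".", 1)
    value = doc[col][bound]
    if "gt" in cond:
        return value > cond["gt"]
    return value < cond["lt"]

# Metadata index: one document per object, min/max per column.
index = [
    {"path": "part-0001.parquet", "temp": {"min": 3, "max": 7}},
    {"path": "part-0002.parquet", "temp": {"min": 12, "max": 19}},
    {"path": "part-0003.parquet", "temp": {"min": 21, "max": 30}},
]

q = predicate_to_metadata_query("temp", ">", 20)
relevant = [doc["path"] for doc in index if matches(doc, q)]
# `relevant` is the pruned candidate list handed to the query engine;
# irrelevant objects are never touched at all.
```

Because the pruning happens against the metadata store rather than inside the data files, irrelevant objects are never opened, which is the key contrast with format-embedded skipping metadata.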

References

[1]
2019. Apache Parquet. https://parquet.apache.org/
[2]
2019. Apache Spark. https://spark.apache.org/
[3]
2019. Data Skipping for IBM Cloud SQL Query. https://www.ibm.com/blogs/bluemix/2019/03/data-skipping-for-ibm-cloud-sql-query/
[4]
2019. Databricks Delta Guide. https://docs.databricks.com/delta/optimizations.html#delta-data-skipping
[5]
2019. Elasticsearch. https://www.elastic.co/products/elasticsearch
[6]
Vijayshankar Raman et al. 2013. DB2 with BLU acceleration: So much more than just a column store. Proceedings of the VLDB Endowment 6, 11 (2013), 1080--1091.
[7]
Anil Shanbhag, Alekh Jindal, Samuel Madden, Jorge Quiane, and Aaron J Elmore. 2017. A robust partitioning scheme for ad-hoc query workloads. In Proceedings of the 2017 Symposium on Cloud Computing. ACM.
[8]
Liwen Sun, Michael J Franklin, Sanjay Krishnan, and Reynold S Xin. 2014. Fine-grained partitioning for aggressive data skipping. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM.
[9]
Mohamed Ziauddin, Andrew Witkowski, You Jung Kim, Dmitry Potapov, Janaki Lahorani, and Murali Krishna. 2017. Dimensions based data clustering and zone maps. Proceedings of the VLDB Endowment 10, 12 (2017), 1622--1633.

Published In

SYSTOR '19: Proceedings of the 12th ACM International Conference on Systems and Storage
May 2019
211 pages
ISBN:9781450367493
DOI:10.1145/3319647
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


In-Cooperation

  • USENIX Association

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate 108 of 323 submissions, 33%
