research-article

Skipping-oriented partitioning for columnar layouts

Authors:

Michael J. Franklin,

Eugene Wu

Proceedings of the VLDB Endowment, Volume 10, Issue 4

Pages 421 - 432

https://doi.org/10.14778/3025111.3025123

Published: 01 November 2016

Abstract

As data volumes continue to grow, modern database systems increasingly rely on data skipping mechanisms to improve performance by avoiding access to irrelevant data. Recent work [39] proposed a fine-grained partitioning scheme that was shown to improve the opportunities for data skipping in row-oriented systems. Modern analytics and big data systems increasingly adopt columnar storage schemes, and in such systems, a row-based approach misses important opportunities for further improving data skipping. The flexibility of column-oriented organizations, however, comes with the additional cost of tuple reconstruction. In this paper, we develop Generalized Skipping-Oriented Partitioning (GSOP), a novel hybrid data skipping framework that takes into account these row-based and column-based tradeoffs. In contrast to previous column-oriented physical design work, GSOP considers the tradeoffs between horizontal data skipping and vertical partitioning jointly. Our experiments using two public benchmarks and a real-world workload show that GSOP can significantly reduce the amount of data scanned and improve end-to-end query response times over the state-of-the-art techniques.
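To make the notion of data skipping concrete, the following is a minimal sketch of block-level skipping with per-block min/max metadata over a column-oriented block. It is not GSOP and does not reproduce the paper's partitioning algorithm; the names `Block`, `build_block`, and `scan_with_skipping`, as well as the column names and values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Block:
    """A horizontal partition stored column by column (hypothetical layout)."""
    columns: Dict[str, List[int]]
    stats: Dict[str, Tuple[int, int]]  # per-column (min, max) metadata


def build_block(columns: Dict[str, List[int]]) -> Block:
    # Compute min/max once, as a loader would when writing the block.
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return Block(columns=columns, stats=stats)


def scan_with_skipping(
    blocks: List[Block], col: str, lo: int, hi: int
) -> Tuple[List[int], int]:
    """Return values of `col` in [lo, hi] and the number of blocks read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        bmin, bmax = block.stats[col]
        if bmax < lo or bmin > hi:
            continue  # metadata shows no value can match, so skip the block
        blocks_read += 1
        matches.extend(v for v in block.columns[col] if lo <= v <= hi)
    return matches, blocks_read


# Illustrative data: three blocks, two columns each.
blocks = [
    build_block({"price": [1, 3, 5], "qty": [10, 20, 30]}),
    build_block({"price": [50, 60, 70], "qty": [1, 2, 3]}),
    build_block({"price": [4, 55, 90], "qty": [7, 8, 9]}),
]
matches, blocks_read = scan_with_skipping(blocks, "price", 50, 60)
print(matches, blocks_read)
```

In this toy data the first block is skipped because its `price` range lies entirely below the predicate, while the third block must be read even though only one of its values qualifies. How tuples are grouped into blocks therefore determines how much can be skipped, which is the layout question the abstract is concerned with.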

References

[1]

Apache Drill. https://drill.apache.org.

[2]

Apache Parquet. http://parquet.apache.org.

[3]

Big Data Benchmark. amplab.cs.berkeley.edu/benchmark.

[4]

CasJobs. http://skyserver.sdss.org/casjobs/.

[5]

Sloan Digital Sky Surveys. http://www.sdss.org.

[6]

TPC-H. http://www.tpc.org/tpch.

[7]

A. Ailamaki et al. Data page layouts for relational databases on deep memory hierarchies. VLDB Journal, 11(3):198--215, Nov. 2002.

[8]

A. Gupta et al. Amazon Redshift and the case for simpler data warehouses. In SIGMOD, pages 1917--1923, 2015.

[9]

A. Hall et al. Processing a trillion cells per mouse click. PVLDB, 5(11):1436--1446, 2012.

[10]

A. Jindal et al. Trojan data layouts: Right shoes for a running elephant. In SOCC, pages 21:1--21:14, New York, NY, USA, 2011.

[11]

A. Lamb et al. The Vertica analytic database: C-Store 7 years later. PVLDB, 5(12):1790--1801, 2012.

[12]

D. Abadi, D. Myers, D. DeWitt, and S. Madden. Materialization strategies in a column-oriented DBMS. In ICDE, pages 466--475, April 2007.

[13]

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994.

[14]

I. Alagiannis, S. Idreos, and A. Ailamaki. H2O: A hands-free adaptive store. In SIGMOD, pages 1103--1114, New York, NY, USA, 2014. ACM.

[15]

B. Bhattacharjee et al. Efficient query processing for multi-dimensionally clustered tables in DB2. In VLDB, pages 963--974, 2003.

[16]

B. Dageville et al. The Snowflake elastic data warehouse. In SIGMOD, pages 215--226, 2016.

[17]

P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, pages 225--237, 2005.

[18]

C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3:48--57, 2010.

[19]

D. Abadi et al. Integrating compression and execution in column-oriented database systems. In SIGMOD, pages 671--682, 2006.

[20]

D. Abadi et al. The design and implementation of modern column-oriented database systems. Foundations and Trends in Databases, 5(3), 2013.

[21]

D. Ślȩzak et al. Brighthouse: An analytic data warehouse for ad-hoc queries. PVLDB, 1(2):1337--1345, 2008.

[22]

A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-oriented storage techniques for MapReduce. PVLDB, 4(7), 2011.

[23]

S. Idreos, M. L. Kersten, and S. Manegold. Database cracking. In CIDR, pages 68--78, 2007.

[24]

S. Idreos, M. L. Kersten, and S. Manegold. Self-organizing tuple reconstruction in column-stores. In SIGMOD, pages 297--308, 2009.

[25]

A. Jindal, E. Palatinus, V. Pavlov, and J. Dittrich. A comparison of knives for bread slicing. PVLDB, 6(6):361--372, 2013.

[26]

M. Armbrust et al. Spark SQL: relational data processing in Spark. In SIGMOD, pages 1383--1394, 2015.

[27]

M. Grund et al. Hyrise: A main memory hybrid storage engine. PVLDB, 4(2):105--116, Nov. 2010.

[28]

M. Stonebraker et al. C-Store: A column-oriented DBMS. In VLDB, pages 553--564, 2005.

[29]

M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 2--2, 2012.

[30]

M. Zukowski et al. DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing. In DaMoN, pages 47--54, 2008.

[31]

G. Moerkotte. Small materialized aggregates: A light weight index for data warehousing. In VLDB, pages 476--487, 1998.

[32]

R. Hankins et al. Data morphing: An adaptive, cache-conscious storage technique. In VLDB, pages 417--428. VLDB Endowment, 2003.

[33]

J. Rao, C. Zhang, N. Megiddo, and G. Lohman. Automating physical database design in a parallel database. In SIGMOD, pages 558--569, 2002.

[34]

S. Agrawal et al. Automated selection of materialized views and indexes in SQL databases. In VLDB, pages 496--505, 2000.

[35]

S. Agrawal et al. Integrating vertical and horizontal partitioning into automated physical database design. In SIGMOD, pages 359--370, 2004.

[36]

S. Melnik et al. Dremel: interactive analysis of web-scale datasets. PVLDB, 3(1--2):330--339, 2010.

[37]

S. Papadomanolakis et al. AutoPart: Automating schema design for large scientific databases using data partitioning. In SSDBM, pages 383--392, 2004.

[38]

F. M. Schuhknecht, A. Jindal, and J. Dittrich. The uncracked pieces in database cracking. PVLDB, 7(2):97--108, Oct. 2013.

[39]

L. Sun, M. J. Franklin, S. Krishnan, and R. S. Xin. Fine-grained partitioning for aggressive data skipping. In SIGMOD, pages 1115--1126, 2014.

[40]

V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11):1080--1091, 2013.

[41]

Y. He et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In ICDE, pages 1199--1208, 2011.

[42]

Y. Huai et al. Understanding insights into the basic structure and essential issues of table placement methods in clusters. PVLDB, 6(14), 2013.

[43]

Y. Huai et al. Major technical advancements in Apache Hive. In SIGMOD, pages 1235--1246, 2014.

[44]

Z. Liu et al. JSON data management: Supporting schema-less development in RDBMS. In SIGMOD, pages 1247--1258, New York, NY, USA, 2014.

[45]

J. Zhou, N. Bruno, and W. Lin. Advanced partitioning techniques for massively distributed computation. In SIGMOD, pages 13--24, 2012.

[46]

J. Zhou and K. Ross. A multi-resolution block storage model for database design. In IDEAS, pages 22--31, July 2003.

Published In

Proceedings of the VLDB Endowment, Volume 10, Issue 4

November 2016

180 pages

ISSN: 2150-8097

Editors: Peter Boncz (CWI), Ken Salem (University of Waterloo)

Publisher

VLDB Endowment

Cited By

P. Hansert and S. Michel. Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines. Proceedings of the VLDB Endowment, 17(11):3456--3469, 2024. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.14778/3681954.3682013
Y. Lv, K. Zhang, Z. Wang, X. Zhang, R. Lee, Z. He, Y. Jing, and X. Wang. RTScan: Efficient Scan with Ray Tracing Cores. Proceedings of the VLDB Endowment, 17(6):1460--1472, 2024. Online publication date: 3-May-2024.
https://doi.org/10.14778/3648160.3648183
J. Wei, G. Zhang, J. Chen, Y. Wang, W. Zheng, T. Sun, J. Wu, and J. Jiang. Exploiting Data-pattern-aware Vertical Partitioning to Achieve Fast and Low-cost Cloud Log Storage. ACM Transactions on Storage, 20(2):1--35, 2024. Online publication date: 19-Feb-2024.
https://dl.acm.org/doi/10.1145/3643641
P. Liu, C. Li, and H. Chen. Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning Techniques. Journal of Computer Science and Technology, 39(2):346--368, 2024. Online publication date: 1-Mar-2024.
https://dl.acm.org/doi/10.1007/s11390-024-3538-1
A. Huynh, H. Chaudhari, E. Terzi, and M. Athanassoulis. Towards flexibility and robustness of LSM trees. The VLDB Journal, 33(4):1105--1128, 2024. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.1007/s00778-023-00826-9
Y. Tong, J. Liu, H. Wang, K. Zhou, R. He, Q. Zhang, and C. Wang. Sieve: A Learned Data-Skipping Index for Data Analytics. Proceedings of the VLDB Endowment, 16(11):3214--3226, 2023. Online publication date: 24-Aug-2023.
https://dl.acm.org/doi/10.14778/3611479.3611520
S. Sudhir, W. Tao, N. Laptev, C. Habis, M. Cafarella, and S. Madden. Pando: Enhanced Data Skipping with Logical Data Partitioning. Proceedings of the VLDB Endowment, 16(9):2316--2329, 2023. Online publication date: 1-May-2023.
https://dl.acm.org/doi/10.14778/3598581.3598601
P. Sioulas, I. Mytilinis, and A. Ailamaki. SH2O: Efficient Data Access for Work-Sharing Databases. Proceedings of the ACM on Management of Data, 1(3):1--26, 2023. Online publication date: 13-Nov-2023.
https://dl.acm.org/doi/10.1145/3617340
J. Wei, G. Zhang, J. Chen, Y. Wang, W. Zheng, T. Sun, J. Wu, and J. Jiang (A. Fedorova, D. Narayanan, G. Di Luna, and L. Querzoni, eds.). LogGrep: Fast and Cheap Cloud Log Storage by Exploiting both Static and Runtime Patterns. In Proceedings of the Eighteenth European Conference on Computer Systems, pages 452--468, 2023. Online publication date: 8-May-2023.
https://dl.acm.org/doi/10.1145/3552326.3567484
X. Xie, S. Shi, H. Wang, and M. Li. SAT: sampling acceleration tree for adaptive database repartition. World Wide Web, 26(5):3503--3533, 2023. Online publication date: 3-Aug-2023.
https://dl.acm.org/doi/10.1007/s11280-023-01199-3
