Instance-Optimized Data Layouts for Cloud Analytics Workloads

Authors: Jialin Ding, Umar Farooq Minhas, Badrish Chandramouli, Ying Li, Tim Kraska

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

Pages 418 - 431

https://doi.org/10.1145/3448016.3457270

Published: 18 June 2021

Abstract

In this paper, we propose MTO, an instance-optimized data layout framework that determines the blocking strategy for all tables in a multi-table database in the presence of joins, such as in a star or snowflake schema common in real-world workloads. MTO takes advantage of sideways information passing through joins to jointly optimize the layout for all tables, which results in better block skipping and hence reduced query execution times. Experiments on a commercial cloud-based analytics service show that MTO achieves up to 93% reduction in blocks accessed and 75% reduction in end-to-end query times compared to alternative blocking strategies.

Supplementary Material

MP4 File (3448016.3457270.mp4)

Today, businesses rely on efficiently running analytics on large amounts of operational and historical data to gain business insights and competitive advantage. Increasingly, such analytics are run using cloud-based data analytics services, such as Google BigQuery, Microsoft Azure Synapse, Amazon Redshift, and Snowflake. These services persist and process data in compressed, columnar formats, stored in large blocks, each of which contains thousands or millions of records. For these services, disk I/O from (remote) cloud storage is often one of the dominant costs for query processing. To reduce the amount of I/O, services often maintain per-block metadata, such as zone maps, which are used to skip blocks that are irrelevant to the query, leading to lower query execution times. However, the effectiveness of block skipping via zone maps is dependent on how the records are assigned to blocks. Recent work on instance-optimized data layouts aims to maximize block skipping by specializing the block assignment strategy to a specific dataset and workload. However, these existing approaches only optimize the layout for a single table.

In this paper, we propose MTO, an instance-optimized data layout framework that determines the blocking strategy for all tables in a multi-table database in the presence of joins, such as in a star or snowflake schema common in real-world workloads. We show that MTO can achieve better block skipping and hence reduced query execution times by taking advantage of sideways information passing through joins to jointly optimize the layout for all tables. Experiments on a commercial cloud data warehouse show that MTO achieves up to 93% reduction in blocks accessed and 75% reduction in end-to-end query times compared to alternative blocking strategies.
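To make the dependence of block skipping on record-to-block assignment concrete, the following is a minimal Python sketch of zone maps over a single integer column. It is an illustration only and not MTO or any service's implementation: the names `ZoneMap`, `build_blocks`, `build_zone_maps`, and `blocks_to_read`, the block size of 100, and the toy data are all assumptions made for this example. A zone map here is just the minimum and maximum value in a block, and a block can be skipped when that range does not overlap the query's range predicate.

```python
import random
from dataclasses import dataclass


@dataclass
class ZoneMap:
    """Per-block metadata (hypothetical): min and max of one column."""
    lo: int
    hi: int


def build_blocks(values, block_size):
    """Assign records to fixed-size blocks in the order they are given."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def build_zone_maps(blocks):
    """Compute one zone map per block."""
    return [ZoneMap(min(block), max(block)) for block in blocks]


def blocks_to_read(zone_maps, lo, hi):
    """Return indices of blocks whose [min, max] overlaps the range [lo, hi].

    Every other block is skipped without being read.
    """
    return [i for i, z in enumerate(zone_maps) if z.hi >= lo and z.lo <= hi]


# Toy column with 1000 distinct values, stored under two different layouts.
values = list(range(1000))

random.seed(0)
shuffled = values[:]
random.shuffle(shuffled)

sorted_maps = build_zone_maps(build_blocks(values, 100))
shuffled_maps = build_zone_maps(build_blocks(shuffled, 100))

# Range predicate: 250 <= value <= 260.
sorted_reads = blocks_to_read(sorted_maps, 250, 260)
shuffled_reads = blocks_to_read(shuffled_maps, 250, 260)

print("sorted layout reads blocks:", sorted_reads)
print("shuffled layout reads", len(shuffled_reads), "of", len(shuffled_maps), "blocks")
```

With the sorted layout, only the one block covering values 200 to 299 overlaps the predicate. With the shuffled layout, each block's min/max range tends to span most of the domain, so few or no blocks can be skipped. The data and the query are identical in both cases; only the assignment of records to blocks differs, which is the degree of freedom that instance-optimized layouts tune.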
