research-article

Anser: Adaptive Information Sharing Framework of AnalyticDB

Published: 01 August 2023

Abstract

The surge in data analytics has fostered burgeoning demand for AnalyticDB on Alibaba Cloud, which has well served thousands of customers from various business sectors. The most notable feature is the diversity of the workloads it handles, including batch processing, real-time data analytics, and unstructured data analytics. To improve the overall performance for such diverse workloads, one of the major challenges is to optimize long-running complex queries without sacrificing the processing efficiency of short-running interactive queries. While existing methods attempt to utilize runtime dynamic statistics for adaptive query processing, they often focus on specific scenarios instead of providing a holistic solution.
To address this challenge, we propose a new framework called Anser, which enhances the design of traditional distributed data warehouses by embedding a new information sharing mechanism. This allows for the efficient management of the production and consumption of various dynamic information across the system. Building on top of Anser, we introduce a novel scheduling policy that optimizes both data and information exchanges within the physical plan, enabling the acceleration of complex analytical queries without sacrificing the performance of short-running interactive queries. We conduct comprehensive experiments over public and in-house workloads to demonstrate the effectiveness and efficiency of our proposed information sharing framework.

Published In

Proceedings of the VLDB Endowment, Volume 16, Issue 12
August 2023
685 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
