DOI: 10.1145/3267809.3267814
research-article

RIOS: Runtime Integrated Optimizer for Spark

Published: 11 October 2018

Abstract

Many Data-Intensive Scalable Computing (DISC) systems do not support sophisticated cost-based query optimizers because they lack the necessary data statistics. Consequently, many crucial optimizations, such as join order and plan selection, are poorly supported in DISC systems. RIOS is a Runtime Integrated Optimizer for Spark that lazily binds to execution plans at runtime, after collecting the statistics needed to make better decisions. We evaluate the efficacy of our approach and show that better plans can be derived at runtime, achieving more than an order-of-magnitude performance improvement over the compile-time plans produced by the Apache Spark rule-based optimizer.
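
To make the idea of deferring plan decisions until statistics are available more concrete, the following is a minimal Python sketch. It is not the RIOS implementation and does not use any Spark API: the `RelationStats` type, the function names, the strategy labels, and the threshold value are all hypothetical, and the smallest-first ordering is a simple stand-in heuristic rather than the paper's algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RelationStats:
    """Statistics observed at runtime for one input (hypothetical shape)."""
    name: str
    rows: int
    size_bytes: int


def choose_join_strategy(
    left: RelationStats,
    right: RelationStats,
    broadcast_threshold_bytes: int,
) -> Tuple[str, Optional[str]]:
    """Pick a physical join once sizes are known.

    Returns (strategy, side_to_broadcast). The strategy names are
    illustrative labels, not identifiers from Spark or RIOS.
    """
    smaller = min(left, right, key=lambda r: r.size_bytes)
    if smaller.size_bytes <= broadcast_threshold_bytes:
        return ("broadcast_hash_join", smaller.name)
    return ("shuffle_hash_join", None)


def greedy_join_order(relations: List[RelationStats]) -> List[str]:
    """Order inputs smallest-first by observed row count.

    Ties are broken by name so the result is deterministic.
    """
    return [r.name for r in sorted(relations, key=lambda r: (r.rows, r.name))]


# Made-up statistics, standing in for values collected at runtime.
observed = [
    RelationStats("lineitem", rows=6_000_000, size_bytes=700_000_000),
    RelationStats("nation", rows=25, size_bytes=2_000),
    RelationStats("orders", rows=1_500_000, size_bytes=170_000_000),
]
TEN_MIB = 10 * 1024 * 1024  # assumed broadcast threshold

print(greedy_join_order(observed))
print(choose_join_strategy(observed[0], observed[1], TEN_MIB))
```

The point of the sketch is only that both decisions take observed sizes as input, so they cannot be made well at compile time when those sizes are unknown.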

Published In

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
October 2018
546 pages
ISBN: 9781450360111
DOI: 10.1145/3267809
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Adaptive Query Optimization
  2. Lazy Planning
  3. Runtime Statistics

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '18: ACM Symposium on Cloud Computing
October 11 - 13, 2018
Carlsbad, CA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 16
  • Downloads (Last 6 weeks): 2

Reflects downloads up to 01 Sep 2024

Cited By

  • (2024) A Learned Cost Model for Big Data Query Processing. Information Sciences. DOI: 10.1016/j.ins.2024.120650 (120650). Online publication date: Apr-2024
  • (2023) RelJoin: Relative-cost-based Selection of Distributed Join Methods for Query Plan Optimization. Information Sciences. DOI: 10.1016/j.ins.2023.120022 (120022). Online publication date: Dec-2023
  • (2023) SPOAHA: Spark Program Optimizer Based on Artificial Hummingbird Algorithm. Knowledge Science, Engineering and Management. DOI: 10.1007/978-3-031-40289-0_26 (317-331). Online publication date: 9-Aug-2023
  • (2022) A Resource-Aware Deep Cost Model for Big Data Query Processing. 2022 IEEE 38th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE53745.2022.00071 (885-897). Online publication date: May-2022
  • (2022) PROADAPT: Proactive framework for adaptive partitioning for big data warehouses. Data & Knowledge Engineering 142. DOI: 10.1016/j.datak.2022.102102 (102102). Online publication date: Nov-2022
  • (2021) SEIZE: Runtime Inspection for Parallel Dataflow Systems. IEEE Transactions on Parallel and Distributed Systems 32:4. DOI: 10.1109/TPDS.2020.3035170 (842-854). Online publication date: 1-Apr-2021
  • (2021) KDDLog: Performance and Scalability in Knowledge Discovery by Declarative Queries with Aggregates. 2021 IEEE 37th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE51399.2021.00113 (1260-1271). Online publication date: Apr-2021
  • (2021) Towards an Adaptive Multidimensional Partitioning for Accelerating Spark SQL. Big Data Analytics and Knowledge Discovery. DOI: 10.1007/978-3-030-86534-4_3 (27-38). Online publication date: 5-Sep-2021
  • (2020) Dynamic speculative optimizations for SQL compilation in Apache Spark. Proceedings of the VLDB Endowment 13:5. DOI: 10.14778/3377369.3377382 (754-767). Online publication date: 19-Feb-2020
  • (2020) SEIZE User Desired Moments: Runtime Inspection for Parallel Dataflow Systems. 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). DOI: 10.1109/ICDCS47774.2020.00147 (1199-1200). Online publication date: Nov-2020
