research-article

RHEEMix in the data jungle: a cost-based optimizer for cross-platform systems

Published: 18 May 2020

Abstract

Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements. The optimizer allocates the subtasks of data analytic tasks to the most suitable platforms. Our main contributions are: (i) a mechanism based on graph transformations to explore alternative execution strategies; (ii) a novel graph-based approach to determine efficient data movement plans among subtasks and platforms; and (iii) an efficient plan enumeration algorithm, based on a novel enumeration algebra. We extensively evaluate our optimizer under diverse real tasks. We show that our optimizer can perform tasks more than one order of magnitude faster when using multiple platforms than when using a single platform.
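
The allocation idea summarized in the abstract can be illustrated with a toy sketch: each subtask of a linear pipeline is assigned to a platform so that execution cost plus data movement cost is minimal. Everything here is an assumption made for illustration: the subtask names, the platform names, the cost figures, and the functions `plan_cost` and `best_plan` are hypothetical, and the brute-force enumeration is a stand-in, not the paper's enumeration algebra or its graph-based data movement planning.

```python
from itertools import product

# Hypothetical execution cost (seconds) of each subtask on each platform.
EXEC_COST = {
    "filter": {"JavaStreams": 2.0, "Spark": 5.0, "Postgres": 1.0},
    "join": {"JavaStreams": 40.0, "Spark": 8.0, "Postgres": 6.0},
    "train": {"JavaStreams": 30.0, "Spark": 10.0, "Postgres": 90.0},
}

# Hypothetical flat cost of moving intermediate data across platforms.
MOVE_COST = 3.0


def plan_cost(subtasks, assignment):
    """Total cost of running subtasks[i] on assignment[i], in pipeline order."""
    execution = sum(EXEC_COST[t][p] for t, p in zip(subtasks, assignment))
    # Pay the movement cost whenever two consecutive subtasks switch platform.
    movement = sum(
        MOVE_COST for a, b in zip(assignment, assignment[1:]) if a != b
    )
    return execution + movement


def best_plan(subtasks, platforms):
    """Exhaustively enumerate all assignments and return (cost, assignment)."""
    candidates = product(platforms, repeat=len(subtasks))
    best = min(candidates, key=lambda a: plan_cost(subtasks, a))
    return plan_cost(subtasks, best), best


if __name__ == "__main__":
    subtasks = ["filter", "join", "train"]
    platforms = ["JavaStreams", "Spark", "Postgres"]
    print(best_plan(subtasks, platforms))
```

With these made-up numbers, the mixed plan (filter and join on Postgres, train on Spark) costs 20.0, while the cheapest single-platform plan (everything on Spark) costs 23.0. Exhaustive enumeration grows exponentially with the number of subtasks, which is why a real optimizer needs a more efficient enumeration strategy.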

      Information & Contributors

      Information

      Published In

      The VLDB Journal — The International Journal on Very Large Data Bases  Volume 29, Issue 6
      Nov 2020
      324 pages

      Publisher

      Springer-Verlag

      Berlin, Heidelberg

      Publication History

      Published: 18 May 2020
      Accepted: 12 April 2020
      Revision received: 21 January 2020
      Received: 07 May 2019

      Author Tags

      1. Cross-platform
      2. Polystore
      3. Query optimization
      4. Data processing

      Qualifiers

      • Research-article

      Funding Sources

      • Technische Universität Berlin (3136)

      Citations

      Cited By

      • (2024) Generating Cross-model Analytics Workloads Using LLMs. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 4303-4307. DOI: 10.1145/3627673.3679932. Online publication date: 21-Oct-2024
      • (2024) CLIC: An Extensible and Efficient Cross-Platform Data Analytics System. IEEE Transactions on Parallel and Distributed Systems 35:1, pp. 34-45. DOI: 10.1109/TPDS.2023.3298038. Online publication date: 1-Jan-2024
      • (2023) Apache Wayang: A Unified Data Analytics Framework. ACM SIGMOD Record 52:3, pp. 30-35. DOI: 10.1145/3631504.3631510. Online publication date: 2-Nov-2023
      • (2022) Unified data analytics. Proceedings of the VLDB Endowment 15:12, pp. 3778-3781. DOI: 10.14778/3554821.3554898. Online publication date: 1-Aug-2022
      • (2022) Polyglot data management. Proceedings of the VLDB Endowment 15:12, pp. 3750-3753. DOI: 10.14778/3554821.3554891. Online publication date: 1-Aug-2022
      • (2021) Expand your Training Limits! Generating Training Data for ML-based Data Management. Proceedings of the 2021 International Conference on Management of Data, pp. 1865-1878. DOI: 10.1145/3448016.3457286. Online publication date: 9-Jun-2021
      • (2021) Parallel query processing in a polystore. Distributed and Parallel Databases 39:4, pp. 939-977. DOI: 10.1007/s10619-021-07322-5. Online publication date: 3-Feb-2021
