RHEEMix in the data jungle: a cost-based optimizer for cross-platform systems

Authors:

Sebastian Kruse,

Bertty Contreras-Rojas,

Jorge-Arnulfo Quiané-Ruiz

The VLDB Journal, Volume 29, Issue 6

Pages 1287 - 1310

https://doi.org/10.1007/s00778-020-00612-x

Published: 18 May 2020

Abstract

Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements. The optimizer allocates the subtasks of data analytic tasks to the most suitable platforms. Our main contributions are: (i) a mechanism based on graph transformations to explore alternative execution strategies; (ii) a novel graph-based approach to determine efficient data movement plans among subtasks and platforms; and (iii) an efficient plan enumeration algorithm, based on a novel enumeration algebra. We extensively evaluate our optimizer under diverse real tasks. We show that our optimizer can perform tasks more than one order of magnitude faster when using multiple platforms than when using a single platform.

Index Terms

RHEEMix in the data jungle: a cost-based optimizer for cross-platform systems
1. Information systems
  1. Data management systems
2. Software and its engineering

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

The VLDB Journal — The International Journal on Very Large Data Bases Volume 29, Issue 6

Nov 2020

324 pages

ISSN:1066-8888

© The Author(s) 2020.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 18 May 2020

Accepted: 12 April 2020

Revision received: 21 January 2020

Received: 07 May 2019

Qualifiers

Research-article

Funding Sources

Technische Universität Berlin (3136)

Bibliometrics & Citations

Bibliometrics

Total Citations: 7
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 09 Nov 2024

Citations

Cited By

Zheng X, Kumar A, Gupta A, Serra E, Spezzano F (2024) Generating Cross-model Analytics Workloads Using LLMs. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 4303-4307. Online publication date: 21-Oct-2024
https://dl.acm.org/doi/10.1145/3627673.3679932
Chen Q, Chen Z, Zhang K, Wang X (2024) CLIC: An Extensible and Efficient Cross-Platform Data Analytics System. IEEE Transactions on Parallel and Distributed Systems 35:1, pp. 34-45. Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TPDS.2023.3298038
Beedkar K, Contreras-Rojas B, Gavriilidis H, Kaoudi Z, Markl V, Pardo-Meza R, Quiané-Ruiz J (2023) Apache Wayang: A Unified Data Analytics Framework. ACM SIGMOD Record 52:3, pp. 30-35. Online publication date: 2-Nov-2023
https://dl.acm.org/doi/10.1145/3631504.3631510
Kaoudi Z, Quiané-Ruiz J (2022) Unified data analytics. Proceedings of the VLDB Endowment 15:12, pp. 3778-3781. Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.14778/3554821.3554898
Kiehn F, Schmidt M, Glake D, Panse F, Wingerath W, Wollmer B, Poppinga M, Ritter N (2022) Polyglot data management. Proceedings of the VLDB Endowment 15:12, pp. 3750-3753. Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.14778/3554821.3554891
Ventura F, Kaoudi Z, Quiané-Ruiz J, Markl V, Li G, Li Z, Idreos S, Srivastava D (2021) Expand your Training Limits! Generating Training Data for ML-based Data Management. Proceedings of the 2021 International Conference on Management of Data, pp. 1865-1878. Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3457286
Kranas P, Kolev B, Levchenko O, Pacitti E, Valduriez P, Jiménez-Peris R, Patiño-Martinez M (2021) Parallel query processing in a polystore. Distributed and Parallel Databases 39:4, pp. 939-977. Online publication date: 3-Feb-2021
https://dl.acm.org/doi/10.1007/s10619-021-07322-5
