Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3183713.3196891acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Public Access

Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications

Published: 27 May 2018 Publication History

Abstract

MapReduce is a popular programming paradigm for developing large-scale, data-intensive computation. Many frameworks that implement this paradigm have recently been developed. To leverage these frameworks, however, developers must become familiar with their APIs and rewrite existing code. We present Casper, a new tool that automatically translates sequential Java programs into the MapReduce paradigm. Casper identifies potential code fragments to rewrite and translates them in two steps: (1) Casper uses program synthesis to search for a program summary (i.e., a functional specification) of each code fragment. The summary is expressed using a high-level intermediate language resembling the MapReduce paradigm and verified to be semantically equivalent to the original using a theorem prover. (2) Casper generates executable code from the summary, using either the Hadoop, Spark, or Flink API. We evaluated Casper by automatically converting real-world, sequential Java benchmarks to MapReduce. The resulting benchmarks perform up to 48.2x faster compared to the original.

References

[1]
Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006. Compilers: Principles, Techniques, and Tools (2Nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[2]
Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, and Volker Markl. 2016. Implicit Parallelism Through Deep Language Embedding. SIGMOD Rec. 45, 1 (June 2016), 51--58.
[3]
Saman P. Amarasinghe, Jennifer-Ann M. Anderson, Monica S. Lam, and Chau- Wen Tseng. 1995. An Overview of the SUIF Compiler for Scalable Parallel Machines. In PPSC. 662--667.
[4]
Apache Flink 2018. https://flink.apache.org/. (2018). Accessed on: 2018-04-09.
[5]
Apache Hadoop 2018. http://hadoop.apache.org. (2018). Accessed on: 2018-04-09.
[6]
Apache Hive 2018. http://hive.apache.org. (2018). Accessed on: 2018-04-09.
[7]
Apache Pig 2018. https://pig.apache.org/. (2018). Accessed on: 2018-04-09.
[8]
Apache Spark 2018. https://spark.apache.org. (2018). Accessed on: 2018-04-09.
[9]
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383--1394.
[10]
William Blume, Rudolf Eigenmann, Jay Hoeflinger, David A. Padua, Paul Petersen, Lawrence Rauchwerger, and Peng Tu. 1994. Automatic Detection of Parallelism: A grand challenge for high performance computing. IEEE P&DT 2, 3 (1994), 37--47.
[11]
Rastislav Bodík and Barbara Jobstmann. 2013. Algorithmic program synthesis: introduction. International Journal on Software Tools for Technology Transfer 15 (2013), 397--411.
[12]
Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, and Shirish Tatikonda. 2016. SystemML: Declarative Machine Learning on Spark. Proc. VLDB Endow. 9, 13 (Sept. 2016), 1425--1436.
[13]
Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. 2005. A Non-Local Algorithm for Image Denoising. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05). IEEE Computer Society, Washington, DC, USA, 60--65.
[14]
Yu-Fang Chen, Lei Song, and ZhilinWu. 2016. The Commutativity Problem of the MapReduce Framework: A Transducer-based Approach. CoRR abs/1605.01497 (2016).
[15]
Alvin Cheung, Samuel Madden, Armando Solar-Lezama, Owen Arden, and Andrew C. Myers. 2014. Using Program Analysis to Improve Database Applications. IEEE Data Eng. Bull. 37, 1 (2014), 48--59.
[16]
Alvin Cheung and Armando Solar-Lezama. 2016. Computer-Assisted Query Formulation. Foundations and Trends in Programming Languages 3, 1 (2016), 1--94.
[17]
Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2013. Optimizing Database-backed Applications with Query Synthesis. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 3--14.
[18]
Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Ugur Çetintemel, and Stanley B. Zdonik. 2014. Tupleware: Redefining Modern Analytics. CoRR abs/1406.6667 (2014).
[19]
Przemyslaw Daca, Thomas A. Henzinger, and Andrey Kupriyanov. 2016. Array Folds Logic. CoRR abs/1603.06850 (2016).
[20]
Jerome Darbon, Alexandre Cunha, Tony F. Chan, Stanley Osher, and Grant J. Jensen. 2008. Fast nonlocal filtering applied to electron cryomicroscopy. In ISBI. IEEE, 1331--1334.
[21]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107--113.
[22]
K. Venkatesh Emani, Karthik Ramachandra, Subhro Bhattacharya, and S. Sudarshan. 2016. Extracting Equivalent SQL from Imperative Code in Database Applications. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1781--1796.
[23]
Grigory Fedyukovich, Maaz Bin Safeer Ahmad, and Rastislav Bodik. 2017. Gradual Synthesis for Static Parallelization of Single-pass Array-processing Programs. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, New York, NY, USA, 572--585.
[24]
Fiji: ImageJ 2018. https://github.com/fiji. (2018). Accessed on: 2018-04-09.
[25]
Sumit Gulwani. 2010. Dimensions in Program Synthesis. In Proceedings of the 12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming (PPDP '10). ACM, New York, NY, USA, 13--24.
[26]
C. A. R. Hoare. 1969. An Axiomatic Basis for Computer Programming. Commun. ACM 12, 10 (Oct. 1969), 576--580.
[27]
ImageJ 2018. https://imagej.net/Welcome. (2018). Accessed on: 2018-04-09.
[28]
Shoaib Kamil, Alvin Cheung, Shachar Itzhaky, and Armando Solar-Lezama. 2016. Verified Lifting of Stencil Computations. SIGPLAN Not. 51, 6 (June 2016), 711--726.
[29]
Alfons Kemper, Thomas Neumann, Florian Funke, Viktor Leis, and Henrik Mühe. 2012. HyPer: Adapting Columnar Main-Memory Data Management for Transactional AND Query Processing. IEEE Data Eng. Bull. 35, 1 (2012), 46--51.
[30]
Yannis Klonatos, Christoph Koch, Tiark Rompf, and Hassan Chafi. 2014. Building Efficient Query Engines in a High-level Language. Proc. VLDB Endow. 7, 10 (June 2014), 853--864.
[31]
K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR'10). Springer-Verlag, Berlin, Heidelberg, 348--370.
[32]
MagPie Analysis Repository 2018. https://github.com/thisMagpie/Analysis. (2018). Accessed on: 2018-04-09.
[33]
John Matthews, J. Strother Moore, Sandip Ray, and Daron Vroon. 2006. Verification Condition Generation via Theorem Proving. In Proceedings of the 13th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR'06). Springer-Verlag, Berlin, Heidelberg, 362--376.
[34]
Cedric Nugteren and Henk Corporaal. 2012. Introducing 'Bones': A Parallelizing Source-to-source Compiler Based on Algorithmic Skeletons. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units (GPGPU-5). ACM, New York, NY, USA, 1--10.
[35]
Shoumik Palkar, James J. Thomas, Anil Shanbhag, Deepak Narayanan, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, and Matei Zaharia. 2017. Weld: A Common Runtime for High Performance Data Analytics. (January 2017).
[36]
Spiros Papadimitriou and Jimeng Sun. 2008. DisCo: Distributed Co-clustering with Map-Reduce: A Case Study Towards Petabyte-Scale End-to-End Mining. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE Computer Society, Washington, DC, USA, 512--521.
[37]
Polyglot 2018. http://www.cs.cornell.edu/Projects/polyglot/. (2018). Accessed on: 2018-04-09.
[38]
Cosmin Radoi, Stephen J. Fink, Rodric Rabbah, and Manu Sridharan. 2014. Translating Imperative Code to MapReduce. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages &Applications (OOPSLA '14). ACM, New York, NY, USA, 909--927.
[39]
Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. 2007. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture (HPCA '07). IEEE Computer Society,Washington, DC, USA, 13--24.
[40]
Veselin Raychev, Madanlal Musuvathi, and Todd Mytkowicz. 2015. Parallelizing User-defined Aggregations Using Symbolic Execution. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 153--167.
[41]
P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data (SIGMOD '79). ACM, New York, NY, USA, 23--34.
[42]
Sketch 2018. https://people.csail.mit.edu/asolar/. (2018). Accessed on: 2018-04-09.
[43]
Yannis Smaragdakis and George Balatsouras. 2015. Pointer Analysis. Found. Trends Program. Lang. 2, 1 (April 2015), 1--69.
[44]
Calvin Smith and Aws Albarghouthi. 2016. MapReduce Program Synthesis. SIGPLAN Not. 51, 6 (June 2016), 326--340.
[45]
Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s) Bodik, Rastislav.
[46]
Spark GitHub Repository 2018. https://github.com/apache/spark/tree/master/ examples/src/main/scala/org/apache/spark/examples. (2018). Accessed on: 2018- 01--20.
[47]
Glynn Winskel. 1993. The Formal Semantics of Programming Languages: An Introduction. MIT Press, Cambridge, MA, USA.

Cited By

View all
  • (2024)SpEQ: Translation of Sparse Codes using EquivalencesProceedings of the ACM on Programming Languages10.1145/36564458:PLDI(1680-1703)Online publication date: 20-Jun-2024
  • (2024)Decomposition-based Synthesis for Applying Divide-and-Conquer-like Algorithmic ParadigmsACM Transactions on Programming Languages and Systems10.1145/364844046:2(1-59)Online publication date: 17-Jun-2024
  • (2024)PyDaskShift: Automatically Convert Loop-Based Sequential Programs to Distributed Parallel ProgramsRecent Advances in Next-Generation Data Science10.1007/978-3-031-67871-4_15(214-234)Online publication date: 23-Aug-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN:9781450347037
DOI:10.1145/3183713
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. compilers
  2. mapreduce
  3. program synthesis
  4. verification
  5. verified lifting

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS '18
Sponsor:

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)197
  • Downloads (Last 6 weeks)21
Reflects downloads up to 30 Aug 2024

Other Metrics

Citations

Cited By

View all
  • (2024)SpEQ: Translation of Sparse Codes using EquivalencesProceedings of the ACM on Programming Languages10.1145/36564458:PLDI(1680-1703)Online publication date: 20-Jun-2024
  • (2024)Decomposition-based Synthesis for Applying Divide-and-Conquer-like Algorithmic ParadigmsACM Transactions on Programming Languages and Systems10.1145/364844046:2(1-59)Online publication date: 17-Jun-2024
  • (2024)PyDaskShift: Automatically Convert Loop-Based Sequential Programs to Distributed Parallel ProgramsRecent Advances in Next-Generation Data Science10.1007/978-3-031-67871-4_15(214-234)Online publication date: 23-Aug-2024
  • (2023)C2TACO: Lifting Tensor Code to TACOProceedings of the 22nd ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences10.1145/3624007.3624053(42-56)Online publication date: 22-Oct-2023
  • (2023)Fast Instruction Selection for Fast Digital Signal ProcessingProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 410.1145/3623278.3624768(125-137)Online publication date: 25-Mar-2023
  • (2023)Predicate Pushdown for Data Science PipelinesProceedings of the ACM on Management of Data10.1145/35892811:2(1-28)Online publication date: 20-Jun-2023
  • (2023)FlashFill++: Scaling Programming by Example by Cutting to the ChaseProceedings of the ACM on Programming Languages10.1145/35712267:POPL(952-981)Online publication date: 11-Jan-2023
  • (2023)mlirSynth: Automatic, Retargetable Program Raising in Multi-Level IR Using Program Synthesis2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT)10.1109/PACT58117.2023.00012(39-50)Online publication date: 21-Oct-2023
  • (2023)Partitionable Programs Using Tyro V2The Recent Advances in Transdisciplinary Data Science10.1007/978-3-031-23387-6_14(185-199)Online publication date: 29-Jan-2023
  • (2022)Application of Big Data Analysis Techniques in Sports Training and Physical Fitness AnalysisWireless Communications & Mobile Computing10.1155/2022/37410872022Online publication date: 1-Jan-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media