Implicit Parallelism through Deep Language Embedding

Authors:

Alexander Alexandrov,

Asterios Katsifodimos,

Felix Schüler,

Lauritz Thamsen,

Volker Markl

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 47 - 61

https://doi.org/10.1145/2723372.2750543

Published: 27 May 2015

Abstract

The appeal of MapReduce has spawned a family of systems that implement or extend it. To enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmer productivity.

In this paper we show that designing data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and harms programmer productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data parallelism and the distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.
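
As a rough illustration of the difference between calling a dataflow library directly and deeply embedding a pipeline behind an intermediate representation, the following Python sketch may help. It is not the Scala language proposed in the paper: the node classes `Source`, `Filter`, and `Map`, and the helpers `describe` and `run`, are hypothetical names introduced only for this example, and the "parallel" backend simply evaluates contiguous partitions one after another.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


def shallow_pipeline(xs: List[int]) -> List[int]:
    """Shallow style: each operator runs as soon as it is called,
    so nothing ever sees the pipeline as a whole."""
    return list(map(lambda x: x * x, filter(lambda x: x % 2 == 0, xs)))


# Deep style: the same pipeline is first captured as a tree of nodes
# (a toy intermediate representation, assumed for this sketch only).
@dataclass(frozen=True)
class Source:
    data: tuple


@dataclass(frozen=True)
class Filter:
    child: Any
    pred: Callable[[int], bool]


@dataclass(frozen=True)
class Map:
    child: Any
    fn: Callable[[int], int]


def describe(node: Any) -> str:
    """Render the captured plan so it can be inspected before it runs."""
    if isinstance(node, Source):
        return "Source"
    return f"{type(node).__name__}({describe(node.child)})"


def evaluate(node: Any, rows) -> List[int]:
    """Interpret the plan over one partition of the input rows."""
    if isinstance(node, Source):
        return list(rows)
    rows = evaluate(node.child, rows)
    if isinstance(node, Filter):
        return [r for r in rows if node.pred(r)]
    if isinstance(node, Map):
        return [node.fn(r) for r in rows]
    raise TypeError(f"unknown plan node: {node!r}")


def run(node: Any, partitions: int = 1) -> List[int]:
    """Hypothetical backend: split the source into contiguous partitions
    and evaluate the plan on each. A real engine would ship partitions to
    workers; the program that built the plan would not change."""
    source = node
    while not isinstance(source, Source):
        source = source.child
    data = source.data
    size = max(1, -(-len(data) // partitions))  # ceiling division
    out: List[int] = []
    for start in range(0, len(data), size):
        out.extend(evaluate(node, data[start:start + size]))
    return out


# The plan is plain data: it can be printed, rewritten, or handed to
# a different backend without touching the code that declared it.
plan = Map(Filter(Source(tuple(range(10))), lambda x: x % 2 == 0),
           lambda x: x * x)
print(describe(plan))
print(run(plan, partitions=3))
```

The point of the sketch is only that the partition count is a parameter of the backend rather than of the program: `run(plan, partitions=1)` and `run(plan, partitions=3)` evaluate the same declared pipeline.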

Index Terms

Implicit Parallelism through Deep Language Embedding
1. Information systems
  1. Data management systems
    1. Query languages

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015

2110 pages

ISBN:9781450327589

DOI:10.1145/2723372

General Chair: Timos Sellis (RMIT University, Australia)

Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'15

Sponsor:

SIGMOD

SIGMOD/PODS'15: International Conference on Management of Data

May 31 - June 4, 2015

Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
