DOI: 10.1145/2723372.2750543
research-article

Implicit Parallelism through Deep Language Embedding

Published: 27 May 2015

Abstract

    The appeal of MapReduce has spawned a family of systems that implement or extend it. To enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmer productivity.
    In this paper we show that designing data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and hurts programmer productivity. Instead, we argue that deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.
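
The contrast the abstract draws between a declarative specification and a hand-written dataflow can be illustrated with a small sketch. This is not the paper's language, which is embedded in Scala. It is a plain Python analogy, and the data and the function names `declarative` and `explicit_dataflow` are hypothetical. The first function states what to compute as a comprehension (a join plus a filter). The second fixes an evaluation plan by hand, which is roughly the kind of low-level decision the abstract argues should be left to a compiler.

```python
from collections import defaultdict

# Hypothetical toy data, invented for this illustration (not from the paper).
orders = [("alice", 30), ("bob", 20), ("alice", 12), ("carol", 7)]
premium = ["alice", "carol"]


def declarative(orders, premium):
    """Comprehension style: describes the result (join + filter), not the plan."""
    return sorted(
        (user, amount)
        for (user, amount) in orders
        for p in premium
        if user == p and amount > 10
    )


def explicit_dataflow(orders, premium):
    """Operator style: the programmer chooses the plan by hand.

    Here: push the filter down, build a hash table on the orders,
    then probe it with the premium users (a manual hash join).
    """
    filtered = (o for o in orders if o[1] > 10)
    build = defaultdict(list)
    for user, amount in filtered:
        build[user].append(amount)
    joined = [(p, amount) for p in premium for amount in build.get(p, [])]
    return sorted(joined)


if __name__ == "__main__":
    print(declarative(orders, premium))
    print(explicit_dataflow(orders, premium))
```

Both functions return the same pairs. The difference is where the plan is chosen: in the comprehension it is left open, so a compiler would in principle be free to pick the join strategy and filter placement, while in the operator version those choices are baked into the program.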

      Published In

      SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
      May 2015
      2110 pages
      ISBN:9781450327589
      DOI:10.1145/2723372

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. control flow
      2. data-parallel execution
      3. large-scale data analysis
      4. mapreduce
      5. monad comprehensions
      6. scala macros

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS'15: International Conference on Management of Data
      May 31 - June 4, 2015
      Melbourne, Victoria, Australia

      Acceptance Rates

      SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
      Overall Acceptance Rate 785 of 4,003 submissions, 20%
