Implicit Parallelism through Deep Language Embedding

Published: 02 June 2016
    Abstract

    Parallel collection processing based on second-order functions such as map and reduce has been widely adopted for scalable data analysis. Initially popularized by Google, this programming paradigm has over the past decade found its way into the core APIs of parallel dataflow engines such as Hadoop's MapReduce, Spark's RDDs, and Flink's DataSets. We review programming patterns typical of these APIs and discuss how they relate to the underlying parallel execution model. We argue that fixing the abstraction leaks exposed by these patterns will reduce the cost of data analysis by improving programmer productivity. To achieve that, we first revisit the algebraic foundations of parallel collection processing. Based on that, we propose a simplified API that (i) provides proper support for nested collection processing and (ii) alleviates the need for certain second-order primitives through comprehensions, a declarative syntax akin to SQL. Finally, we present a metaprogramming pipeline that performs algebraic rewrites and physical optimizations, which allow us to target parallel dataflow engines like Spark and Flink with competitive performance.
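
To make the contrast in the abstract concrete, the following is a minimal sketch in plain Python, not the API proposed in the column. It writes one equi-join twice, first with explicit second-order functions and then as a comprehension, and it illustrates why a fold with an associative combine function and a neutral element can be evaluated per partition and then merged. The sample data and the names `join_second_order`, `join_comprehension`, `fold`, and `partitioned_fold` are assumptions made for illustration only.

```python
from functools import reduce
from itertools import chain

# Hypothetical sample data: (user_id, country) and (user_id, amount).
users = [(1, "DE"), (2, "US"), (3, "DE")]
orders = [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5)]


def join_second_order(users, orders):
    """Equi-join spelled out with map, filter, and a flattening step."""
    # Build every (user, order) pair, i.e. a flatMap over a nested map.
    pairs = chain.from_iterable(
        map(lambda u: map(lambda o: (u, o), orders), users)
    )
    # Keep only the pairs whose keys match.
    matched = filter(lambda p: p[0][0] == p[1][0], pairs)
    # Project the fields of interest.
    return list(map(lambda p: (p[0][1], p[1][1]), matched))


def join_comprehension(users, orders):
    """The same join as a comprehension; it reads like SELECT/FROM/WHERE."""
    return [
        (country, amount)
        for (uid, country) in users
        for (oid, amount) in orders
        if uid == oid
    ]


def fold(zero, lift, combine, xs):
    """Lift each element with `lift`, then merge with `combine` from `zero`."""
    return reduce(combine, map(lift, xs), zero)


def partitioned_fold(zero, lift, combine, partitions):
    """Fold each partition independently, then merge the partial results.

    This matches `fold` over the concatenated data only if `combine` is
    associative and `zero` is its neutral element.
    """
    partials = [fold(zero, lift, combine, part) for part in partitions]
    return reduce(combine, partials, zero)


if __name__ == "__main__":
    print(join_comprehension(users, orders))
    amounts = [amount for (_, amount) in orders]
    print(partitioned_fold(0.0, lambda x: x, lambda a, b: a + b,
                           [amounts[:2], amounts[2:]]))
```

In this sketch the comprehension states what is joined and leaves the nesting of map, filter, and flatten implicit, which is the kind of declarative form a compiler could analyze and rewrite before choosing a parallel execution strategy.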

      Published In

      ACM SIGMOD Record, Volume 45, Issue 1
      March 2016
      73 pages
      ISSN: 0163-5808
      DOI: 10.1145/2949741

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Column

      Cited By

      • (2023) A survey on machine learning in array databases. Applied Intelligence 53:9 (9799-9822). DOI: 10.1007/s10489-022-03979-2. Online publication date: 1-May-2023
      • (2022) Automatic Decomposition of a Sequential Algorithm for MapReduce Frameworks. 2022 IEEE International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON) (1780-1783). DOI: 10.1109/SIBIRCON56155.2022.10017034. Online publication date: 11-Nov-2022
      • (2021) TraNCE. Proceedings of the VLDB Endowment 14:12 (2727-2730). DOI: 10.14778/3476311.3476330. Online publication date: 1-Jul-2021
      • (2021) Declarative Data Analytics: A Survey. IEEE Transactions on Knowledge and Data Engineering 33:6 (2392-2411). DOI: 10.1109/TKDE.2019.2958084. Online publication date: 1-Jun-2021
      • (2020) Scalable querying of nested data. Proceedings of the VLDB Endowment 14:3 (445-457). DOI: 10.5555/3430915.3442441. Online publication date: 14-Dec-2020
      • (2020) Scalable querying of nested data. Proceedings of the VLDB Endowment 14:3 (445-457). DOI: 10.14778/3430915.3430933. Online publication date: 1-Nov-2020
      • (2019) Representations and Optimizations for Embedded Parallel Dataflow Languages. ACM Transactions on Database Systems 44:1 (1-44). DOI: 10.1145/3281629. Online publication date: 29-Jan-2019
      • (2018) Mosaics in Big Data. Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems (7-13). DOI: 10.1145/3210284.3214344. Online publication date: 25-Jun-2018
      • (2018) Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications. Proceedings of the 2018 International Conference on Management of Data (1205-1220). DOI: 10.1145/3183713.3196891. Online publication date: 27-May-2018
      • (2018) Compile-Time Code Generation for Embedded Data-Intensive Query Languages. 2018 IEEE International Congress on Big Data (BigData Congress) (1-8). DOI: 10.1109/BigDataCongress.2018.00008. Online publication date: Jul-2018