PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

Authors: R. Matthew Barnett, Tania Lorido-Botran, Kia Teymourian, Chris Jermaine

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 1189 - 1204

https://doi.org/10.1145/3183713.3196933

Published: 27 May 2018

Abstract

This paper describes PlinyCompute, a system for developing high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database-style optimization to figure out how to stage distributed computations. In the small, however, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model") and an associated memory management system that has been designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are built on top of the Java Virtual Machine (JVM) and hence must at least partially cede performance-critical concerns, such as memory management (including layout and de/allocation) and virtual method/function dispatch, to the JVM. This hybrid approach (declarative in the large, trusting the programmer to use the PC object model efficiently in the small) results in a system that is ideal for the development of reusable, data-intensive tools and libraries.
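The abstract does not describe how the PC object model is implemented, so the following is only a minimal C++ sketch of what programmer-controlled layout and de/allocation can look like in general. It assumes a bump allocator over a fixed-size page and objects that refer to each other by byte offset rather than by pointer, so the raw bytes stay valid after being copied. Every name here (`Page`, `Node`, `push_front`, `sum_list`, `demo_roundtrip`) is hypothetical and is not part of the PlinyCompute API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <type_traits>
#include <vector>

// Hypothetical illustration only; not the PlinyCompute API.

// A fixed-size block of bytes. Objects are carved out of it front to back,
// and all of them are released together when the Page is destroyed.
class Page {
public:
    explicit Page(std::size_t capacity) : bytes_(capacity), used_(0) {}

    // Bump-allocate `size` bytes aligned to `align`; returns the byte offset.
    std::size_t allocate(std::size_t size, std::size_t align) {
        assert(align != 0);
        std::size_t start = (used_ + align - 1) / align * align;
        if (start + size > bytes_.size()) throw std::bad_alloc();
        used_ = start + size;
        return start;
    }

    std::uint8_t* data() { return bytes_.data(); }
    std::size_t used() const { return used_; }

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t used_;
};

constexpr std::uint32_t kNull = 0xFFFFFFFFu;

// A list node that links to its successor by offset within the same page.
struct Node {
    double value;
    std::uint32_t next;  // offset of the next Node, or kNull
};
static_assert(std::is_trivially_copyable<Node>::value,
              "Node must be safe to move around with memcpy");

// Allocate a node on the page and link it in front of `head`.
std::uint32_t push_front(Page& page, std::uint32_t head, double value) {
    std::size_t off = page.allocate(sizeof(Node), alignof(Node));
    Node n{value, head};
    std::memcpy(page.data() + off, &n, sizeof n);
    return static_cast<std::uint32_t>(off);
}

// Walk the list using only a base address and offsets.
double sum_list(const std::uint8_t* base, std::uint32_t head) {
    double total = 0.0;
    for (std::uint32_t off = head; off != kNull;) {
        Node n;
        std::memcpy(&n, base + off, sizeof n);
        total += n.value;
        off = n.next;
    }
    return total;
}

// Build a list on one page, copy the raw bytes elsewhere (standing in for a
// network transfer or a write to disk), and read the list from the copy.
double demo_roundtrip() {
    Page page(256);
    std::uint32_t head = kNull;
    for (double v : {1.0, 2.0, 3.5}) head = push_front(page, head, v);
    std::vector<std::uint8_t> shipped(page.data(), page.data() + page.used());
    return sum_list(shipped.data(), head);
}
```

Because the nodes hold offsets instead of addresses, the copied buffer can be traversed without any per-object serialization or pointer fix-up. A managed runtime such as the JVM generally does not give the programmer this degree of control over object placement.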

Index Terms

PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development
1. Information systems
  1. Data management systems
    1. Database management system engines

Information

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

May 2018

1874 pages

ISBN:9781450347037

DOI:10.1145/3183713

General Chairs:
Gautam Das, University of Texas at Arlington, USA
Christopher Jermaine, Rice University, USA
Philip Bernstein, Microsoft Research, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NSF
DARPA MUSE program

Conference

SIGMOD/PODS '18

Sponsor:

SIGMOD

SIGMOD/PODS '18: International Conference on Management of Data

June 10 - 15, 2018

Houston, TX, USA

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
