DOI: 10.1145/3183713.3196933

PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

Published: 27 May 2018

Abstract

This paper describes PlinyCompute, a system for the development of high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database-style optimization to figure out how to stage distributed computations. In the small, however, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model") and an associated memory management system that has been designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are constructed on top of the Java Virtual Machine (JVM) and hence must at least partially cede performance-critical concerns, such as memory management (including layout and de/allocation) and virtual method/function dispatch, to the JVM. This hybrid approach (declarative in the large, trusting the programmer's ability to use the PC object model efficiently in the small) results in a system that is ideal for the development of reusable, data-intensive tools and libraries.
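The abstract does not show the PC object model's API, so the following is only a rough sketch of the general idea behind controlling memory layout by hand: objects live in one contiguous block and refer to each other by byte offset, so the whole block can be copied byte for byte (as it might be when written to disk or sent to another machine) and still be traversed. Every name here (Block, Node, kNull, build_list, sum_list) is hypothetical and is not taken from PlinyCompute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical illustration only; none of these names come from PlinyCompute.
constexpr std::size_t kNull = static_cast<std::size_t>(-1);

// A list node that links to its successor by offset within the same block,
// not by raw pointer, so the link stays valid after the block is copied.
struct Node {
    std::int64_t value;
    std::size_t next;  // byte offset of the next Node, or kNull
};

// A contiguous byte buffer with bump allocation.
class Block {
public:
    explicit Block(std::size_t capacity) : bytes_(capacity), used_(0) {}

    // Copy a Node into the next aligned slot and return its offset.
    std::size_t store(const Node& n) {
        const std::size_t a = alignof(Node);
        const std::size_t start = (used_ + a - 1) / a * a;
        assert(start + sizeof(Node) <= bytes_.size() && "block is full");
        std::memcpy(bytes_.data() + start, &n, sizeof(Node));
        used_ = start + sizeof(Node);
        return start;
    }

    // Read the Node stored at the given offset.
    Node load(std::size_t offset) const {
        assert(offset + sizeof(Node) <= used_ && "offset out of range");
        Node n;
        std::memcpy(&n, bytes_.data() + offset, sizeof(Node));
        return n;
    }

private:
    std::vector<unsigned char> bytes_;
    std::size_t used_;
};

// Store the values as a linked list inside the block; returns the head offset.
std::size_t build_list(Block& block, const std::vector<std::int64_t>& values) {
    std::size_t head = kNull;
    for (auto it = values.rbegin(); it != values.rend(); ++it) {
        head = block.store(Node{*it, head});
    }
    return head;
}

// Walk the list starting at the given head offset and add up the values.
std::int64_t sum_list(const Block& block, std::size_t head) {
    std::int64_t total = 0;
    for (std::size_t cur = head; cur != kNull; cur = block.load(cur).next) {
        total += block.load(cur).value;
    }
    return total;
}
```

Under these assumptions, building a list in one Block, copying that Block with its copy constructor, and calling sum_list on the copy with the same head offset yields the same total as on the original, with no per-object serialization step. A JVM-based system cannot normally arrange object layout this way, which is the contrast the abstract draws.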

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN: 9781450347037
DOI: 10.1145/3183713
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. distributed computing
  2. object model
  3. query compilation

Qualifiers

  • Research-article

Funding Sources

  • NSF
  • DARPA MUSE program

Conference

SIGMOD/PODS '18

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
