PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

Zou, Jia; Barnett, R. Matthew; Lorido-Botran, Tania; Luo, Shangyu; Monroy, Carlos; Sikdar, Sourav; Teymourian, Kia; Yuan, Binhang; Jermaine, Chris

arXiv:1711.05573 (cs)

[Submitted on 15 Nov 2017 (v1), last revised 16 Nov 2017 (this version, v2)]

Abstract: This paper describes PlinyCompute, a system for the development of high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database-style optimization to figure out how to stage distributed computations. In the small, however, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model") and an associated memory management system that has been designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are built on top of the Java Virtual Machine (JVM) and hence must at least partially cede performance-critical concerns, such as memory management (including layout and de/allocation) and virtual method/function dispatch, to the JVM. This hybrid approach (declarative in the large, trusting the programmer's ability to use the PC object model efficiently in the small) results in a system that is ideal for the development of reusable, data-intensive tools and libraries. Through extensive benchmarking, we show that implementing complex object manipulation and non-trivial, library-style computations on top of PlinyCompute can result in a speedup of 2x to more than 50x compared to equivalent implementations on Spark.

Comments:	48 pages, including references and Appendix
Subjects:	Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:1711.05573 [cs.DB]
	(or arXiv:1711.05573v2 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.1711.05573

Submission history

From: Jia Zou
[v1] Wed, 15 Nov 2017 14:01:06 UTC (604 KB)
[v2] Thu, 16 Nov 2017 02:30:18 UTC (604 KB)
