research-article

Changing engines in midstream: a Java stream computational model for big data processing

Published: 01 August 2014

Abstract

With the addition of lambda expressions and the Stream API in Java 8, Java has gained a powerful and expressive query language that operates over in-memory collections of Java objects, making the transformation and analysis of data more convenient, scalable, and efficient. In this paper, we build on Java 8 Stream and add a DistributableStream abstraction that supports federated query execution over an extensible set of distributed compute engines. Each query eventually results in the creation of a materialized result that is returned either as a local object or as an engine-defined distributed Java Collection that can be saved and/or used as a source for future queries. Distinctively, DistributableStream supports the changing of compute engines both between and within a query, allowing different parts of a computation to be executed on different platforms. At execution time, the query is organized as a sequence of pipelined stages, each stage potentially running on a different engine. Each node that is part of a stage executes its portion of the computation on the data available locally or produced by the previous stage of the computation. This approach allows computations to be assigned to engines based on pricing, data locality, and resource availability. Coupled with the inherent laziness of stream operations, this brings great flexibility to query planning and separates the semantics of the query from the details of the engine used to execute it. We currently support three engines (Local, Apache Hadoop MapReduce, and Oracle Coherence), and we illustrate how new engines and data sources can be added.
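The idea of a query organized as stages, each of which may run on a different engine, can be illustrated with a minimal sketch built only on the standard java.util.stream API. It is not the paper's DistributableStream API: the `Engine` interface, the `SEQUENTIAL` and `PARALLEL` instances, and the `normalize` and `count` methods are hypothetical names introduced here, and the two "engines" merely stand in for real compute platforms by choosing sequential or parallel local streams.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StagedQuerySketch {

    /** Hypothetical engine abstraction: decides how a stream over the data is opened. */
    interface Engine {
        <T> Stream<T> open(Collection<T> data);
    }

    /** Stand-in for a local engine: plain sequential stream. */
    static final Engine SEQUENTIAL = new Engine() {
        public <T> Stream<T> open(Collection<T> data) {
            return data.stream();
        }
    };

    /** Stand-in for a second engine: parallel stream on the common fork-join pool. */
    static final Engine PARALLEL = new Engine() {
        public <T> Stream<T> open(Collection<T> data) {
            return data.parallelStream();
        }
    };

    /** Stage 1: lower-case the words, drop short ones, and materialize the result. */
    static List<String> normalize(Engine engine, Collection<String> words) {
        return engine.open(words)
                .map(String::toLowerCase)
                .filter(w -> w.length() > 3)
                .collect(Collectors.toList());
    }

    /** Stage 2: count occurrences in the output of stage 1, sorted by key. */
    static Map<String, Long> count(Engine engine, Collection<String> words) {
        return engine.open(words)
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> input =
                Arrays.asList("Stream", "engine", "STREAM", "map", "Engine", "stream");
        // The query text stays the same; only the engine chosen per stage changes.
        List<String> stage1 = normalize(SEQUENTIAL, input);
        Map<String, Long> result = count(PARALLEL, stage1);
        System.out.println(result); // prints {engine=2, stream=3}
    }
}
```

In this sketch the intermediate list plays the role of the materialized result handed from one stage to the next, and swapping the `Engine` argument changes where a stage runs without touching the stage's logic.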

Published In

Proceedings of the VLDB Endowment, Volume 7, Issue 13
August 2014
466 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 August 2014
Published in PVLDB Volume 7, Issue 13

Qualifiers

  • Research-article

Cited By

  • (2018) Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing. ACM SIGPLAN Notices 53:2 (564-577). DOI: 10.1145/3296957.3173187. Online publication date: 19-Mar-2018
  • (2018) Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (564-577). DOI: 10.1145/3173162.3173187. Online publication date: 19-Mar-2018
  • (2017) Upsortable. Proceedings of the VLDB Endowment 10:12 (1873-1876). DOI: 10.14778/3137765.3137797. Online publication date: 1-Aug-2017
  • (2017) Measuring performance of middleware technologies for medical systems. ACM SIGBED Review 14:2 (8-14). DOI: 10.1145/3076125.3076126. Online publication date: 31-Mar-2017
  • (2015) Integrating Java 8 Streams with The Real-Time Specification for Java. Proceedings of the 13th International Workshop on Java Technologies for Real-time and Embedded Systems (1-10). DOI: 10.1145/2822304.2822314. Online publication date: 7-Oct-2015
