Apache Spark

Veith, Alexandre da Silva; Assunção, Marcos Dias de

doi:10.1007/978-3-319-63962-8_37-1

Definition

Apache Spark is a cluster computing and in-memory processing framework that extends the MapReduce model to support other types of computation, such as interactive queries and stream processing (Zaharia et al. 2012). Designed to cover a variety of workloads, Spark introduces an abstraction called Resilient Distributed Datasets (RDDs) that enables running computations in memory in a fault-tolerant manner. RDDs, which are immutable and partitioned collections of records, provide a programming interface for performing operations, such as map, filter, and join, over multiple data items. For fault tolerance, Spark records all transformations carried out to build a dataset, thus forming a lineage graph from which lost partitions can be recomputed.
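
The lineage idea described above can be illustrated with a small, self-contained sketch. This is a toy model in plain Python, not Spark's API: the class name ToyRDD and its methods (map, filter, lineage, compute_partition, collect) are hypothetical names chosen for illustration, and join is left out for brevity. It only aims to show how an immutable, partitioned dataset that remembers its transformations can rebuild any single partition by replaying them.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical toy model: an immutable, partitioned dataset that
    records the transformation used to derive it from its parent."""
    source: Optional[Tuple[tuple, ...]] = None  # base partitions (root only)
    parent: Optional["ToyRDD"] = None           # dataset this one derives from
    op: str = "source"                          # name of the transformation
    fn: Optional[Callable] = None               # function applied per record

    def map(self, fn):
        # A transformation returns a new dataset; nothing is computed yet.
        return ToyRDD(parent=self, op="map", fn=fn)

    def filter(self, pred):
        return ToyRDD(parent=self, op="filter", fn=pred)

    def lineage(self):
        # Walk back to the root and list the recorded transformations.
        node, steps = self, []
        while node is not None:
            steps.append(node.op)
            node = node.parent
        return list(reversed(steps))

    def num_partitions(self):
        node = self
        while node.parent is not None:
            node = node.parent
        return len(node.source)

    def compute_partition(self, i):
        # Rebuild partition i from the base data by replaying the lineage.
        # This is what makes a lost partition recoverable in this model.
        if self.parent is None:
            return list(self.source[i])
        records = self.parent.compute_partition(i)
        if self.op == "map":
            return [self.fn(r) for r in records]
        return [r for r in records if self.fn(r)]

    def collect(self):
        return [r for i in range(self.num_partitions())
                for r in self.compute_partition(i)]


base = ToyRDD(source=((1, 2, 3), (4, 5, 6)))  # two partitions
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(evens_squared.lineage())             # ['source', 'filter', 'map']
print(evens_squared.collect())             # [4, 16, 36]
print(evens_squared.compute_partition(1))  # [16, 36]
```

In this sketch, recomputing partition 1 touches only the base records of that partition, which mirrors why recording transformations, rather than replicating data, can be enough for recovery. A real system would also cache results and schedule work across machines, which this model does not attempt.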

Overview

Spark (Zaharia et al. 2016) is an open-source big data framework originally developed at the University of California, Berkeley, and later adopted by the Apache Software Foundation, which has maintained it ever since. Spark was designed to address some of the limitations of the...

References

Alsheikh MA, Niyato D, Lin S, Tan H-P, Han Z (2016) Mobile big data analytics using deep learning and Apache Spark. IEEE Netw 30(3):22–29
Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A, Zaharia M (2015) Spark SQL: relational data processing in Spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data (SIGMOD’15). ACM, New York, pp 1383–1394
Freeman J, Vladimirov N, Kawashima T, Mu Y, Sofroniew NJ, Bennett DV, Rosen J, Yang C-T, Looger LL, Ahrens MB (2014) Mapping brain activity at scale with cluster computing. Nat Methods 11(9):941–950
Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I (2014) GraphX: graph processing in a distributed dataflow framework. In: OSDI, vol 14, pp 599–613
Ha K, Chen Z, Hu W, Richter W, Pillai P, Satyanarayanan M (2014) Towards wearable cognitive assistance. In: 12th annual international conference on mobile systems, applications, and services, MobiSys’14. ACM, New York, pp 68–81. http://dx.doi.org/10.1145/2594368.2594383
Hindman B, Konwinski A, Zaharia M, Ghodsi A, Joseph AD, Katz R, Shenker S, Stoica I (2011) Mesos: a platform for fine-grained resource sharing in the data center. In: NSDI, vol 11, pp 22–22
Hu YC, Patel M, Sabella D, Sprecher N, Young V (2015) Mobile edge computing – a key technology towards 5G. ETSI White Paper 11(11):1–16
Karau H, Konwinski A, Wendell P, Zaharia M (2015) Learning Spark: lightning-fast big data analysis. O’Reilly Media, Inc., Beijing
Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S et al (2016) MLlib: machine learning in Apache Spark. J Mach Learn Res 17(1):1235–1241
Ryza S, Laserson U, Owen S, Wills J (2017) Advanced analytics with Spark: patterns for learning from data at scale. O’Reilly Media, Inc., Sebastopol
Shah MA, Hellerstein JM, Chandrasekaran S, Franklin MJ (2003) Flux: an adaptive partitioning operator for continuous query systems. In: 19th international conference on data engineering (ICDE 2003). IEEE Computer Society, pp 25–36
Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: 4th annual symposium on cloud computing (SOCC’13). ACM, New York, pp 5:1–5:16. http://dx.doi.org/10.1145/2523616.2523633
Wu Y, Tan KL (2015) ChronoStream: elastic stateful stream computation in the cloud. In: 2015 IEEE 31st international conference on data engineering, pp 723–734
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: 9th USENIX conference on networked systems design and implementation (NSDI’12). USENIX Association, Berkeley, pp 2–2
Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: 24th ACM symposium on operating systems principles (SOSP’13). ACM, New York, pp 423–438
Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: a unified engine for big data processing. Commun ACM 59(11):56–65

Author information

Authors and Affiliations

Inria Avalon, LIP Laboratory, ENS Lyon, University of Lyon, Lyon, France
Alexandre da Silva Veith & Marcos Dias de Assunção

Corresponding author

Correspondence to Marcos Dias de Assunção.

Editor information

Editors and Affiliations

School of Comp. Sci. and Engineering, University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr
Sch of Info Techno, Building J12, University of Sydney, Sydney, Australia
Albert Zomaya

Section Editor information

School of Computing, Engineering and Mathematics, Western Sydney University, Locked Bag 1797, 2751, Penrith, NSW, Australia
Rodrigo N. Calheiros
Inria, LIP, ENS Lyon, 46 allee d'Italie, 69364, Lyon, France
Marcos Dias de Assuncao Ph.D

Cite this entry

Veith, A.d.S., Assunção, M.D.d. (2018). Apache Spark. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_37-1

DOI: https://doi.org/10.1007/978-3-319-63962-8_37-1
Published: 01 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Living Reference Mathematics; Reference Module Computer Science and Engineering
