Apache Spark

  • Living reference work entry
Encyclopedia of Big Data Technologies

Definition

Apache Spark is a cluster computing solution and in-memory processing framework that extends the MapReduce model to support other types of computations, such as interactive queries and stream processing (Zaharia et al. 2012). Designed to cover a variety of workloads, Spark introduces an abstraction called Resilient Distributed Datasets (RDDs) that enables running computations in memory in a fault-tolerant manner. RDDs, which are immutable and partitioned collections of records, provide a programming interface for performing operations, such as map, filter, and join, over multiple data items. For fault-tolerance purposes, Spark records all transformations carried out to build a dataset, thus forming a lineage graph from which lost partitions can be recomputed.
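
The lineage idea can be illustrated with a small, self-contained sketch. It is a toy model in plain Python, not Spark's API: the class `ToyRDD` and all of its methods are hypothetical names introduced here for illustration. It assumes only narrow transformations (map and filter), so each output partition depends on exactly one parent partition. Derived datasets store no data, only the recipe that produced them, so a lost partition can be rebuilt by replaying that recipe from the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, partitioned, lineage-aware."""
    name: str
    parent: Optional["ToyRDD"] = None
    # Function turning one parent partition into one partition of this dataset.
    compute_fn: Optional[Callable[[list], list]] = None
    # Only source datasets hold data; derived datasets hold just the recipe.
    source_partitions: Optional[tuple] = None

    def map(self, f):
        # Returns a new dataset; the receiver is never modified.
        return ToyRDD(f"map({self.name})", self,
                      lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(f"filter({self.name})", self,
                      lambda part: [x for x in part if pred(x)])

    def num_partitions(self):
        # Narrow transformations preserve the partition count of the source.
        node = self
        while node.parent is not None:
            node = node.parent
        return len(node.source_partitions)

    def compute_partition(self, i):
        """Rebuild partition i by replaying the lineage from the source."""
        if self.parent is None:
            return list(self.source_partitions[i])
        return self.compute_fn(self.parent.compute_partition(i))

    def collect(self):
        return [x for i in range(self.num_partitions())
                for x in self.compute_partition(i)]

    def lineage(self):
        """Names of all datasets from the source down to this one."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))


# A source dataset with two partitions, followed by two transformations.
numbers = ToyRDD("numbers", source_partitions=((1, 2, 3), (4, 5, 6)))
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

result = evens_squared.collect()                 # [4, 16, 36]
# Simulate losing partition 1: it is recomputed from lineage alone.
recovered = evens_squared.compute_partition(1)   # [16, 36]
print(evens_squared.lineage())
```

In this sketch, recovery touches only the lost partition and its ancestors, which is what makes recording lineage cheaper than replicating the data itself. Operations such as join, which combine records from several parent partitions, are deliberately left out to keep the example short.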

Overview

Spark (Zaharia et al. 2016) is an open-source big data framework originally developed at the University of California, Berkeley, and later adopted by the Apache Software Foundation, which has maintained it ever since. Spark was designed to address some of the limitations of the...

Notes

  1. https://wiki.openstack.org/wiki/Swift

  2. https://aws.amazon.com/s3/

  3. https://spark.apache.org/docs/latest/sql-programming-guide.html

  4. https://kafka.apache.org/

  5. https://aws.amazon.com/kinesis/

  6. https://spark-summit.org/

  7. https://edgent.apache.org/

  8. https://beam.apache.org/

References

  • Alsheikh MA, Niyato D, Lin S, Tan H-P, Han Z (2016) Mobile big data analytics using deep learning and Apache Spark. IEEE Netw 30(3):22–29

  • Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A, Zaharia M (2015) Spark SQL: relational data processing in Spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data (SIGMOD’15). ACM, New York, pp 1383–1394

  • Freeman J, Vladimirov N, Kawashima T, Mu Y, Sofroniew NJ, Bennett DV, Rosen J, Yang C-T, Looger LL, Ahrens MB (2014) Mapping brain activity at scale with cluster computing. Nat Methods 11(9):941–950

  • Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I (2014) GraphX: graph processing in a distributed dataflow framework. In: OSDI, vol 14, pp 599–613

  • Ha K, Chen Z, Hu W, Richter W, Pillai P, Satyanarayanan M (2014) Towards wearable cognitive assistance. In: 12th annual international conference on mobile systems, applications, and services, MobiSys’14. ACM, New York, pp 68–81. http://dx.doi.org/10.1145/2594368.2594383

  • Hindman B, Konwinski A, Zaharia M, Ghodsi A, Joseph AD, Katz R, Shenker S, Stoica I (2011) Mesos: a platform for fine-grained resource sharing in the data center. In: NSDI, vol 11, pp 22–22

  • Hu YC, Patel M, Sabella D, Sprecher N, Young V (2015) Mobile edge computing – a key technology towards 5G. ETSI White Paper 11(11):1–16

  • Karau H, Konwinski A, Wendell P, Zaharia M (2015) Learning Spark: lightning-fast big data analysis. O’Reilly Media, Inc., Beijing

  • Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S et al (2016) MLlib: machine learning in Apache Spark. J Mach Learn Res 17(1):1235–1241

  • Ryza S, Laserson U, Owen S, Wills J (2017) Advanced analytics with Spark: patterns for learning from data at scale. O’Reilly Media, Inc., Sebastopol

  • Shah MA, Hellerstein JM, Chandrasekaran S, Franklin MJ (2003) Flux: an adaptive partitioning operator for continuous query systems. In: 19th international conference on data engineering (ICDE 2003). IEEE Computer Society, pp 25–36

  • Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: 4th annual symposium on cloud computing (SOCC’13). ACM, New York, pp 5:1–5:16. http://dx.doi.org/10.1145/2523616.2523633

  • Wu Y, Tan KL (2015) ChronoStream: elastic stateful stream computation in the cloud. In: 2015 IEEE 31st international conference on data engineering, pp 723–734

  • Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: 9th USENIX conference on networked systems design and implementation (NSDI’12). USENIX Association, Berkeley, pp 2–2

  • Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: 24th ACM symposium on operating systems principles (SOSP’13). ACM, New York, pp 423–438

  • Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: a unified engine for big data processing. Commun ACM 59(11):56–65

Author information

Corresponding author

Correspondence to Marcos Dias de Assunção.

Copyright information

© 2018 Springer International Publishing AG

About this entry

Cite this entry

Veith, A.d.S., Assunção, M.D.d. (2018). Apache Spark. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_37-1

  • DOI: https://doi.org/10.1007/978-3-319-63962-8_37-1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63962-8

  • Online ISBN: 978-3-319-63962-8

  • eBook Packages: Living Reference Mathematics, Reference Module Computer Science and Engineering
