Apache Spark Components
a. Spark Core
b. Spark SQL
c. Spark Streaming
d. MLlib (Machine Learning Library)
e. GraphX
f. SparkR
Now that we have some understanding of Spark, let us dive deeper and look at the components it consists of. Apache Spark consists of the Spark Core Engine, Spark SQL, Spark Streaming, MLlib, GraphX and SparkR. You can use the Spark Core Engine along with any of the other five components mentioned above; it is not necessary to use all the Spark components together. Depending on the use case and application, any one or more of them can be used along with Spark Core.
Let us look at each of these components in detail.
Spark Core: Spark Core is the heart of the Apache Spark framework. It provides the execution engine for the Spark platform, which the other components are built on top of and use as required. Spark Core provides in-memory computing and the ability to reference datasets stored in external storage systems. It is Spark Core's responsibility to perform all the basic I/O functions, scheduling and monitoring; fault recovery and effective memory management are among its other important functions.
Spark Core uses a very special data structure called the RDD (Resilient Distributed Dataset). Data sharing in distributed processing systems like MapReduce requires the data from intermediate steps to be stored in and then retrieved from permanent storage like HDFS or S3, which makes processing slow due to the serialization, deserialization and disk I/O involved. RDDs overcome this because they are in-memory, fault-tolerant data structures that can be shared across different tasks within the same Spark process. RDDs are immutable, partitioned collections and can hold any type of object from Python, Scala, Java or user-defined classes. An RDD can be created either by transforming an existing RDD or by loading data from an external source like HDFS or HBase. We will look into RDDs and their transformations in depth in later sections of the tutorial.
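To make this concrete, here is a minimal sketch of creating and transforming RDDs. It assumes a spark-shell session where a SparkContext named sc already exists, and the HDFS path is only a placeholder.

```scala
// Create an RDD by parallelizing a local collection across the cluster.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations (map, filter) are lazy: they only define new RDDs.
val squares = numbers.map(n => n * n)
val evens   = squares.filter(_ % 2 == 0)

// An action (collect, count, ...) triggers the actual distributed computation.
println(evens.collect().mkString(", "))          // prints: 4, 16

// An RDD can also be created from an external source such as HDFS;
// the path below is a placeholder.
val lines = sc.textFile("hdfs:///tmp/sample.txt")
println(lines.count())
```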
Spark SQL: Spark SQL grew out of Shark, which was the first interactive SQL-on-Hadoop system. Shark was built on top of the Hive codebase and achieved performance improvements by swapping out the physical execution engine part of Hive. But due to the limitations of Hive, Shark was not able to achieve the performance it was expected to, so the Shark project was stopped and Spark SQL was built, with the knowledge gained from Shark, on top of the Spark Core Engine to leverage the power of Spark. You can read more about Shark in the blog post by Reynold Xin, one of the Spark SQL code maintainers.
Spark SQL is named this way because it works with data in a fashion similar to SQL; in fact, it has been stated that Spark SQL aims to meet the SQL-92 standard. The gist is that it allows developers to write declarative code, letting the engine use as much of the data and its stored structure (RDDs) as it can to optimize the resulting distributed query behind the scenes. The goal is to let users worry less about the distributed nature of the data and focus on the business use case. Users can perform extract, transform and load (ETL) operations on data from a variety of sources and formats like JSON, Parquet or Hive, and then execute ad-hoc queries with Spark SQL.
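As a small illustration, the sketch below loads JSON data, runs an ad-hoc SQL query over it and writes the result back out as Parquet. It assumes a spark-shell session where a SparkSession named spark is available (in recent Spark versions this entry point subsumes the SQLContext/HiveContext mentioned in the next paragraph); the file paths and column names are placeholders.

```scala
// Load semi-structured JSON into a DataFrame (path is a placeholder).
val events = spark.read.json("events.json")

// Expose the data to SQL as a temporary view.
events.createOrReplaceTempView("events")

// Ad-hoc declarative query; Spark optimizes and distributes the execution.
val summary = spark.sql(
  "SELECT userId, COUNT(*) AS clicks FROM events GROUP BY userId")
summary.show()

// Write the result back out in a different format (Parquet).
summary.write.mode("overwrite").parquet("events_summary.parquet")
```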
The DataFrame is the main abstraction in Spark SQL. A DataFrame is a distributed collection of data organized into named columns; in earlier versions of Spark SQL, DataFrames were referred to as SchemaRDDs. The DataFrame API integrates with Spark's procedural code to provide tight integration between procedural and relational processing. The DataFrame API evaluates operations lazily, which allows relational optimizations and optimizes the overall data processing workflow. All relational functionality in Spark can be accessed through the Spark SQL context (SQLContext) or HiveContext.
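That laziness can be seen directly in the DataFrame API. The short sketch below, assuming the same spark-shell session as above, only builds a logical plan until an action such as show() is called.

```scala
import org.apache.spark.sql.functions._

// A simple DataFrame of numbers.
val df = spark.range(1, 1000).toDF("id")

// Each step below only extends the logical plan; nothing executes yet.
val transformed = df.withColumn("square", col("id") * col("id"))
                    .filter(col("square") > 100)

transformed.explain()   // inspect the optimized physical plan
transformed.show(5)     // show() is an action and triggers execution
```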
Spark Streaming: This Spark library is primarily maintained by Tathagata Das with help from Matei Zaharia. As the name suggests, this library is for streaming data. It is a very popular Spark library because it takes Spark's big data processing power and cranks up the speed: Spark Streaming has the ability to stream gigabytes per second, and this combination of big and fast data has a lot of potential. Spark Streaming is used for analyzing continuous streams of data; a common example is processing log data from a website or server.
Technically, Spark Streaming is not really streaming. What it actually does is break the incoming data into individual chunks that it processes together as small RDDs. So it does not process the data a byte at a time as it arrives, but rather every second, every two seconds or at some other fixed interval. Strictly speaking, Spark Streaming is therefore not real time but near real time, or micro-batching, which suffices for the vast majority of applications. Spark Streaming can be configured to talk to a variety of data sources: we can simply listen on a port that has data being thrown at it, or connect to sources like Amazon Kinesis, Kafka or Flume through the available connectors. Another good thing about Spark Streaming is that it is reliable. It has a concept called "checkpointing" that stores state to disk periodically, and depending on the kind of data source or receiver being used, it can pick up data from the point of failure. This is a robust mechanism for handling failures such as disk or node failures. Spark Streaming offers exactly-once message guarantees and helps recover lost work without having to write any extra code or add additional configuration.
Just as Spark SQL has the concept of the DataFrame/Dataset built on top of the RDD, Spark Streaming has something called a DStream: a sequence of RDDs that represents the continuous stream of data. The good thing about DStreams is that most of the built-in RDD operations, like flatMap and map, can also be applied to a DStream, and a DStream can be broken into its individual RDDs and processed one chunk at a time. Spark developers can reuse the same code for stream and batch processing and can also integrate the streaming data with historical data.
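Here is a minimal sketch of the DStream API for the kind of log or word counting described above. It assumes a spark-shell session with an existing SparkContext sc; the host, port and checkpoint directory are placeholders.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))        // micro-batch interval of 2 seconds
ssc.checkpoint("/tmp/spark-checkpoint")               // placeholder checkpoint directory

// Each micro-batch received on the socket becomes an RDD inside the DStream.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))              // same operators as on plain RDDs
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()                                        // print a few results per batch

ssc.start()                                           // start receiving and processing
ssc.awaitTermination()
```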
MLlib: Today many companies focus on building customer-centric data products and services, which need machine learning to build predictive insights, recommendations and personalised results. Data scientists can solve these problems using popular languages like Python and R, but they spend a lot of time building and supporting the infrastructure for those languages. Spark has built-in support for doing machine learning and data science at massive scale on a cluster: it is called MLlib, which stands for Machine Learning Library. MLlib is a low-level machine learning library that can be called from the Java, Scala and Python programming languages. It is simple to use, scalable and easily integrated with other tools and frameworks, and it eases the development and deployment of scalable machine learning pipelines. Machine learning is a subject in itself and we cannot go into detail here, but MLlib's important capabilities include common algorithms such as classification, regression, clustering and collaborative filtering, together with utilities for feature extraction, model evaluation and pipeline construction.
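As a small taste, the sketch below trains a logistic regression model with the DataFrame-based spark.ml API, which accompanies the original RDD-based MLlib API. It assumes a spark-shell session, and the inline toy dataset is made up purely for illustration.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// A tiny, made-up training set: (label, feature vector).
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Configure and fit the model; training runs as distributed Spark jobs.
val lr    = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)

// Apply the model and inspect its predictions.
model.transform(training).select("label", "prediction").show()
```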
GraphX: For graphs and graph-parallel processing, Apache Spark provides another API called GraphX. Graph here does not mean charts, line graphs or bar graphs; these are graphs in the computer science sense, like social networks, which consist of vertices, where each vertex is an individual user in the network, and the users are connected to each other by edges. These edges represent the relationships between the users in the network. GraphX is useful for computing overall information about a graph network: for example, it can tell how many triangles appear in the graph and apply the PageRank algorithm to it. It can measure things like "connectedness", degree distribution, average path length and other high-level properties of a graph. It can also join graphs together and transform graphs quickly, and it supports the Pregel API for traversing a graph. Spark GraphX provides the Resilient Distributed Graph (RDG), an abstraction built on Spark RDDs. The RDG API is used by data scientists to perform several graph operations through various computational primitives. Similar to basic RDD operations like map and filter, property graphs also come with basic operators; these operators take UDFs (user-defined functions) and produce new graphs with transformed properties and structure.
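The sketch below builds a tiny "who follows whom" property graph and computes a few of the measures mentioned above. It assumes a spark-shell session with a SparkContext sc; the users and edges are made-up toy data.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices: (id, user name). Edges: who follows whom. Toy data for illustration only.
val users   = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, follows)

// High-level measures of the graph.
println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")
graph.degrees.collect().foreach(println)              // degree of each vertex

// Run PageRank until convergence (tolerance 0.0001) and join the scores back to user names.
val ranks = graph.pageRank(0.0001).vertices
users.join(ranks).collect().foreach { case (_, (name, rank)) =>
  println(f"$name%-6s $rank%.3f")
}
```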
SparkR: The R programming language is widely used by data scientists due to its simplicity and its ability to run complex algorithms. But R suffers from the problem that its data processing capacity is limited to a single node, which makes plain R unusable when processing huge amounts of data. This problem is solved by SparkR, an R package in Apache Spark. SparkR provides a data frame implementation that supports operations like selection, filtering and aggregation on large distributed datasets, and it also supports distributed machine learning through Spark MLlib.