Real-time Processing Systems
Apache Spark
1
Apache Spark
• Apache Spark is a lightning-fast cluster computing framework designed for fast
computation
• It builds on the Hadoop MapReduce model and extends it to efficiently support
more types of computation, including interactive queries and stream processing
• Spark is not a modified version of Hadoop and does not really depend on
Hadoop, because it has its own cluster management
• Spark can use Hadoop in two ways – one is storage and the second is
processing. Since Spark has its own cluster management and computation
engine, it typically uses Hadoop for storage only
2
Apache Spark
• The main feature of Spark is its in-memory cluster computing, which
increases the processing speed of an application
• Spark is designed to cover a wide range of workloads such as batch
applications, iterative algorithms, interactive queries, and streaming
• Apart from supporting all these workloads in a single system, it
reduces the management burden of maintaining separate tools
3
Features of Apache Spark
• Speed − Spark runs applications on a Hadoop cluster up to 100 times
faster in memory, and 10 times faster when running on disk. This is
possible because it reduces the number of read/write operations to
disk and stores intermediate processing data in memory
• Supports multiple languages − Spark provides built-in APIs in Java,
Scala, and Python, so you can write applications in different
languages (see the sketch after this slide)
• Advanced analytics − Spark supports not only ‘map’ and ‘reduce’. It
also supports SQL queries, streaming data, machine learning (ML),
and graph algorithms
4
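To make the API concrete, here is a minimal PySpark sketch of a word count; the local master URL and the input file name input.txt are illustrative assumptions, not part of the original slides:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCount")  # local mode, all cores (assumed setup)

    # Classic word count: transformations build the pipeline lazily
    counts = (sc.textFile("input.txt")          # hypothetical input file
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))  # take() is an action that triggers execution
    sc.stop()

The same logic can be written in Scala or Java against the same API.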
Components of Spark
• The following illustration depicts the different components of Spark
Apache Spark Core
• Spark Core is the underlying general execution engine for spark platform that all
other functionality is built upon. It provides In-Memory computing and
referencing datasets in external storage systems
5
Components of Spark
Spark SQL
• Spark SQL is a component on top of Spark Core that introduces a data abstraction called
SchemaRDD (which later evolved into the DataFrame), providing support for structured and semi-structured data
Spark Streaming
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations
on those mini-batches of data
MLlib (Machine Learning Library)
• MLlib is a distributed machine learning framework on top of Spark, taking advantage of Spark's
distributed, memory-based architecture. Spark MLlib is reported to be nine times as fast as the
Hadoop disk-based version of Apache Mahout
GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computations that can model user-defined graphs using the Pregel abstraction
API
6
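As an illustration of the Spark SQL component, here is a minimal sketch using the DataFrame-era API that succeeded SchemaRDD; the table name and rows are made up for the example:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext("local[*]", "SparkSQLExample")
    sqlContext = SQLContext(sc)

    # A small structured dataset, registered as a temporary table
    people = sc.parallelize([Row(name="Ann", age=34), Row(name="Bob", age=28)])
    df = sqlContext.createDataFrame(people)
    df.registerTempTable("people")   # hypothetical table name

    adults = sqlContext.sql("SELECT name FROM people WHERE age > 30")
    print(adults.collect())
    sc.stop()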
Spark Architecture
The Spark architecture includes the following three main components:
• Data Storage
• API
• Resource Management
Data Storage:
• Spark uses the HDFS file system for data storage purposes. It works with
any Hadoop-compatible data source, including HDFS, HBase,
Cassandra, etc.
7
Spark Architecture
API:
• The API enables application developers to create Spark-based
applications using a standard API interface. Spark provides APIs for the
Scala, Java, and Python programming languages
Resource Management:
• Spark can be deployed as a stand-alone server, or it can run on a
distributed computing framework such as Mesos or YARN
8
Resilient Distributed Datasets
• Resilient Distributed Datasets (RDDs) are the core concept in the Spark framework
• Spark stores data in RDDs on different partitions
• They help with rearranging the computations and optimizing the data
processing
• They are also fault tolerant, because an RDD knows how to recreate
and recompute its datasets
• RDDs are immutable. You can modify an RDD with a transformation,
but the transformation returns a new RDD whereas the original
RDD remains the same (see the sketch after this slide)
9
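A minimal sketch of RDD immutability; the sample data is made up:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "ImmutableRDDs")

    nums = sc.parallelize([1, 2, 3, 4])   # original RDD
    doubled = nums.map(lambda x: x * 2)   # transformation returns a NEW RDD

    print(nums.collect())     # [1, 2, 3, 4], the original is unchanged
    print(doubled.collect())  # [2, 4, 6, 8]
    sc.stop()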
Resilient Distributed Datasets
• RDDs provide an API for various transformations and materializations of
data, as well as control over caching and partitioning of elements
to optimize data placement
• An RDD can be created either from external storage or from another RDD,
and it stores information about its parents so that partitions can be
recomputed in case of failure
10
Resilient Distributed Datasets
RDDs support two types of operations:
• Transformation: Transformations don't return a single value; they return a
new RDD. Nothing gets evaluated when you call a transformation function;
it just takes an RDD and returns a new RDD
• Some of the transformation functions are map, filter, flatMap, groupByKey,
reduceByKey, aggregateByKey, pipe, and coalesce
• Action: An action evaluates and returns a value. When an
action function is called on an RDD object, all the data processing queries
are computed at that time and the resulting value is returned
• Some of the action operations are reduce, collect, count, first, take,
countByKey, and foreach
11
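The following sketch shows the lazy evaluation described above; the numbers are arbitrary:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "LazyEvaluation")

    rdd = sc.parallelize(range(1, 101))
    evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet
    squares = evens.map(lambda x: x * x)       # still nothing runs

    # reduce() is an action: the whole pipeline executes here
    total = squares.reduce(lambda a, b: a + b)
    print(total)
    sc.stop()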
RDD Persistence
• One of the most important capabilities in Spark is persisting (or
caching) a dataset in memory across operations
• When you persist an RDD, each node stores any partitions of it that it
computes in memory and reuses them in other actions on that
dataset. This allows future actions to be much faster
• Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will
automatically be recomputed using the transformations that
originally created it
12
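A minimal persistence sketch, assuming a hypothetical log file events.log:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[*]", "PersistExample")

    logs = sc.textFile("events.log")              # hypothetical input
    errors = logs.filter(lambda line: "ERROR" in line)
    errors.persist(StorageLevel.MEMORY_ONLY)      # or simply errors.cache()

    print(errors.count())   # first action: computes and caches the partitions
    print(errors.take(5))   # reuses the cached partitions instead of re-reading
    sc.stop()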
Components
13
Components
• Spark applications run as independent sets of processes on a cluster,
coordinated by the SparkContext object in the main program (called the driver
program)
• The SparkContext can connect to several types of cluster managers (either
Spark’s own standalone cluster manager, Mesos, or YARN), which allocate
resources across applications
• Spark acquires executors on nodes in the cluster, which are processes that
run computations and store data for the application
• Next, it sends application code (defined by JAR or Python files passed to
SparkContext) to the executors
• Finally, SparkContext sends tasks to the executors to run
14
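The cluster manager is selected by the master URL when the SparkContext is configured; the host names below are hypothetical:

    from pyspark import SparkConf, SparkContext

    # Example master URLs (illustrative, not from the slides):
    #   "spark://master-host:7077"  -> Spark's standalone cluster manager
    #   "yarn"                      -> YARN
    #   "mesos://master-host:5050"  -> Mesos
    #   "local[4]"                  -> local mode with 4 worker threads
    conf = SparkConf().setAppName("MyApp").setMaster("spark://master-host:7077")
    sc = SparkContext(conf=conf)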
Components
There are several useful things to note about this architecture:
• Each application gets its own executor processes, which stay up for
the duration of the whole application and run tasks in multiple
threads
• The driver program must listen for and accept incoming connections
from its executors throughout its lifetime. As such, the driver program
must be network addressable from the worker nodes
• Because the driver schedules tasks on the cluster, it should be run
close to the worker nodes, preferably on the same local area network
15
Spark Streaming
• Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams
• Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP
sockets, and can be processed using complex algorithms expressed with
high-level functions like map, reduce, join and window
• Finally, processed data can be pushed out to filesystems
16
Spark Streaming
• Spark Streaming works by dividing the live stream of data
into batches (called micro-batches) of a pre-defined interval (N
seconds) and then treating each batch of data as an RDD
• It is important to choose the time interval for Spark Streaming based
on your use case and data processing requirements
• If the value of N is too low, the micro-batches will not have
enough data to give meaningful results during analysis
17
Spark Streaming
Figure: How Spark Streaming works
18
Spark Streaming
• Spark Streaming receives live input data streams and divides the data
into batches, which are then processed by the Spark engine to
generate the final stream of results in batches
• Spark Streaming provides a high-level abstraction called discretized
stream or DStream, which represents a continuous stream of data.
Internally, a DStream is represented as a sequence of RDDs
19
Discretized Streams (DStreams)
• A DStream represents a continuous stream of data, either the input data
stream received from a source or the processed data stream generated
by transforming the input stream
• Internally, a DStream is represented by a continuous series of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset
• Each RDD in a DStream contains data from a certain interval
20
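A minimal Spark Streaming sketch tying these pieces together; the socket source on localhost:9999 and the 5-second batch interval are assumptions for the example:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "NetworkWordCount")  # >= 2 threads: one feeds the receiver
    ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches (the "N")

    # Each micro-batch of lines from the socket becomes one RDD in the DStream
    lines = ssc.socketTextStream("localhost", 9999)    # hypothetical TCP source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()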
Spark runtime components
21
Figure 1: Spark runtime components in cluster deploy mode. Elements of a Spark application are in blue
boxes and an application’s tasks running inside task slots are labeled with a “T”. Unoccupied task slots
are in white boxes.
Responsibilities of the client process
component
• The client process starts the driver program
• For example, the client process can be a spark-submit script for
running applications, a spark-shell script, or a custom application
using the Spark API
• The client process prepares the classpath and all configuration
options for the Spark application
• It also passes application arguments, if any, to the application running
inside the driver
22
Responsibilities of the driver component
• The driver orchestrates and monitors the execution of a Spark application
• There’s always one driver per Spark application
• The Spark context and scheduler, which run inside the driver, are responsible for:
• Requesting memory and CPU resources from cluster managers
• Breaking application logic into stages and tasks
• Sending tasks to executors
• Collecting the results
23
Responsibilities of the driver component
24
Figure 2: Spark runtime components in client deploy mode. The driver is running inside the client’s
JVM process.
Responsibilities of the driver component
Two basic ways the driver program can be run are:
• Cluster deploy mode, depicted in figure 1. In this mode, the driver
runs as a separate JVM process inside the cluster, and the
cluster manages its resources
• Client deploy mode, depicted in figure 2. In this mode, the driver
runs inside the client’s JVM process and communicates with the
executors managed by the cluster
25
Responsibilities of the executors
• The executors, which are JVM processes, accept tasks from the driver,
execute those tasks, and return the results to the driver
• Each executor has several task slots (or CPU cores) for running tasks in
parallel
• Although these task slots are often referred to as CPU cores in Spark,
they’re implemented as threads and don’t need to correspond to the
number of physical CPU cores on the machine
26
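The number of task slots per executor is driven by configuration; a sketch using two real Spark settings, with a hypothetical standalone master:

    from pyspark import SparkConf, SparkContext

    # Slots per executor = spark.executor.cores / spark.task.cpus
    conf = (SparkConf()
            .setAppName("TaskSlotsDemo")
            .setMaster("spark://master-host:7077")  # hypothetical cluster
            .set("spark.executor.cores", "4")       # 4 task slots per executor
            .set("spark.task.cpus", "1"))           # each task occupies 1 slot
    sc = SparkContext(conf=conf)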
Creation of the Spark context
• Once the driver is started, it configures an instance of SparkContext
• When you run a standalone Spark application by submitting a jar file,
or by using the Spark API from another program, your Spark application
starts and configures the Spark context
• There can be only one Spark context per JVM
27
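Because there can be only one Spark context per JVM, PySpark offers SparkContext.getOrCreate, which returns the existing context if one is already running; a minimal sketch:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("SingleContext").setMaster("local[*]")
    sc1 = SparkContext.getOrCreate(conf)
    sc2 = SparkContext.getOrCreate(conf)   # returns the SAME context
    print(sc1 is sc2)                      # True: one Spark context per JVM
    sc1.stop()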
High-level architecture
• Spark provides a well-defined, layered architecture in which all its
layers and components are loosely coupled, and integration with
external components/libraries/extensions is performed using
well-defined contracts
28
High-level architecture
• Physical machines: This layer represents the physical or virtual machines/nodes on which Spark jobs are executed. These
nodes collectively represent the total capacity of the cluster with respect to CPU, memory, and data storage.
• Data storage layer: This layer provides the APIs to store and retrieve data from the persistent storage area for Spark
jobs/applications. It is used by Spark workers to dump data to persistent storage whenever the cluster
memory is not sufficient to hold the data. Spark is extensible and capable of using any kind of filesystem. RDDs, which hold
the data, are agnostic to the underlying storage layer and can persist the data in various persistent stores, such as
local filesystems, HDFS, or other stores such as HBase, Cassandra, MongoDB, S3, and Elasticsearch.
• Resource manager: The architecture of Spark abstracts out the deployment of the Spark framework and its associated
applications. Spark applications can leverage cluster managers such as YARN and Mesos for the allocation and deallocation
of various physical resources, such as CPU and memory, for the client jobs. The resource manager layer provides the
APIs used to request the allocation and deallocation of available resources across the cluster.
• Spark core libraries: The Spark core library represents the Spark Core engine, which is responsible for the execution of
Spark jobs. It contains APIs for in-memory distributed data processing and a generalized execution model that supports a
wide variety of applications and languages.
• Spark extensions/libraries: This layer represents the additional frameworks/APIs/libraries developed by extending the
Spark core APIs to support different use cases. For example, Spark SQL is one such extension, developed to
perform ad hoc queries and interactive analysis over large datasets.
29
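To illustrate the storage-agnostic data storage layer, the same RDD write call targets different backends purely through the path scheme; all paths below are hypothetical:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "StorageAgnostic")
    rdd = sc.parallelize(["a", "b", "c"])

    # Identical API, different storage layers (paths are illustrative):
    rdd.saveAsTextFile("file:///tmp/out")                  # local filesystem
    # rdd.saveAsTextFile("hdfs://namenode:8020/data/out")  # HDFS
    # rdd.saveAsTextFile("s3a://bucket/data/out")          # S3 (needs the S3 connector)
    sc.stop()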
Spark execution model – master worker view
31
Spark execution model – master worker view
• Spark is built around the concepts of Resilient Distributed Datasets
and the Directed Acyclic Graph (DAG) representing the transformations and
dependencies between them
32
Spark execution model – master worker view
• A Spark application (often referred to as the Driver Program or Application
Master) at a high level consists of the SparkContext and user code, which
interacts with it by creating RDDs and performing a series of
transformations to achieve the final result
• These RDD transformations are then translated into a DAG and
submitted to the scheduler to be executed on a set of worker nodes
33
Execution workflow
• User code containing RDD transformations forms a Directed Acyclic Graph,
which is then split into stages of tasks by the DAGScheduler
• Tasks run on workers, and the results are then returned to the client
34
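The lineage that the DAGScheduler splits into stages can be inspected with the RDD's toDebugString method; a minimal sketch:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "DAGInspect")
    rdd = (sc.parallelize(range(100))
             .map(lambda x: (x % 10, x))
             .reduceByKey(lambda a, b: a + b))  # reduceByKey introduces a shuffle,
                                                # i.e. a stage boundary

    # Prints the RDD lineage; indentation marks the stage boundaries.
    # toDebugString returns bytes in some PySpark versions, hence the check.
    debug = rdd.toDebugString()
    print(debug.decode("utf-8") if isinstance(debug, bytes) else debug)
    sc.stop()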
Execution workflow
37
Execution workflow
• SparkContext
• represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast
variables on that cluster
• DAGScheduler
• computes a DAG of stages for each job and submits them to TaskScheduler
• determines preferred locations for tasks (based on cache status or shuffle file locations) and finds a minimum
schedule to run the jobs
• TaskScheduler
• responsible for sending tasks to the cluster, running them, retrying if there are failures, and mitigating
stragglers
• SchedulerBackend
• backend interface for scheduling systems that allows plugging in different implementations (Mesos, YARN,
Standalone, local)
• BlockManager
• provides interfaces for putting and retrieving blocks both locally and remotely into various stores (memory,
disk, and off-heap)
38
Reference
• Data Stream Management Systems: Apache Spark Streaming
• http://freecontent.manning.com/running-spark-an-overview-of-sparks-runtime-architecture/
• https://www.packtpub.com/books/content/spark-%E2%80%93-architecture-and-first-program
• https://0x0fff.com/spark-architecture/
• http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
• https://github.com/apache/spark
• http://spark.apache.org/docs/latest/
• https://github.com/JerryLead/SparkInternals
39
THANKS !
40
