This presentation covers the following topics:
- Modern Data Processing System Architectures and Models,
- Batch and Stream Processing Pipelines in detail,
- Apache Spark Architecture and Internals,
- Real-life use cases of Apache Spark.
2. AGENDA
- Data Processing System Architectures
- What is Apache Spark?
- Ecosystem & Terminology
- How to create DataSets?
- Operation Types (Transformations & Actions) & Hands-on Code
- Execution Tiers
- Spark on YARN
- Real-Life Use Cases
3. BIO
- B.Sc & M.Sc in Electronics & Control Engineering
- Data Engineer @
- Working on Data Analytics (Data Transformations & Cleaning)
- Open Source Contributor @
  { Apache Spark | Pulsar | Heron }
erenavsarogullari
4. DATA PROCESSING SYSTEM ARCHITECTURES
[Diagram: Traditional vs Modern]
- Traditional: a single App process reads from and writes to a database (e.g. MySQL); the data is moved to the code.
- Modern: a Cluster Manager distributes the App across Workers 1..N, each operating on its own partition of a Distributed File System; the code is moved to the data.
5. DATA PROCESSING SYSTEMS
[Spectrum: Batch (high throughput) -> Micro-Batch (near real-time) -> Streaming (low latency / high velocity, real-time)]

Parameters      Batch Systems    Streaming Systems
Data            Bounded          Unbounded
Job Execution   Finite           Infinite
Throughput      High             Low
Latency         High             Low
6. BATCH VS STREAMING PIPELINES
Batch Processing Pipeline:
SOURCE: Persisted DataSet (e.g: csv, json, parquet) -> TRANSFORMATION: Batch Processor (e.g: MapReduce, Spark) -> SINK: Serving layer (e.g: HBase) -> Application

Stream Processing Pipeline:
SOURCE: IoT, Social Media, Logs -> Pub/Sub (e.g: Kafka, Pulsar) -> TRANSFORMATION: Stream Processor (e.g: Spark, Flink, Storm, Heron) -> SINK: Serving layer (e.g: HBase) -> Application

Data arrives in different formats and at different velocities.
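A minimal sketch of both pipelines with Spark's structured APIs; the paths, broker address, topic name, and column names are assumptions for illustration, the console sink stands in for a real serving layer such as HBase, and the Kafka source requires the spark-sql-kafka package on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()

// Batch pipeline: bounded source -> transformation -> sink
spark.read.parquet("/data/events")                      // hypothetical bounded source
  .groupBy("userId").count()                            // transformation (hypothetical column)
  .write.parquet("/data/event-counts")                  // hypothetical sink

// Stream pipeline: unbounded Pub/Sub source -> transformation -> sink
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // hypothetical broker
  .option("subscribe", "events")                        // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS event")         // transformation
  .writeStream
  .format("console")                                    // stand-in for a serving sink
  .start()
  .awaitTermination()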
7. WHAT IS APACHE SPARK?
- Distributed Compute Engine
- Project started in 2009 at UC Berkeley
- First version (v0.5) was released in June 2012
- Moved to the Apache Software Foundation in 2013
- 1300+ Contributors / 18K+ Forks on GitHub
- Supported Languages: Java, Scala, Python and R
- spark-packages.org => ~440 Extensions
- Apache Bahir => http://bahir.apache.org/
- Community vs Enterprise Editions =>
  https://databricks.com/product/comparing-databricks-to-apache-spark
9. TERMINOLOGY
- RDD: Resilient Distributed Dataset; immutable, resilient and partitioned.
- Application: An instance of SparkContext; single per JVM.
- Job: A computation triggered by an action operator.
- DAG: Directed Acyclic Graph; the execution plan of a job (a.k.a. the RDD dependency graph).
- Driver: The program that runs the job on the Spark engine.
- Executor: The process that executes tasks.
- Worker: The node that runs executors.
10. HOW TO CREATE A DATASET?
- By loading a file: spark.read.format("csv").load()
- SparkSession.createDataset(collection or RDD)
- SparkSession.createDataFrame(RDD, schema).as[Model]
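A minimal sketch of all three creation paths; the User model, the CSV path, and the sample rows are hypothetical:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical model, used only for illustration
case class User(name: String, age: Int)

val spark = SparkSession.builder().appName("dataset-creation").master("local[*]").getOrCreate()
import spark.implicits._

// 1) By loading a file (hypothetical path; header and schema inference assumed)
val fromFile = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("src/main/resources/users.csv")
  .as[User]

// 2) From a local collection
val fromCollection = spark.createDataset(Seq(User("Antonio", 35), User("Anna", 28)))

// 3) From an RDD plus an explicit schema, typed via as[Model]
val rowRdd = spark.sparkContext.parallelize(Seq(Row("Vivaldi", 42)))
val schema = StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))
val fromRdd = spark.createDataFrame(rowRdd, schema).as[User]

fromCollection.show()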
11. OPERATION TYPES
Two types of Spark operations on a DataSet:
- Transformations: lazily evaluated (not computed immediately)
- Actions: trigger the computation and return a value

Data -> Transformations -> DataSet -> Actions -> Value

High-Level Spark Data Processing Pipeline: Source -> Transformation Operators -> Sink
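A minimal sketch of the two operation types, assuming a local SparkSession; the numbers are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("operation-types").master("local[*]").getOrCreate()
import spark.implicits._

val numbers = spark.createDataset(1 to 10)

// Transformations only build up the execution plan; nothing runs yet
val evens   = numbers.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Actions trigger the computation and return a value to the driver
println(doubled.reduce(_ + _)) // 60
println(doubled.count())       // 5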
13. JOB EXECUTION FLOW
[Diagram: A Job is split into Stages; each Stage produces a TaskSet (Task I ... Task N), and the Driver schedules the Tasks across the cluster's Workers, one Task per Partition.]
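A minimal sketch of this flow, assuming a local SparkSession and illustrative sample words: reduceByKey introduces a shuffle, so the collect() job splits into two stages, and RDD.toDebugString prints the lineage with the stage boundary visible.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stage-boundaries").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

// reduceByKey is a wide dependency, so the job splits into two stages:
// Stage 1: parallelize -> map (narrow dependencies, pipelined per partition)
// Stage 2: reduceByKey, after the shuffle boundary
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// Print the RDD lineage; indentation marks the shuffle (stage) boundary
println(counts.toDebugString)

// The action submits the job; each stage runs as a TaskSet,
// one task per partition, scheduled by the driver onto the workers
counts.collect().foreach(println)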
14. EXECUTION TIERS
=> The main program is executed on the Spark Driver
=> Transformations are executed on the Spark Workers
=> An action returns the results from the Workers to the Driver

val wordCountTuples: Array[(String, Int)] = sparkSession.sparkContext
  .textFile("src/main/resources/vivaldi_life.txt") // create an RDD of lines
  .flatMap(_.split(" "))                           // split each line into words
  .map(word => (word, 1))                          // pair each word with a count of 1
  .reduceByKey(_ + _)                              // sum the counts per word (shuffle)
  .collect()                                       // action: ship results back to the driver
wordCountTuples.foreach(println)                   // prints on the driver
17. REAL LIFE USE CASES
DataEditor: Data Transformation Platform
PROD TOPOLOGY
- Running on 5 DCs (Private Cloud & AWS)
- Running with 19 Nodes
- ~400K Spark jobs submitted per week
- ~1.5M HTTP calls per week
SPARK RESOURCES
- 8 GB Memory per Executor
- 5 Cores per Executor
- 4 Executors per SparkContext
PROD STACK
18. DATAEDITOR SERVICES
[Diagram: DataEditor Cluster - UI -> Microservices -> { DataGrid Service, File Service, Validation Service }]

Http Operation   Description
/grids           Create a new grid
/count           Get grid count
/content         Get grid content
/save            Save grid to HDFS/S3
/join            Join multiple grids
/groupBy         Group related columns
…