DATA PROCESSING WITH
APACHE SPARK
EREN AVŞAROĞULLARI
DATA SCIENCE & ENGINEERING CLUB
DUBLIN, 23 MARCH 2019
AGENDA
• Data Processing System Architectures
• What is Apache Spark?
• Ecosystem & Terminology
• How to Create DataSets?
• Operation Types (Transformations & Actions) & Hands-on Code
• Execution Tiers
• Spark on YARN
• Real Life Use Cases
BIO
• B.Sc. & M.Sc. in Electronics & Control Engineering
• Data Engineer @
• Working on Data Analytics (Data Transformations & Cleaning)
• Open Source Contributor @ { Apache Spark | Pulsar | Heron }
erenavsarogullari
DATA PROCESSING SYSTEM ARCHITECTURES
[Diagram] Traditional: a single App process reads from and writes to MySQL, so the data is moved to the code. Modern: an App submits code to a Cluster Manager, which distributes it across Workers 1..N backed by a Distributed File System, so the code is moved to the data.
DATA PROCESSING SYSTEMS
[Diagram] Spectrum: Batch -> Micro Batch -> Streaming, trading high throughput for low latency / high velocity, and moving from near real-time toward real-time.
Parameters      Batch Systems    Streaming Systems
Data            Bounded          Unbounded
Job Execution   Finite           Infinite
Throughput      High             Low
Latency         High             Low
BATCH VS STREAMING PIPELINES
Batch Processing Pipeline:
Persistence / DataSet (e.g. CSV, JSON, Parquet) -> Batch Processor (e.g. MapReduce, Spark) -> Serving (e.g. HBase) -> Application

Stream Processing Pipeline:
Sources (IoT, Social Media, Logs) -> Pub/Sub (e.g. Kafka, Pulsar) -> Stream Processor (e.g. Spark, Flink, Storm, Heron) -> Serving (e.g. HBase) -> Application

Both follow SOURCE -> TRANSFORMATION -> SINK. Data comes in different formats and at different velocities.
WHAT IS APACHE SPARK?
• Distributed Compute Engine
• Project started in 2009 at UC Berkeley
• First version (v0.5) released in June 2012
• Moved to the Apache Software Foundation in 2013
• 1300+ Contributors / 18K+ Forks on GitHub
• Supported Languages: Java, Scala, Python and R
• spark-packages.org => ~440 Extensions
• Apache Bahir => http://bahir.apache.org/
• Community vs Enterprise Editions => https://databricks.com/product/comparing-databricks-to-apache-spark
SPARK ECOSYSTEM
[Diagram] Libraries on top of the Spark Core Engine: Spark SQL, Spark Streaming (Classical & Structured), MLlib, GraphX.
Deployment: Local (Local Mode); Standalone, YARN, Mesos, Kubernetes (Cluster Mode).
TERMINOLOGY
• RDD: Resilient Distributed Dataset; immutable, resilient and partitioned.
• Application: An instance of SparkContext; single per JVM.
• Job: Computation triggered by an action operator.
• DAG: Directed Acyclic Graph; the execution plan of a job (a.k.a. RDD dependency graph).
• Driver: The program running the job over the Spark engine.
• Executor: The process executing tasks.
• Worker: The node running executors.
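A minimal Scala sketch tying these terms together (local mode; the app name is illustrative):

import org.apache.spark.sql.SparkSession

// Application: one SparkContext per JVM (wrapped by SparkSession here)
val spark = SparkSession.builder()
  .appName("terminology-demo")
  .master("local[*]")
  .getOrCreate()

// RDD: an immutable, partitioned collection
val rdd = spark.sparkContext.parallelize(1 to 100)

// Job: the count() action triggers computation; the Driver builds the DAG
// and schedules Tasks on Executors running on Worker nodes
val total = rdd.count()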
HOW TO CREATE A DATASET?
By loading a file: spark.read.format("csv").load()
SparkSession.createDataset(collection or RDD)
SparkSession.createDataFrame(RDD, schema).as[Model]
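A minimal sketch of all three approaches (the Person case class and the people.csv path are illustrative assumptions):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

case class Person(name: String, age: Int) // hypothetical model for illustration

val spark = SparkSession.builder().appName("create-datasets").master("local[*]").getOrCreate()
import spark.implicits._

// 1. Load from a file (hypothetical CSV with a header row)
val fromFile = spark.read.format("csv")
  .option("header", "true")
  .load("src/main/resources/people.csv")

// 2. Create a typed Dataset from a local collection
val fromCollection = spark.createDataset(Seq(Person("Ada", 36), Person("Alan", 41)))

// 3. Create a DataFrame from an RDD + schema, then convert to a typed Dataset
val rdd = spark.sparkContext.parallelize(Seq(Row("Ada", 36), Row("Alan", 41)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)))
val fromRdd = spark.createDataFrame(rdd, schema).as[Person]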
OPERATION TYPES
Two types of Spark operations on a Dataset:
• Transformations: lazily evaluated (not computed immediately)
• Actions: trigger the computation and return a value
[Diagram] High-Level Spark Data Processing Pipeline: Data -> Transformations -> DataSet -> Actions -> Value
(Source -> Transformation Operators -> Sink)
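A short sketch of the two operation types (assumes an existing SparkContext named sc):

// assumes: val sc = sparkSession.sparkContext
val numbers = sc.parallelize(1 to 1000)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: lazy, no job yet
val doubled = evens.map(_ * 2)             // transformation: still lazy
val total   = doubled.count()              // action: triggers the job, returns a value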
TRANSFORMATIONS’ CLASSIFICATIONS
[Diagram] Narrow Transformations (1 x 1): each partition of RDD 1 feeds exactly one partition of RDD 2.
[Diagram] Wide Transformations (Shuffles, 1 x n): one partition of RDD 1 can feed multiple partitions of RDD 2.
Shuffles require:
- Disk I/O
- Ser/De
- Network I/O
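An illustrative sketch (assumes a SparkContext named sc; map is narrow, reduceByKey is wide):

// assumes: val sc = sparkSession.sparkContext
val words  = sc.parallelize(Seq("spark", "flink", "spark", "heron"))
val pairs  = words.map(w => (w, 1))        // narrow: stays within each partition
val counts = pairs.reduceByKey(_ + _)      // wide: shuffle repartitions data by key
println(counts.toDebugString)              // prints the lineage with the shuffle boundary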
JOB EXECUTION FLOW
[Diagram] A Job (Job I) is divided into Stages, and each Stage is submitted as a TaskSet (TaskSet I, II, III) of Tasks; the Driver schedules these Tasks onto the partitions held by Workers 1..N across the cluster.
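A small sketch of the flow (assumes a SparkContext named sc; distinct() introduces a shuffle, so its job splits into two stages):

// assumes: val sc = sparkSession.sparkContext
val data = sc.parallelize(1 to 10, numSlices = 4)  // 4 partitions -> 4 tasks per stage
data.count()              // Job 1: no shuffle, so a single stage / TaskSet
data.distinct().count()   // Job 2: distinct() shuffles, so two stages / TaskSets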
EXECUTION TIERS
=> The main program is executed on the Spark Driver
=> Transformations are executed on the Spark Workers
=> An Action returns the results from the Workers to the Driver

val wordCountTuples: Array[(String, Int)] = sparkSession.sparkContext
  .textFile("src/main/resources/vivaldi_life.txt")
  .flatMap(_.split(" "))                 // transformation: split lines into words
  .map(word => (word, 1))                // transformation: pair each word with 1
  .reduceByKey(_ + _)                    // wide transformation: sum counts per word
  .collect()                             // action: bring results back to the driver
wordCountTuples.foreach(println)
JOB SCHEDULING
• Single Application
  • FIFO
  • FAIR
• Across Applications
  • Static Allocation
  • Dynamic Allocation (Auto / Elastic Scaling), see the sketch after this list
    • spark.dynamicAllocation.enabled
    • spark.dynamicAllocation.executorIdleTimeout
    • spark.dynamicAllocation.initialExecutors
    • spark.dynamicAllocation.minExecutors
    • spark.dynamicAllocation.maxExecutors
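A minimal sketch of enabling dynamic allocation when building the session (values are illustrative; dynamic allocation on YARN also needs the external shuffle service):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.initialExecutors", "2")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .config("spark.shuffle.service.enabled", "true") // required on YARN for dynamic allocation
  .getOrCreate()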
SPARK ON YARN
REAL LIFE USE CASES
DataEditor: Data Transformation Platform
PROD TOPOLOGY
• Running on 5 DCs (Private Cloud & AWS)
• Running with 19 Nodes
• ~400K submitted Spark jobs per week
• ~1.5M HTTP calls per week
SPARK RESOURCES
• 8 GB Memory per Executor
• 5 Cores per Executor
• 4 Executors per SparkContext
PROD STACK
DATAEDITOR SERVICES
[Diagram] DataEditor Cluster: UI -> Micro Services (DataGrid Service, File Service, Validation Service)

HTTP Operation   Description
/grids           Create new grid
/count           Get grid count
/content         Get grid content
/save            Save grid to HDFS/S3
/join            Join multiple grids
/groupBy         Group related columns
…
DATA TRANSFORMATIONS
[Diagram] Grid Revision I -> Transformations -> Grid Revision II
Transformations: AddColumn, UpdateColumn, DeleteColumn, RenameColumn, DeleteRow

SINGLE RULE EXECUTION (User Defined)
[Diagram] Same Template: Source Grids 1..N -> Mapping Rules (Mappings) -> Target Grids 1..N

BATCH RULES EXECUTION (User Defined)
[Diagram] Different Template: Source Grid I -> Mapping Rules I -> Target Grid I
REFERENCES
• https://spark.apache.org/docs/latest/
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• https://stackoverflow.com/questions/36215672/spark-yarn-architecture
• https://pulsar.apache.org/
Q & A