DATA PROCESSING WITH
APACHE SPARK
EREN AVŞAROĞULLARI
DATA SCIENCE & ENGINEERING CLUB
DUBLIN, 23 MARCH 2019
AGENDA
• Data Processing System Architectures
• What is Apache Spark?
• Ecosystem & Terminology
• How to Create DataSets?
• Operation Types (Transformations & Actions) & Hands-on Code
• Execution Tiers
• Spark on YARN
• Real Life Use Cases
BIO
• B.Sc. & M.Sc. in Electronics & Control Engineering
• Data Engineer @
• Working on Data Analytics (Data Transformations & Cleaning)
• Open Source Contributor @ { Apache Spark | Pulsar | Heron }
erenavsarogullari
DATA PROCESSING SYSTEM ARCHITECTURES
[Diagram] Traditional: a single App process reads from and writes to MySQL, so the data is moved to the code. Modern: an App submits code to a Cluster Manager, which distributes it across Workers 1..N backed by a Distributed File System, so the code is moved to the data.
DATA PROCESSING SYSTEMS
[Diagram] Spectrum: Batch -> Micro Batch -> Streaming, trading high throughput for low latency / high velocity, and moving from near real-time toward real-time.
Parameters      Batch Systems    Streaming Systems
Data            Bounded          Unbounded
Job Execution   Finite           Infinite
Throughput      High             Low
Latency         High             Low
BATCH VS STREAMING PIPELINES
Batch Processing Pipeline:
Persistence / DataSet (e.g. CSV, JSON, Parquet) -> Batch Processor (e.g. MapReduce, Spark) -> Serving (e.g. HBase) -> Application

Stream Processing Pipeline:
Sources (IoT, Social Media, Logs) -> Pub/Sub (e.g. Kafka, Pulsar) -> Stream Processor (e.g. Spark, Flink, Storm, Heron) -> Serving (e.g. HBase) -> Application

Both follow SOURCE -> TRANSFORMATION -> SINK. Data comes in different formats and at different velocities.
WHAT IS APACHE SPARK?
• Distributed Compute Engine
• Project started in 2009 at UC Berkeley
• First version (v0.5) released in June 2012
• Moved to the Apache Software Foundation in 2013
• 1300+ Contributors / 18K+ Forks on GitHub
• Supported Languages: Java, Scala, Python and R
• spark-packages.org => ~440 Extensions
• Apache Bahir => http://bahir.apache.org/
• Community vs Enterprise Editions => https://databricks.com/product/comparing-databricks-to-apache-spark
SPARK ECOSYSTEM
[Diagram] Libraries on top of the Spark Core Engine: Spark SQL, Spark Streaming (Classical & Structured), MLlib, GraphX.
Deployment: Local (Local Mode); Standalone, YARN, Mesos, Kubernetes (Cluster Mode).
TERMINOLOGY
• RDD: Resilient Distributed Dataset; immutable, resilient and partitioned.
• Application: An instance of SparkContext; single per JVM.
• Job: Computation triggered by an action operator.
• DAG: Directed Acyclic Graph; the execution plan of a job (a.k.a. RDD dependency graph).
• Driver: The program running the job over the Spark engine.
• Executor: The process executing tasks.
• Worker: The node running executors.
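A minimal Scala sketch tying these terms together (local mode; the app name is illustrative):

import org.apache.spark.sql.SparkSession

// Application: one SparkContext per JVM (wrapped by SparkSession here)
val spark = SparkSession.builder()
  .appName("terminology-demo")
  .master("local[*]")
  .getOrCreate()

// RDD: an immutable, partitioned collection
val rdd = spark.sparkContext.parallelize(1 to 100)

// Job: the count() action triggers computation; the Driver builds the DAG
// and schedules Tasks on Executors running on Worker nodes
val total = rdd.count()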
HOW TO CREATE A DATASET?
By loading a file: spark.read.format("csv").load()
SparkSession.createDataset(collection or RDD)
SparkSession.createDataFrame(RDD, schema).as[Model]
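A minimal sketch of all three approaches (the Person case class and the people.csv path are illustrative assumptions):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

case class Person(name: String, age: Int) // hypothetical model for illustration

val spark = SparkSession.builder().appName("create-datasets").master("local[*]").getOrCreate()
import spark.implicits._

// 1. Load from a file (hypothetical CSV with a header row)
val fromFile = spark.read.format("csv")
  .option("header", "true")
  .load("src/main/resources/people.csv")

// 2. Create a typed Dataset from a local collection
val fromCollection = spark.createDataset(Seq(Person("Ada", 36), Person("Alan", 41)))

// 3. Create a DataFrame from an RDD + schema, then convert to a typed Dataset
val rdd = spark.sparkContext.parallelize(Seq(Row("Ada", 36), Row("Alan", 41)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)))
val fromRdd = spark.createDataFrame(rdd, schema).as[Person]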
OPERATION TYPES
Two types of Spark operations on a Dataset:
• Transformations: lazily evaluated (not computed immediately)
• Actions: trigger the computation and return a value
[Diagram] High-Level Spark Data Processing Pipeline: Data -> Transformations -> DataSet -> Actions -> Value
(Source -> Transformation Operators -> Sink)
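A short sketch of the two operation types (assumes an existing SparkContext named sc):

// assumes: val sc = sparkSession.sparkContext
val numbers = sc.parallelize(1 to 1000)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: lazy, no job yet
val doubled = evens.map(_ * 2)             // transformation: still lazy
val total   = doubled.count()              // action: triggers the job, returns a value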
TRANSFORMATIONS’ CLASSIFICATIONS
[Diagram] Narrow Transformations (1 x 1): each partition of RDD 1 feeds exactly one partition of RDD 2.
[Diagram] Wide Transformations (Shuffles, 1 x n): one partition of RDD 1 can feed multiple partitions of RDD 2.
Shuffles require:
- Disk I/O
- Ser/De
- Network I/O
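An illustrative sketch (assumes a SparkContext named sc; map is narrow, reduceByKey is wide):

// assumes: val sc = sparkSession.sparkContext
val words  = sc.parallelize(Seq("spark", "flink", "spark", "heron"))
val pairs  = words.map(w => (w, 1))        // narrow: stays within each partition
val counts = pairs.reduceByKey(_ + _)      // wide: shuffle repartitions data by key
println(counts.toDebugString)              // prints the lineage with the shuffle boundary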
JOB EXECUTION FLOW
[Diagram] A Job (Job I) is divided into Stages, and each Stage is submitted as a TaskSet (TaskSet I, II, III) of Tasks; the Driver schedules these Tasks onto the partitions held by Workers 1..N across the cluster.
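A small sketch of the flow (assumes a SparkContext named sc; distinct() introduces a shuffle, so its job splits into two stages):

// assumes: val sc = sparkSession.sparkContext
val data = sc.parallelize(1 to 10, numSlices = 4)  // 4 partitions -> 4 tasks per stage
data.count()              // Job 1: no shuffle, so a single stage / TaskSet
data.distinct().count()   // Job 2: distinct() shuffles, so two stages / TaskSets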
EXECUTION TIERS
=> The main program is executed on the Spark Driver
=> Transformations are executed on the Spark Workers
=> An Action returns the results from the Workers to the Driver

val wordCountTuples: Array[(String, Int)] = sparkSession.sparkContext
  .textFile("src/main/resources/vivaldi_life.txt")
  .flatMap(_.split(" "))                 // transformation: split lines into words
  .map(word => (word, 1))                // transformation: pair each word with 1
  .reduceByKey(_ + _)                    // wide transformation: sum counts per word
  .collect()                             // action: bring results back to the driver
wordCountTuples.foreach(println)
JOB SCHEDULING
• Single Application
  • FIFO
  • FAIR
• Across Applications
  • Static Allocation
  • Dynamic Allocation (Auto / Elastic Scaling), see the sketch after this list
    • spark.dynamicAllocation.enabled
    • spark.dynamicAllocation.executorIdleTimeout
    • spark.dynamicAllocation.initialExecutors
    • spark.dynamicAllocation.minExecutors
    • spark.dynamicAllocation.maxExecutors
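A minimal sketch of enabling dynamic allocation when building the session (values are illustrative; dynamic allocation on YARN also needs the external shuffle service):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.initialExecutors", "2")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .config("spark.shuffle.service.enabled", "true") // required on YARN for dynamic allocation
  .getOrCreate()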
SPARK ON YARN
REAL LIFE USE CASES
DataEditor: Data Transformation Platform
PROD TOPOLOGY
• Running on 5 DCs (Private Cloud & AWS)
• Running with 19 Nodes
• ~400K submitted Spark jobs per week
• ~1.5M HTTP calls per week
SPARK RESOURCES
• 8 GB Memory per Executor
• 5 Cores per Executor
• 4 Executors per SparkContext
PROD STACK
DATAEDITOR SERVICES
[Diagram] DataEditor Cluster: UI -> Micro Services (DataGrid Service, File Service, Validation Service)

HTTP Operation   Description
/grids           Create new grid
/count           Get grid count
/content         Get grid content
/save            Save grid to HDFS/S3
/join            Join multiple grids
/groupBy         Group related columns
…
DATA TRANSFORMATIONS
[Diagram] Grid Revision I -> Transformations -> Grid Revision II
Transformations: AddColumn, UpdateColumn, DeleteColumn, RenameColumn, DeleteRow

SINGLE RULE EXECUTION (User Defined)
[Diagram] Same Template: Source Grids 1..N -> Mapping Rules (Mappings) -> Target Grids 1..N

BATCH RULES EXECUTION (User Defined)
[Diagram] Different Template: Source Grid I -> Mapping Rules I -> Target Grid I
REFERENCES
• https://spark.apache.org/docs/latest/
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• https://stackoverflow.com/questions/36215672/spark-yarn-architecture
• https://pulsar.apache.org/
Q & A